MultiModal CoT – Paper of the Day

The following is about the paper that most appealed to me today: Multimodal Chain-of-Thought Reasoning in Language Models

This is a very interesting paper claiming superhuman performance on a specific task with a relatively small language model. Of course, "small" here means small by current standards; models of this size would have been considered massive only a handful of years ago. The paper uses a dataset of scientific questions and asks a model to generate both an answer and a rationale for that answer. Each question consists of a textual question, an image, and a textual caption of that image.

Previous approaches covered in the paper used only the question text and the caption, ignoring the image itself, since they were text-only language models. These fared far worse than the paper's model, which includes a vision transformer that takes the image into account. The likely reasons for this gap are discussed in depth and, I felt, handled well. Overall I think this is a fantastic paper, and while I unfortunately don't have a lot of time to go into it today, I would highly recommend it.
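To make the idea concrete, here is a minimal sketch of how vision features from an image encoder might be fused with a language model's text features via gated cross-attention. This is my illustrative reading, not the paper's exact architecture: the single attention head, the fixed dimensions, and the dot-product gate are all simplifying assumptions, and real models would use learned projection weights throughout.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(text_feats, vision_feats):
    """Gated fusion sketch: each text token attends over the vision
    patches, then a sigmoid gate mixes the attended vision signal
    back into the text representation.
    Shapes: text_feats (n_tokens, d), vision_feats (n_patches, d).
    """
    d = text_feats.shape[-1]
    # single-head cross-attention (illustrative; no learned projections)
    scores = text_feats @ vision_feats.T / np.sqrt(d)      # (n_tokens, n_patches)
    attended = softmax(scores, axis=-1) @ vision_feats     # (n_tokens, d)
    # per-token gate deciding how much vision signal to admit
    gate = 1.0 / (1.0 + np.exp(-(text_feats * attended).sum(-1, keepdims=True)))
    return (1.0 - gate) * text_feats + gate * attended     # (n_tokens, d)

rng = np.random.default_rng(0)
H = fuse(rng.normal(size=(5, 16)), rng.normal(size=(9, 16)))
print(H.shape)  # (5, 16)
```

The fused representation has the same shape as the text features, so it can drop into the rest of a text-to-text pipeline unchanged; that property is what makes this style of fusion attractive for retrofitting a language model with vision input.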

The availability of the code behind this paper would make it highly attractive for any data scientist dealing with similar problems where images are available. One could also imagine building a language model that gives better answers when presented with simple diagrams of the problem it is trying to solve.

My simplistic rating system is below:

Explanation: 8/10
Novelty: 7/10
Breakthrough: 4/10
Interest: 9/10
