The rapid evolution of artificial intelligence has transformed once-futuristic concepts into everyday realities. Yet as AI systems become more integrated into our lives, they are increasingly expected to tackle complex, real-world problem-solving tasks. While Vision Language Models (VLMs) like GPT-4o have demonstrated remarkable capabilities, they still struggle with tasks that require multi-step reasoning over combined visual and textual information.
At Redblock, we've explored these capabilities by developing custom benchmarks to test LLMs and VLMs. Our latest benchmark, PARROT-360V, is designed to evaluate a VLM's ability to solve complex, sequential puzzles that mirror real-world challenges. The results give a current snapshot of where these models stand and inform our decisions as we consider deploying them in production.
Current Limitations in Multimodal Reasoning
Traditional benchmarks for VLMs often focus on tasks like object recognition or simple question-answering over visual inputs. While these evaluations are valuable, they rarely probe a model's ability to perform multi-step reasoning that combines visual and textual elements.
Models excel at recognizing objects or generating descriptions, but they often struggle to synthesize information across modalities coherently and sequentially.
For instance, existing benchmarks like MMMU and ChartQA primarily test models on image-text alignment or structured data interpretation. These tasks are important, but they fall short of the complexity of real-world problems, where visual and textual clues must be combined and reasoned over step by step.
Exploring Multimodal Reasoning with Jumble Puzzles
To further advance the evaluation of VLMs, we introduced a benchmark based on Jumble puzzles: word puzzles that require unscrambling letters to form meaningful words, then using specific letters from those words to solve a second scrambled word. The clue for the final word is an image that provides a rich visual hint, often drawing on trivia, pop culture, art, humor, and nuanced language features such as homonyms, phrases, scare quotes, and hyphenated half-words. This type of puzzle is inherently sequential and demands high-level multimodal reasoning and human-like thinking from a VLM.
Take a look at this sample Jumble puzzle from our dataset.
Let’s solve this puzzle together step-by-step.
Solving the Jumble Puzzle
Step 1: Unscramble the four words.
| Scrambled Word | Unscrambled Word (Answers) |
|---|---|
| V O I D E | V I D E O |
| U I S E T | S U I T E |
| V E L T E W | T W E L V E |
| R H H U C C | C H U R C H |
Table: Scrambled and Unscrambled Words
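Programmatically, Step 1 is a classic anagram lookup. Here is a minimal Python sketch (illustrative only, not our evaluation pipeline) that indexes a word list by its sorted letters, so every scrambled word maps straight to its dictionary anagrams; the four-word `WORDS` list stands in for a full English dictionary:

```python
from collections import defaultdict

# Stand-in dictionary; a real solver would load a full English word list.
WORDS = ["video", "suite", "twelve", "church"]

# Index every word by its sorted letters, so all anagrams share one key.
anagrams = defaultdict(list)
for word in WORDS:
    anagrams["".join(sorted(word))].append(word)

def unscramble(scrambled: str) -> list[str]:
    """Return dictionary words that use exactly the scrambled letters."""
    key = "".join(sorted(scrambled.lower()))
    return anagrams.get(key, [])

print(unscramble("VOIDE"))   # ['video']
print(unscramble("VELTEW"))  # ['twelve']
```

Sorting the letters gives a canonical key, so "VOIDE" and "VIDEO" land in the same bucket regardless of scramble order.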
Step 2: Collect the letters that fall into the circles in the puzzle image.
So we collect the letters: V D I T E T E E C H
Step 3: Unscramble the letters from Step 2 to form the final answer based on the visual hint.
This is where it gets tricky: the final answer is not a real word in the English dictionary. Unscrambling the collected letters with the cartoon's hint in mind gives the final answer: DE-TECH-TIVE, a tech-themed play on "detective".
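As a quick sanity check, the circled letters from Step 2 must be exactly a rearrangement of the final answer once the hyphens are dropped. A short multiset comparison confirms it:

```python
from collections import Counter

circled = "VDITETEECH"   # letters collected in Step 2
final = "DE-TECH-TIVE"   # bonus answer from Step 3

# Ignoring hyphens and case, the two letter multisets must match exactly.
assert Counter(circled.lower()) == Counter(final.replace("-", "").lower())
print("The circled letters rearrange into the final answer.")
```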
Can an AI figure this out, triangulating from the input letters and the rich visual cartoon that serves as the clue?
Evaluation: Insights from Testing State-of-the-Art VLMs
We tasked state-of-the-art VLMs with solving such puzzles in a single linear flow, providing images that contain both textual and visual clues, along with prompts that guide them through the solution.
The prompts included instructions such as:
- Plan an approach to the task before solving the puzzle.
- Extract the letters of the scrambled words.
- Unscramble the letters to form a word.
- Identify the circled letters to be used later in the final word/phrase.
- Extract meaning from the visual hint (the cartoon on the right-hand side of the image).
- Synthesize the circled letters into the correct word/phrase that solves the puzzle.
This process tests the model's ability to perform optical character recognition (OCR), linguistic unscrambling, visual interpretation, and sequential reasoning—all within a single task.
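For illustration, here is a minimal sketch of how one such puzzle could be sent to GPT-4o using the OpenAI Python SDK. The condensed prompt wording, the `puzzle.png` file name, and the single-call setup are assumptions for the example; our actual pipeline and prompts differ in detail:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Condensed version of the step-by-step instructions described above.
PROMPT = (
    "Plan your approach before solving the puzzle. Then: "
    "1) extract the letters of each scrambled word, "
    "2) unscramble each one into a real word, "
    "3) identify the circled letters, "
    "4) interpret the cartoon on the right-hand side as the visual hint, "
    "5) combine the circled letters into the final word or phrase."
)

with open("puzzle.png", "rb") as f:  # illustrative file name
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```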
We evaluated several state-of-the-art VLMs, including GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, on the PARROT-360V benchmark.
The following table shows how each model performed on the sample puzzle (Image 1).
| VLM | Answer 1 | Answer 2 | Answer 3 | Answer 4 | Reasoning | Characters Identified | Final Answer |
|---|---|---|---|---|---|---|---|
| GPT-4o | VIDEO | SUITE | WELT | CHURCH | Unscramble the jumbled words to get VIDEO, SUITE, WELT, and CHURCH. The circled letters are V, I, T, and E. Concatenate these letters to form the string 'VITE'. Unscramble 'VITE' to get the final bonus answer 'INVESTIGATOR'. | VITE | INVESTIGATOR |
| Gemini-1.5-Pro | VIDEO | SUITE | TWEEVE | HUNCH | The circled letters from the answers are: V O I E E V H N. Combining these letters forms the jumbled word: VOIEEVEHN. Unscrambling this gives the final answer. | VOIEEVEHN | ENHIEVOE |
| Claude-3.5-Sonnet | DIOVDE | SUITE | VELVET | CHURCH | The bonus answer is 'DETECTIVE'. The cartoon shows a company trying to figure out who hacked their email server. Hiring a detective to solve the crime of who hacked the server fits the scenario. | DITETEECH | DETECTIVE |
| Llama-3.2-Vision-90B | DOMED | FUNA | QUINTA | DORP | The circled letters are D, U, I, and P. These letters are in the order they appear in the unscrambled words. | DUIP | PUDI |
Table: VLM Answers and Reasoning for the sample puzzle.
We provided a diverse set of 2,300 Jumble puzzles to the VLMs for this reasoning task. The results indicate that while the models performed well at unscrambling individual words, they struggled to accurately extract the circled letters and synthesize them into the bonus answer implied by the visual hint. Hallucinations were a recurring issue, particularly in the steps involving visual interpretation of the cartoon and selection of the appropriate letters.
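To make the step-wise evaluation concrete, here is a simplified scoring sketch; the record fields and normalization rules are illustrative assumptions, not the exact PARROT-360V harness:

```python
def score(puzzles: list[dict]) -> dict[str, float]:
    """Per-step accuracy over a list of puzzle records (illustrative schema)."""
    def norm(s: str) -> str:
        return s.replace("-", "").lower()

    steps = {"unscramble": 0, "circled_letters": 0, "final_answer": 0}
    for p in puzzles:
        steps["unscramble"] += p["model_words"] == p["true_words"]
        steps["circled_letters"] += sorted(p["model_letters"]) == sorted(p["true_letters"])
        steps["final_answer"] += norm(p["model_final"]) == norm(p["true_final"])
    return {k: v / len(puzzles) for k, v in steps.items()}

# Claude-3.5-Sonnet's row from the table above: close on the final answer
# ('DETECTIVE' vs. 'DE-TECH-TIVE'), but no step matches exactly.
example = [{
    "model_words": ["DIOVDE", "SUITE", "VELVET", "CHURCH"],
    "true_words":  ["VIDEO", "SUITE", "TWELVE", "CHURCH"],
    "model_letters": list("DITETEECH"),
    "true_letters":  list("VDITETEECH"),
    "model_final": "DETECTIVE",
    "true_final":  "DE-TECH-TIVE",
}]
print(score(example))  # each step scores 0.0 for this single example
```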
The following chart shows the results.
Introducing PARROT-360V: Enhancing Benchmarking for VLMs
PARROT-360V offers a benchmark that closely mirrors the complexities of real-world tasks requiring multimodal reasoning. Here's why it stands out:
- Sequential Reasoning: It evaluates the model's ability to follow a multi-step process, similar to how humans solve complex problems.
- Multimodal Integration: It requires synthesizing textual and visual information, pushing models beyond simple recognition tasks.
- Real-World Applicability: The tasks simulate scenarios that VLMs may encounter in practical applications, such as autonomous systems or advanced data analysis tools.
By engaging models with this comprehensive evaluation, PARROT-360V provides valuable insights into their real-world problem-solving capabilities.
Implications: Opportunities for Enhancing Performance
The findings from PARROT-360V can be summarized as follows:
- Current Performance and Areas for Growth: The gap between PARROT-360V results and traditional benchmarks like MMMU (where GPT-4o scored 69%) shows how much room current VLMs still have on complex, sequential tasks. For comparison, the table below shows the performance of these models across several standard benchmarks.
- Advancing Reasoning Capabilities: These findings point to opportunities for improving architectures and training methodologies to strengthen multi-step reasoning and multimodal integration.
- Reducing Hallucinations: The models' tendency to generate unsupported information, particularly during visual reasoning, highlights the need to reduce reliance on memorized data and improve genuine task comprehension.
| Model | MMMU | Mathvista | AI2D | ChartQA | Avg Performance |
|---|---|---|---|---|---|
| GPT-4o | 0.69 | 0.64 | 0.94 | 0.86 | 0.78 |
| Claude-3.5-Sonnet | 0.68 | 0.68 | 0.95 | 0.91 | 0.80 |
| Gemini-1.5-Pro | 0.62 | 0.64 | 0.81 | 0.81 | 0.72 |
| Llama-3.2-Vision-90B | 0.63 | 0.57 | 0.92 | 0.85 | 0.74 |
Table: Model Performance Across Various Benchmarks
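As a quick check, the Avg Performance column is simply the mean of the four benchmark scores:

```python
# Benchmark scores in order: MMMU, Mathvista, AI2D, ChartQA.
scores = {
    "GPT-4o":               [0.69, 0.64, 0.94, 0.86],
    "Claude-3.5-Sonnet":    [0.68, 0.68, 0.95, 0.91],
    "Gemini-1.5-Pro":       [0.62, 0.64, 0.81, 0.81],
    "Llama-3.2-Vision-90B": [0.63, 0.57, 0.92, 0.85],
}
for model, s in scores.items():
    # Prints each model's mean score (e.g. ~0.78 for GPT-4o); these agree
    # with the Avg Performance column once rounded to two decimals.
    print(f"{model}: {sum(s) / len(s):.3f}")
```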
Looking Forward: Enhancing VLM Capabilities
To fully realize their potential in automation and other real-world applications, VLMs need to advance in several key areas:
- Improved Training Data: Incorporating datasets that emphasize sequential and multimodal reasoning will enable models to handle these tasks even more effectively.
- Architectural Innovations: Developing models that effectively manage long-term dependencies and cross-modal interactions will further enhance their capabilities.
- Robust Evaluation: Continued use and development of benchmarks like PARROT-360V will help track progress and guide further enhancements.
At Redblock, we're committed to refining PARROT-360V and plan to release it to the community, encouraging collaborative efforts to enhance VLM capabilities.
Conclusion
The journey toward AI systems that can seamlessly integrate into complex, real-world scenarios is ongoing. Benchmarks like PARROT-360V are vital in showcasing areas for growth and guiding future developments. By encouraging models to think, reason, and solve problems as humans do, we can push the boundaries of what's possible in AI and move closer to systems that not only automate tasks but also understand and navigate the complexities of the real world.