Introduction
Large Language Models (LLMs) have become pivotal in driving advancements across various fields, from customer service to data analysis. As their influence grows, so does the need to ensure these models perform optimally across different tasks. We’ve discussed this at great length before; if you’re new to LLM benchmarking, you might find it helpful to start here.
Why Benchmarking Matters in LLM Development
Benchmarking provides a structured way to assess and compare LLMs, allowing us to understand their strengths, weaknesses, and suitability for specific applications. At Redblock, we've taken this further by introducing a new dataset that pushes the limits of LLM benchmarking, particularly in question-answering (QA) tasks [1].
We believe benchmarking LLMs should be about more than just assessing accuracy; it’s about understanding how well a model can adapt to different contexts and respond to unforeseen challenges. Our benchmarking framework, Performance Assessment of Reasoning and Responses On Trivia (PARROT), focuses on several key metrics beyond simple accuracy, which will be covered later in this blog. Together, these metrics provide a multi-faceted view of how LLMs operate under various conditions, offering deeper insights into their real-world applicability.
Why do we need a new benchmark?
Existing benchmarks often fall short of providing a comprehensive evaluation of LLMs for several reasons:
- Over-Reliance on Simple Metrics: Many benchmarks heavily rely on a single metric like accuracy or F1-score. While these metrics are useful, they can lead to models that are finely tuned for specific types of errors or tasks but may not generalize well across different applications [2][3].
- Surface-Level Evaluation: Traditional benchmarks evaluate models on tasks that may not fully challenge their reasoning abilities or adaptability. Such benchmarks often over-represent simpler tasks, allowing models to score well without truly testing their depth of understanding [2][3][4].
- Lack of Contextual Understanding: Many existing benchmarks do not adequately test contextual understanding or adaptability, focusing instead on tasks that require straightforward, fact-based answers. This gap means that models might perform well in controlled benchmark environments but struggle in more dynamic, real-world situations [3][4].
- Few-Shot Learning Bias: Many widely used benchmarks rely on few-shot learning, which can give a false sense of a model’s capabilities. While few-shot learning allows models to adapt to new tasks with minimal examples, it doesn’t necessarily reflect their inherent understanding or reasoning skills. This can lead to overestimating a model’s performance in zero-shot or real-world scenarios where such guidance isn’t available [5].
PARROT: A New Dimension in QA Benchmarking
At Redblock, we’ve gone beyond traditional metrics like Exact Match and F1 Score to evaluate LLM performance in unique contexts. PARROT includes two metrics, the Millionaire Metric and the Jeopardy Metric, both tailored to capture the intricacies of question-answering (QA) in formats inspired by popular television game shows. The aim was to create a benchmark that measures accuracy while also accounting for factors like question difficulty and response correctness in a way that reflects real-world challenges.
The Millionaire Metric focuses on scenarios where questions are progressively more difficult and require precise, often fact-based answers, simulating environments where each decision carries significant weight. On the other hand, the Jeopardy Metric assesses how well models can handle questions that require deeper reasoning, contextual understanding, and the ability to navigate ambiguity, reflecting situations where flexibility and the ability to infer from incomplete information are of utmost importance.
Designing the Millionaire Metric
The Millionaire Metric was inspired by the format of "Who Wants to Be a Millionaire?" where contestants answer questions with increasing difficulty [6]. The core idea behind this metric is that not all questions should be weighted equally; answering a more difficult question correctly should contribute more to the overall score than answering an easier one. Here’s how we designed this metric:
- Question Difficulty and Elimination Ratio
- We categorized questions into 15 levels, each corresponding to the difficulty typically seen in the game show. Questions at higher levels are more challenging, so an LLM’s performance at these levels should be given more weight.
- To quantify this, we calculated an elimination ratio for each question level, which reflects how many contestants were eliminated at each game stage.
- This ratio helps determine the weight of each level in the final score.
- Weight Coefficients
- We standardized these elimination ratios across all seasons of the show to develop a weight coefficient for each question level.
- This coefficient ensures that the contribution of a correct answer to the final score is proportional to the difficulty of the question.
- Performance Calculation
- The final Millionaire Metric score is calculated by multiplying the weight coefficient of each level by the LLM’s accuracy at that level. The sum of these products across all 15 levels gives the overall performance score.
- This approach allows us to measure how well an LLM performs under increasing pressure, mimicking the real-world stakes of decision-making under uncertainty. A minimal sketch of the calculation follows this list.
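To make the calculation concrete, here is a minimal sketch in Python. The elimination counts, per-level accuracies, and the simple normalization used to turn elimination ratios into weights are all illustrative assumptions rather than the actual show statistics or the exact PARROT implementation; only the overall structure (difficulty-based weights multiplied by per-level accuracy and summed over the 15 levels) follows the description above.

```python
from typing import Sequence

def millionaire_weights(eliminated_per_level: Sequence[int]) -> list:
    """Turn per-level elimination counts into weight coefficients.

    Levels at which more contestants were eliminated are treated as harder,
    so they receive a larger share of the total weight. Here the weights are
    simply normalized to sum to 1 across all 15 levels (an assumption).
    """
    total = sum(eliminated_per_level)
    return [count / total for count in eliminated_per_level]

def millionaire_score(accuracy_per_level: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted sum of per-level accuracy: harder levels contribute more."""
    return sum(w * acc for w, acc in zip(weights, accuracy_per_level))

# Illustrative numbers only (not the real show statistics or model results):
eliminations = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
accuracy = [0.95, 0.93, 0.90, 0.88, 0.85, 0.82, 0.78, 0.75,
            0.70, 0.66, 0.62, 0.58, 0.52, 0.45, 0.40]  # 15 levels

weights = millionaire_weights(eliminations)
print(round(millionaire_score(accuracy, weights), 2))
```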
| LLM | Millionaire Score |
| --- | --- |
| GPT-4o | 0.75 |
| Claude-3.5-Sonnet | 0.75 |
| Mistral (7.2B) | 0.67 |
| Gemini-flash-1.5 | 0.61 |
| Gemma-2 (9B) | 0.57 |
| Llama-3.1 (8B) | 0.54 |
| Qwen-2 (7.6B) | 0.55 |
| Llama-3 (8B) | 0.50 |
| Gemma (7B) | 0.42 |
| Phi-3 (3.8B) | 0.30 |
| Phi-3.5-mini (3.8B) | 0.27 |

Table: Millionaire Scores for Various Models
Designing the Jeopardy Metric
The Jeopardy Metric was developed to assess how well LLMs can handle questions that require deeper reasoning and the ability to manage ambiguity. Jeopardy’s unique format, where clues are presented with varying monetary values, served as the foundation for this metric. Here’s the design process:
- Difficulty Levels Based on Game Rounds
- Jeopardy is divided into three rounds: Jeopardy, Double Jeopardy, and Final Jeopardy [7].
- Each round has clues with different difficulty levels, reflected by their monetary value.
- In our metric, we associated these levels with specific coordinates on the game board and their respective rounds, creating a structured difficulty gradient from 0 (easiest) to 11 (most difficult).
- Handling Non-Elimination
- Unlike the Millionaire show, Jeopardy does not eliminate contestants during the game, which means difficulty must be assessed differently.
- We used the difficulty level of each clue and the associated category to gauge how challenging a question is, independent of contestant elimination.
- Performance at Each Level
- Similar to the Millionaire Metric, the Jeopardy Metric evaluates LLM performance by calculating accuracy at each difficulty level.
- However, the weighting is more nuanced, reflecting the varying difficulty within a single game round.
- The final Jeopardy Metric score is the sum of the weighted accuracies across all levels, providing a comprehensive view of the LLM’s ability to handle complex and ambiguous questions. A sketch of one possible level mapping follows this list.
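The snippet below sketches how clues might be mapped onto the 0 to 11 difficulty gradient and then scored. The board-coordinate mapping and the function names are illustrative assumptions only; the published PARROT mapping and weight coefficients may differ.

```python
from typing import Sequence

def jeopardy_level(round_name: str, row: int) -> int:
    """Map a clue to a difficulty level in 0..11 (illustrative mapping only).

    Assumption: the five clue rows of the Jeopardy round map to levels 0-4,
    the five rows of Double Jeopardy to levels 5-9, and Final Jeopardy to
    level 11, leaving intermediate levels for special clues. The actual
    PARROT mapping from board coordinates may differ.
    """
    if round_name == "Final Jeopardy":
        return 11
    base = 0 if round_name == "Jeopardy" else 5
    return base + (row - 1)

def jeopardy_score(accuracy_per_level: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Weighted sum of per-level accuracy across the 12 difficulty levels."""
    return sum(w * acc for w, acc in zip(weights, accuracy_per_level))

# Example: the bottom-row clue of the opening round
level = jeopardy_level("Jeopardy", row=5)  # -> 4 under this illustrative mapping
```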
| Model | Jeopardy Score |
| --- | --- |
| GPT-4o | 0.86 |
| Claude-3.5-Sonnet | 0.85 |
| Gemini-flash-1.5 | 0.73 |
| Gemma-2 (9B) | 0.66 |
| Mistral (7.2B) | 0.63 |
| Llama-3.1 (8B) | 0.58 |
| Llama-3 (8B) | 0.56 |
| Qwen-2 (7.6B) | 0.50 |
| Gemma (7B) | 0.47 |
| Phi-3 (3.8B) | 0.37 |
| Phi-3.5-mini (3.8B) | 0.11 |

Table: Jeopardy Scores for Various Models
The PARROT Score is a composite metric that reflects an LLM's performance across two subsets: PARROT-Jeopardy and PARROT-Millionaire. These non-overlapping datasets are uniquely designed to align with different types of QA tasks in Natural Language Processing (NLP). The mean performance of an LLM over these distinct subsets serves as the PARROT Score, providing a holistic view of the model's QA capabilities.
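Concretely, and assuming a simple arithmetic mean (which matches the published numbers), the composite score reduces to:

```python
def parrot_score(millionaire: float, jeopardy: float) -> float:
    """PARROT Score: mean of the PARROT-Millionaire and PARROT-Jeopardy scores."""
    return (millionaire + jeopardy) / 2

# GPT-4o: (0.75 + 0.86) / 2 = 0.805, reported as 0.81 in the table below
print(parrot_score(0.75, 0.86))
# Claude-3.5-Sonnet: (0.75 + 0.85) / 2 = 0.80
print(parrot_score(0.75, 0.85))
```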
| Model | PARROT | MMLU | IFeval | BBH | GPQA |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | 0.81 | 0.88 | 0.90 | UNK | 0.53 |
| Claude-3.5-Sonnet | 0.80 | 0.81 | 0.92 | 0.93 | 0.59 |
| Gemini-flash-1.5 | 0.67 | 0.79 | UNK | 0.88 | 0.59 |
| Mistral (7.2B) | 0.65 | 0.71 | 0.54 | 0.25 | 0.28 |
| Gemma-2 (9B) | 0.62 | 0.69 | 0.74 | 0.42 | 0.36 |
| Llama-3.1 (8B) | 0.56 | 0.68 | 0.78 | 0.30 | 0.27 |
| Llama-3 (8B) | 0.53 | 0.68 | 0.47 | 0.26 | 0.29 |
| Qwen-2 (7.6B) | 0.53 | 0.69 | 0.56 | 0.37 | 0.31 |
| Gemma (7B) | 0.45 | 0.64 | 0.38 | 0.12 | 0.29 |
| Phi-3 (3.8B) | 0.33 | 0.69 | 0.54 | 0.37 | 0.34 |
| Phi-3.5-Mini (3.8B) | 0.19 | 0.69 | 0.57 | 0.36 | 0.33 |

Table: PARROT Benchmark Scores
How is PARROT different?
PARROT sets itself apart by testing LLMs not only on direct, factual questions but also on those with implicit challenges embedded in how a question is posed. Many existing benchmarks focus on explicit knowledge retrieval, where the answer is a direct fact, but trivia questions often involve implicit reasoning, such as understanding wordplay, interpreting subtle hints, or drawing connections between seemingly unrelated concepts. For instance, in PARROT-Jeopardy, questions are often framed in reverse, requiring the model to interpret the clue and supply the correct question. Similarly, in PARROT-Millionaire, questions can require choosing among closely related answer options [8]. PARROT thus uniquely tests an LLM's ability to handle more complex and less structured queries than traditional benchmarks do.
1. Novelty in Curation of Weights and Coefficients
PARROT introduces a unique approach by assigning weights based on question difficulty, unlike most benchmarks that treat all tasks equally.
- Millionaire Metric: Uses elimination ratios to weight questions, ensuring tougher questions contribute more to the final score.
- Jeopardy Metric: Reflects difficulty through monetary values associated with game rounds, requiring models to navigate complex reasoning tasks.
This weighted scoring offers a more realistic evaluation of a model’s ability to handle real-world challenges, and it reinforces the fact that answers have consequences in the real world, consequences that PARROT is designed to capture.
2. Size and Scope of the PARROT Dataset
With nearly 84,000 samples, PARROT provides a deep and varied evaluation, reducing the risk of score inflation seen in smaller benchmarks.
3. Rigorous Zero-Shot Evaluation
PARROT evaluates models in a zero-shot context, unlike many benchmarks that use few-shot learning. This approach ensures models are tested on their inherent reasoning skills, providing a purer measure of their true abilities without prior guidance.
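As a hypothetical illustration of the difference (these are not the actual PARROT prompt templates), a zero-shot prompt gives the model only the task and the question, while a few-shot prompt also supplies worked examples to imitate:

```python
# Hypothetical prompt templates; the real PARROT evaluation prompts may differ.
ZERO_SHOT_PROMPT = (
    "Answer the following trivia question with a short, direct answer.\n"
    "Question: {question}\n"
    "Answer:"
)

FEW_SHOT_PROMPT = (
    "Question: Which planet is known as the Red Planet?\nAnswer: Mars\n\n"
    "Question: Who wrote '1984'?\nAnswer: George Orwell\n\n"
    "Question: {question}\nAnswer:"
)

# PARROT uses the zero-shot form: the model sees only the task description and
# the question, with no worked examples to pattern-match against.
prompt = ZERO_SHOT_PROMPT.format(question="In which year did the Berlin Wall fall?")
print(prompt)
```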
In automation and other high-stakes environments, the ability to reason and handle ambiguity is not just an advantage; it's a necessity. PARROT challenges models with the kind of complex, context-dependent questions that they are likely to encounter in real-world applications, making it a more accurate and reliable benchmark for evaluating an LLM's true capabilities.
Contributing to the Community
“We believe in the power of community-driven innovation. We’re excited to share our PARROT benchmark and other metrics with the wider AI community, inviting researchers and developers to collaborate with us in pushing the boundaries of LLM evaluation.
As we continue to refine our benchmarks and develop new ones, we’re committed to staying at the forefront of AI evaluation. Our goal is to create tools and frameworks that not only assess LLM performance but also drive the next generation of AI models. We look forward to seeing how these benchmarks are adopted and adapted by the community in the years to come.”
- Redblock AI Team.
Bibliography
1. Redblock AI Team. "PARROT: Performance Assessment of Reasoning and Responses on Trivia." Redblock, 2024. https://huggingface.co/datasets/redblock/parrot
2. Beeson, L. "LLM Benchmarks." GitHub, 2023. https://github.com/leobeeson/llm_benchmarks
3. McIntosh, T. R., et al. (2024). "Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence." arXiv. https://arxiv.org/abs/2402.09880
4. Gema, A. P., et al. (2024). "Are We Done with MMLU?" arXiv. https://arxiv.org/abs/2406.04127
5. Anthropic. "Claude 3.5 Sonnet: A new chapter in AI development." https://www.anthropic.com/news/claude-3-5-sonnet
6. Wikipedia. "Who Wants to Be a Millionaire (American game show)." Wikipedia, The Free Encyclopedia, 2024. https://en.wikipedia.org/wiki/Who_Wants_to_Be_a_Millionaire_(American_game_show)
7. Encyclopaedia Britannica. "Jeopardy! (American television game show)." Encyclopaedia Britannica, 2024. https://www.britannica.com/topic/Jeopardy-American-television-game-show
8. Redblock AI. "Introducing PARROT: A New LLM Benchmark using Television Trivia Datasets." Redblock AI. https://www.redblock.ai/blog/parrot-benchmark
Thanks to Indus Khaitan, Aviral Srivastava, Basem Rizk, and Raj Khaitan for reading drafts of this.