We have been experimenting with evaluating LLMs more effectively. The standard metrics, like accuracy or F1 scores, are good for some things but don’t tell us how well models reason through tough questions—especially those involving ambiguity or context. So, we decided to take a different approach: we turned to game shows.
The Problem with Current LLM Benchmarks
Most benchmarks (frameworks to assess the performance of an LLM) focus on whether a model can get the right answer to the question posed, but that’s just one piece of the puzzle. What happens when the model has to navigate a complex or ambiguous situation, or the same question phrased differently? How does it adapt when things get harder? Those were the questions we wanted to answer, and existing benchmarks weren’t cutting it.
We needed something mimicking real-world decision-making and reasoning, not just fact recall. That’s when we thought of using game show trivia, which challenges both factual knowledge and the ability to reason through increasingly difficult questions.
What We Built: PARROT
Performance Assessment of Reasoning and Responses On Trivia (PARROT) is a combination of two datasets:
- PARROT-Millionaire – The key here is that the questions get progressively harder, and we weight them based on how many contestants historically got them wrong. It’s a way of simulating real-world pressure, where making a tough decision has a bigger impact.
- PARROT-Jeopardy – This subset is a whole different beast. It’s less about simple facts and more about reasoning through clues. We used the show’s dollar-value clues to weight each question, so tougher clues (with higher dollar values) count for more in the model’s final score.
Why Game Shows?
Game shows provide an ideal platform for testing LLMs, and here's why: they begin with relatively simple questions, gradually increasing the difficulty as the stakes rise. This structure mimics real-world decision-making, where tasks can start easy but become more complex under pressure. Game shows let us evaluate how LLMs handle a broad spectrum of questions, from straightforward to highly complex, all in real time [1][2].
It was a challenging process to convert game show data into a useful benchmark for LLMs. We had to sift through a vast amount of historical game show data, sourced from dedicated fanbase websites, to understand how contestant performance changed as questions became tougher. By analyzing where contestants tended to drop off, we were able to weight each question by its difficulty. Harder questions, which knocked out many contestants, were assigned more points. This system allows us to evaluate LLMs in a way that mirrors the increasing complexity and stakes of real-world challenges.
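To make that idea concrete, here is a minimal sketch of how per-level weights could be derived from elimination data. The data shape, function name, and numbers below are illustrative assumptions, not our actual processing pipeline.

```python
# A minimal sketch (illustrative assumptions, not the actual PARROT pipeline):
# weight each question level in proportion to how many contestants were
# eliminated at that level.
from collections import Counter

def difficulty_weights(elimination_levels: list[int]) -> dict[int, float]:
    """Map each question level to a weight proportional to the number of
    contestants eliminated there, normalized so the weights sum to 1."""
    counts = Counter(elimination_levels)              # level -> eliminations
    total = sum(counts.values())
    return {level: n / total for level, n in sorted(counts.items())}

# Hypothetical elimination levels gathered from historical episodes.
eliminations = [3, 5, 5, 7, 9, 9, 11, 12, 12, 13, 14, 14, 15, 15]
print(difficulty_weights(eliminations))   # harder levels end up carrying more weight
```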
What’s truly interesting about our approach is that it doesn’t just measure an LLM’s ability to answer fact-based questions—it tests how well models can navigate increasing pressure and complexity, just like contestants on a game show.
It’s the blend of increasing difficulty and real-time pressure that makes game shows an ideal framework for testing LLMs—just one of the many feathers in PARROT’s cap.
PARROT’s Features and Feathers
PARROT captures a broad range of question types and difficulty levels, offering a well-rounded assessment of an LLM's abilities [3][4]. Let’s look at its features, a.k.a. feathers, which make PARROT work as a benchmark:
The Millionaire Metric: Real-World Decision-Making Under Pressure
The PARROT-Millionaire Metric assigns weights to questions based on their difficulty. Harder questions carry more weight, simulating the impact of high-pressure decision-making [5].
Using past game show data, we identified how many contestants were eliminated at each question level—the more difficult the question, the higher its weight and contribution to the overall score.
This approach ensures that LLMs are rewarded for solving tougher challenges, similar to real-life scenarios where success in difficult tasks holds more value. The weights were standardized across all seasons, ensuring fairness. By applying these weights to a model's accuracy, we generate a score that prioritizes solving complex problems, not just easy ones.
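As a rough illustration, this scoring idea can be sketched as a weighted accuracy. The weights and results below are made up for the example; this is not the exact scoring code.

```python
# A minimal sketch of weighted accuracy in the spirit of the PARROT-Millionaire
# Metric. The weights and outcomes are made up for illustration.

def millionaire_score(results: list[tuple[float, bool]]) -> float:
    """results: (weight, answered_correctly) pairs, one per question.
    Returns the fraction of available credit the model actually earned."""
    earned = sum(weight for weight, correct in results if correct)
    available = sum(weight for weight, _ in results)
    return earned / available if available else 0.0

# Hypothetical run: the model clears the easy questions but misses the hardest one.
run = [(0.05, True), (0.10, True), (0.25, True), (0.60, False)]
print(millionaire_score(run))   # 0.40 -- a hard miss costs far more than an easy one
```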
The Jeopardy Metric: Navigating Ambiguity and Complexity
The PARROT-Jeopardy Metric assesses how well models handle ambiguity and complexity, assigning higher weights to clues with higher dollar values.
For example, a $1,000 clue is harder than a $200 clue. We incorporated this concept into the PARROT-Jeopardy Metric by weighting clues according to their value. This ensures that models that solve harder questions receive a proportionally higher score. This metric captures a model's ability to handle complexity, going beyond basic, fact-based answers [5].
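A minimal sketch of that weighting, with made-up clue values and outcomes, might look like this:

```python
# A minimal sketch of dollar-value weighting in the spirit of the
# PARROT-Jeopardy Metric. Clue values and outcomes are made up for illustration.

def jeopardy_score(results: list[tuple[int, bool]]) -> float:
    """results: (dollar_value, answered_correctly) pairs, one per clue.
    Returns dollars earned as a fraction of dollars available."""
    earned = sum(value for value, correct in results if correct)
    available = sum(value for value, _ in results)
    return earned / available if available else 0.0

# Hypothetical category: missing the $1,000 clue costs five times as much
# as missing the $200 clue would.
category = [(200, True), (400, True), (600, True), (800, False), (1000, False)]
print(jeopardy_score(category))   # 0.40
```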
The overall PARROT Score combines the results from both metrics, offering a composite view of a model's ability to handle both straightforward and complex tasks (a minimal sketch of one way to combine them follows the table below):
| Model | PARROT | MMLU | IFEval | BBH | GPQA |
|---|---|---|---|---|---|
| GPT-4o | 0.81 | 0.88 | 0.90 | UNK | 0.53 |
| Claude-3.5-Sonnet | 0.80 | 0.81 | 0.92 | 0.93 | 0.59 |
| Gemini-flash-1.5 | 0.67 | 0.79 | UNK | 0.88 | 0.59 |
| Mistral (7.2B) | 0.65 | 0.71 | 0.54 | 0.25 | 0.28 |
| Gemma-2 (9B) | 0.62 | 0.69 | 0.74 | 0.42 | 0.36 |
| Llama-3.1 (8B) | 0.56 | 0.68 | 0.78 | 0.30 | 0.27 |
| Llama-3 (8B) | 0.53 | 0.68 | 0.47 | 0.26 | 0.29 |
| Qwen-2 (7.6B) | 0.53 | 0.69 | 0.56 | 0.37 | 0.31 |
| Gemma (7B) | 0.45 | 0.64 | 0.38 | 0.12 | 0.29 |
| Phi-3 (3.8B) | 0.33 | 0.69 | 0.54 | 0.37 | 0.34 |
| Phi-3.5-Mini (3.8B) | 0.19 | 0.69 | 0.57 | 0.36 | 0.33 |

Table: PARROT Benchmark Scores
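As noted above, here is a minimal sketch of how the two subscores could be folded into a single PARROT Score. The equal 50/50 weighting is an illustrative assumption, not the published formula.

```python
# A minimal sketch of the composite score. The equal 50/50 weighting is an
# illustrative assumption, not necessarily how the published PARROT Score
# combines its two subscores.

def parrot_score(millionaire: float, jeopardy: float) -> float:
    """Fold the two weighted subscores (each in [0, 1]) into one composite."""
    return 0.5 * millionaire + 0.5 * jeopardy

print(parrot_score(0.70, 0.60))   # 0.65
```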
Did you notice something in the scores above? Even if you are not an AI wizard, these results make it clear that models still have a long way to go when it comes to reasoning under pressure, especially in scenarios where answers have consequences. We realized that accuracy alone isn’t enough; models also need to demonstrate adaptability and complex reasoning, and PARROT is designed to address that gap.
What’s Next for PARROT?
We’re releasing PARROT to the community and believe it’s an important step forward in LLM evaluation. We’d love to see how other models perform and where you, as part of the community, can take this idea. We’re also curious to hear your thoughts on how we can keep improving the way LLMs are evaluated.
Bibliography
1. "Jeopardy!" In Encyclopaedia Britannica. Retrieved August 13, 2024, from https://www.britannica.com/topic/Jeopardy-American-television-game-show.
2. "Who Wants to Be a Millionaire (American game show)." In Wikipedia: The Free Encyclopedia. Retrieved August 13, 2024, from https://en.wikipedia.org/wiki/Who_Wants_to_Be_a_Millionaire_(American_game_show).
3. Redblock AI Team. "How good are LLMs at Trivia Game Shows?" Redblock AI, 2024. https://www.redblock.ai/blog/genai-on-gameshows.
4. Redblock AI Team. "Introducing PARROT – A New LLM Benchmark using Television Trivia Datasets." Redblock AI, 2024. https://www.redblock.ai/blog/parrot-benchmark.
5. Redblock AI Team. "The PARROT Benchmark: How LLMs stack up." Redblock AI, 2024. https://www.redblock.ai/blog/parrot-benchmarking-evaluating-llms.