Why a new LLM Benchmark?
Evaluating the performance of a Large Language Model (LLM) is crucial for understanding its ability to answer questions accurately and effectively. Ideally, this evaluation would involve domain-specific human experts manually judging the generated outputs. However, the cost of hiring such experts is often prohibitive, especially for small businesses that need to benchmark LLMs for their specific use cases. In our previous blog posts, we discussed various aspects of LLM evaluation, including methods to quantify performance, available datasets for benchmarking, and the potential of using game show data to test an LLM's knowledge. To address these challenges, we developed PARROT using data from popular game shows.
Introducing PARROT: Performance Assessment of Reasoning and Responses on Trivia
PARROT is a novel dataset designed to evaluate the question-answering capabilities of LLMs in a more challenging and human-centric context. Unlike traditional datasets that focus solely on accuracy, PARROT leverages the dynamic and interactive nature of trivia game shows to test LLMs' abilities to:
- Understand and respond to complex questions.
- Categorize knowledge and synthesize coherent answers.
- Exhibit common-sense reasoning and real-world knowledge.
- Handle uncertainty and ambiguity in a way that mimics human behavior during trivia.
Purpose of PARROT
PARROT tests LLMs' ability to mimic human behavior in several key areas:
- Conversational fluency: Tests their ability to understand and respond to context-dependent questions, a skill essential for human-like communication.
- Common sense reasoning: Many questions require LLMs to apply common-sense knowledge and reasoning to arrive at a correct answer. This tests their ability to understand and interpret information in a way that is consistent with human intuition.
- Real-world knowledge: PARROT's questions cover a wide range of topics, from historical events to popular culture. This tests LLMs' ability to access and apply relevant real-world knowledge, a crucial skill for effective communication.
- Handling uncertainty: Some questions are designed to be ambiguous or have multiple correct interpretations. This tests LLMs' ability to handle uncertainty and provide informative responses even when the information is incomplete or contradictory.
PARROT: A Dual-Dataset Approach
PARROT comprises two distinct datasets:
- PARROT-Jeopardy: A dataset of questions from the show Jeopardy, whose short, concise clues test topic breadth, reasoning, and ambiguity handling.
- PARROT-Millionaire: A dataset of questions from the show Who Wants to Be a Millionaire, whose straightforward style and broad range of topics make it well suited to evaluating an LLM's knowledge.
By combining these two datasets, PARROT offers a more comprehensive evaluation of LLMs’ question-answering capabilities, testing their ability to handle different question styles and sentence structures and providing insights into their strengths and weaknesses across various domains.
Why These Two Game Shows for PARROT?
I spoke to Indus, Founder and CEO of Redblock, about this choice of game shows for our dataset:
“I have been a quizzing nerd since my middle school days. When the Millionaire show launched in India, I qualified to be a contestant in the show's first season hosted by the legendary actor Amitabh Bachchan.
And I haven't missed a Jeopardy episode in many years!
When considering this problem of benchmarking the LLMs, I thought, why not use the game show data? The shows span decades, the clues are cryptic, and the questions cover multiple categories; the best part is that the answers are already available.”
Curating the Millionaire Dataset
We began by searching for an existing Millionaire game show dataset to build on this idea. Given the show's global reach and decades-long run [1], one would assume there would be a wealth of data out there, but that wasn't the case.
So, we took matters into our own hands at Redblock and curated the PARROT-Millionaire dataset by scraping and organizing data from the Millionaire Fandom site [2]. We will write a separate blog on our steps to scrape and curate this dataset.
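As a rough illustration of what that curation involves, here is a minimal scraping sketch in Python using requests and BeautifulSoup. The URL and CSS selectors are hypothetical placeholders, not the actual Fandom page structure:

```python
import requests
from bs4 import BeautifulSoup

def scrape_episode(url: str) -> list[dict]:
    """Fetch one episode page and extract its question/answer blocks.
    The selectors below are hypothetical; the real Fandom pages need
    markup-specific parsing."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    samples = []
    for block in soup.select("div.question-block"):  # hypothetical selector
        samples.append({
            "question_info": block.select_one(".question-info").get_text(strip=True),
            "question": block.select_one(".question-text").get_text(strip=True),
            "options": [li.get_text(strip=True) for li in block.select("li.option")],
            "correct_answer": block.select_one(".correct").get_text(strip=True),
        })
    return samples

# Hypothetical page URL; the real episode pages live under millionaire.fandom.com.
rows = scrape_episode("https://millionaire.fandom.com/wiki/Some_Episode")
```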
The PARROT-Millionaire dataset has 22,698 samples (questions and answers). Each sample was scraped from the fandom page and includes features such as question type, difficulty level, and answer choices.
| Field Name | Description |
| --- | --- |
| question_info | The price value and the current question number. |
| question | The question text. |
| options | The four answer options for the question. |
| correct_answer | The correct answer to the question. |
| price | An engineered feature derived from question_info that gives the dollar value of the question. |
| normalized_options | An engineered feature that gives the normalized text of the options. |
| normalized_correct_opt | An engineered feature that gives the normalized text of the correct_answer. |

Table: PARROT-Millionaire features explained.
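To make the engineered fields concrete, here is a small sketch of the kind of derivation involved. The question_info format and the normalization rules shown are simplified assumptions, not the exact curation code:

```python
import re

def extract_price(question_info: str) -> int:
    """Pull the dollar value out of a question_info string such as
    "$16,000 (Question 8)". The input format shown is an assumption."""
    match = re.search(r"\$([\d,]+)", question_info)
    return int(match.group(1).replace(",", "")) if match else 0

def normalize_text(text: str) -> str:
    """Simplified normalization: lowercase, drop punctuation, collapse spaces."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

sample = {
    "question_info": "$16,000 (Question 8)",  # hypothetical example value
    "options": ["Mount Everest", "K2", "Kangchenjunga", "Lhotse"],
    "correct_answer": "Mount Everest",
}
sample["price"] = extract_price(sample["question_info"])                  # 16000
sample["normalized_options"] = [normalize_text(o) for o in sample["options"]]
sample["normalized_correct_opt"] = normalize_text(sample["correct_answer"])
```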
Curating the Jeopardy Dataset
Jeopardy has been a staple of American television since Merv Griffin created it in 1964. With over 8,000 episodes across 40 seasons, curating a dataset from all seasons would be impractical due to the sheer volume of data and the complexity of processing such a large set [3]. Although Jeopardy datasets are available on open-source platforms like Kaggle, GitHub, and Hugging Face, at Redblock we've curated a version tailored to our specific requirements.
We selected seven key seasons—1, 2, 12, 13, 19, 33, and 34—to ensure a representative sample across the show's timeline. We scraped data from the J! Archive, a fan-created archive containing over 500,000 clues [4], to create the PARROT-Jeopardy dataset.
The PARROT-Jeopardy dataset features clues as questions, allowing us to gauge an LLM's reasoning and ambiguity-handling ability. PARROT-Millionaire, in contrast, focuses on straightforward questions, so together the two datasets assess an LLM's ability to answer questions posed in more than one structure. PARROT-Jeopardy comprises a total of 61,462 samples and includes features such as category, clue format, and difficulty level [5].
| Field Name | Description |
| --- | --- |
| ep_num | The episode number within the season. |
| air_date | The date on which the episode aired. |
| extra_info | Additional information about the episode, including the host's name. |
| round_name | The round being played (e.g., Jeopardy, Double Jeopardy, Final Jeopardy). |
| coord | The coordinates of the clue on the game board. |
| category | The category to which the clue belongs. |
| value | The monetary value of the clue. |
| daily_double | A boolean indicating whether the clue is a Daily Double. |
| question | The clue text within the category. |
| answer | The labeled answer or guess. |
| correct_attempts | The number of contestants who answered correctly. |
| wrong_attempts | The number of contestants who answered incorrectly. |

Table: PARROT-Jeopardy features explained.
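If you'd like to explore the data yourself, the sketch below loads PARROT with the Hugging Face datasets library. The repository path comes from the dataset reference [5]; the subset and split names are assumptions, so check the dataset card for the actual ones:

```python
from datasets import load_dataset

# Repository path taken from [5]; the subset names "millionaire" and
# "jeopardy" and the "train" split are assumptions to verify on the card.
millionaire = load_dataset("redblock/parrot", "millionaire", split="train")
jeopardy = load_dataset("redblock/parrot", "jeopardy", split="train")

clue = jeopardy[0]
print(clue["category"], clue["value"])   # category and dollar value
print(clue["question"])                  # the clue text
print(clue["answer"])                    # the labeled answer
print(clue["daily_double"])              # whether the clue was a Daily Double
```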
PARROT Benchmarking
Now that we’ve introduced these brand-new datasets, we have some interesting observations coming up. We benchmarked some of the best LLMs against them using our own metric. The metric is built into the benchmarking, which makes PARROT more than just a dataset: it is a framework for gauging an LLM’s ability to answer questions accurately and effectively. Keep an eye out for our next blog, where we’ll share how these LLMs perform at handling trivia questions to win the prize.
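As a taste of what such a benchmarking harness can look like, here is a minimal evaluation loop over PARROT-Millionaire. The exact-match scoring below is only a placeholder, not the metric built into PARROT, which we'll describe in the next post; the dataset path and subset names are likewise assumptions:

```python
from datasets import load_dataset

def ask_llm(question: str, options: list[str]) -> str:
    """Stand-in for a call to whichever LLM is being benchmarked.
    Here it naively returns the first option; replace with a real model call."""
    return options[0]

# Assumed repository path and subset/split names; see the dataset card.
millionaire = load_dataset("redblock/parrot", "millionaire", split="train")

subset = millionaire.select(range(100))  # small slice for a quick check
correct = 0
for sample in subset:
    prediction = ask_llm(sample["question"], sample["options"])
    if prediction.strip().lower() == sample["correct_answer"].strip().lower():
        correct += 1

# Exact match is a placeholder metric, not PARROT's built-in one.
print(f"Exact-match accuracy: {correct / len(subset):.2%}")
```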
Bibliography
- Wikipedia. "Who Wants to Be a Millionaire (American game show)." In Wikipedia, The Free Encyclopedia. Retrieved August 13, 2024. https://en.wikipedia.org/wiki/Who_Wants_to_Be_a_Millionaire_(American_game_show) ↩︎
- "Jackie Edmiston." Millionaire Wiki, Fandom. Retrieved August 13, 2024. https://millionaire.fandom.com/wiki/Jackie_Edmiston ↩︎
- Encyclopaedia Britannica. "Jeopardy! (American television game show)." In Encyclopaedia Britannica. Retrieved August 13, 2024. https://www.britannica.com/topic/Jeopardy-American-television-game-show ↩︎
- J! Archive. Retrieved August 14, 2024. https://j-archive.com/ ↩︎
- Redblock AI Team. (2024). PARROT: Performance Assessment of Reasoning and Responses on Trivia. Redblock. https://huggingface.co/datasets/redblock/parrot ↩︎
Thanks to Indus Khaitan, Aviral Srivastava, Basem Rizk, and Raj Khaitan for reading drafts of this.