Introduction
At Redblock, we are exploring how large language models (LLMs) handle trivia questions by evaluating and benchmarking them. If you're unfamiliar with these concepts or haven't read our previous article on benchmarking techniques for assessing an LLM's performance, you might find it helpful to start here. Our goal is to find where LLMs are weak at answering questions, particularly by using data from two well-known U.S. television game shows: Jeopardy! and Who Wants to Be a Millionaire.
Using game show data for Question & Answering
Game shows have a wide range of fact-checked questions on a wide range of topics. This makes them a valuable source of human-labeled data, essential for effective benchmarking and ensuring that evaluations are grounded in real-world knowledge.
Let us examine these two game shows to understand why they're ideal for testing an AI's question-answering capabilities.
Jeopardy
Jeopardy! is an American television game show created by Merv Griffin [1]. Unlike traditional quiz competitions, it reverses the typical question-and-answer format: contestants are given general knowledge clues in a specific category and must identify the person, place, thing, or idea the clue describes, phrasing their response as a question.
More importantly, each clue is associated with a monetary value that a contestant wins if they guess the answer correctly.
Consider the following example:
Category: Regions and States
Clue: This state is known as the Sunshine State
Contestant’s Answer: What is Florida?
The game consists of three standard rounds (excluding the specials), namely:
- Jeopardy
- Double Jeopardy
- Final Jeopardy
The Jeopardy and Double Jeopardy rounds each feature a game board of six categories, with five clues per category. The clues are valued by dollar amounts from lowest to highest, ostensibly increasing in difficulty. In the Final Jeopardy round, contestants are given the category and the clue upfront. They then write down their wager amount and their guess on a board. A correct guess adds the wagered amount to their score, while an incorrect guess subtracts it.
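The Final Jeopardy scoring rule described above can be sketched as a small helper. This is a toy illustration of the game mechanic, not part of any benchmark code; the function name and the wager bounds check are our own assumptions:

```python
def final_jeopardy_score(score: int, wager: int, correct: bool) -> int:
    """Apply the Final Jeopardy rule: a correct guess adds the wager
    to the contestant's score, an incorrect guess subtracts it."""
    # Contestants can only wager between $0 and their current total.
    if not 0 <= wager <= max(score, 0):
        raise ValueError("wager must be between 0 and the current score")
    return score + wager if correct else score - wager


# A contestant with $10,000 wagering $4,000:
print(final_jeopardy_score(10_000, 4_000, correct=True))   # ends with $14,000
print(final_jeopardy_score(10_000, 4_000, correct=False))  # ends with $6,000
```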
The novelty of using Jeopardy for QA lies in its potential to evaluate the reasoning ability of an LLM. The monetary value of a clue often correlates with the complexity or ambiguity of the answer—the higher the value, the more challenging or misleading the clue might be. It will be fascinating to see how an LLM navigates these scenarios and what outputs it generates. Redblock is currently exploring this in a novel way.
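One practical wrinkle in grading Jeopardy-format responses is that the model answers in question form ("What is Florida?") while the gold label is usually a bare entity ("Florida"). A minimal normalization sketch of the kind of comparison involved, using our own heuristics rather than Redblock's actual grading code:

```python
import re


def normalize_jeopardy_answer(text: str) -> str:
    """Strip the question framing, leading articles, trailing punctuation,
    and case so a Jeopardy-style response can be compared to a gold answer."""
    text = text.strip().lower()
    # Drop a leading question phrase such as "what is" or "who are".
    text = re.sub(r"^(what|who|where|when)\s+(is|are|was|were)\s+", "", text)
    # Drop a leading article and any trailing punctuation.
    text = re.sub(r"^(a|an|the)\s+", "", text)
    text = re.sub(r"[?.!]+$", "", text)
    return text.strip()


def is_correct(model_response: str, gold_answer: str) -> bool:
    """Exact match after normalizing both sides."""
    return normalize_jeopardy_answer(model_response) == normalize_jeopardy_answer(gold_answer)


print(is_correct("What is Florida?", "Florida"))  # True
```

Real grading needs more than exact string match (aliases, partial credit), but even this simple step avoids penalizing a model for following the show's response format.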
Who Wants to be a Millionaire?
Who Wants to Be a Millionaire is a widely popular game show with various adaptations across the globe. The show first aired in Britain in 1998 and was later adapted in the United States and other countries [2]. For our research, we are evaluating the performance of LLMs using questions from the United States syndicated version of this game show.
If you’re unfamiliar with this show, here’s a quick breakdown of the game structure. The game consists of 15 multiple-choice questions, starting with a $100 question and culminating in a $1,000,000 question at the 15th. A contestant is eliminated if they answer any question incorrectly during their session.
We use questions from Who Wants to Be a Millionaire to assess LLM performance with multiple-choice questions across various fields [3], where difficulty increases as the game progresses. This method contrasts with the Jeopardy format discussed earlier and offers a valuable comparison of how an LLM performs when it is given a vague clue versus a straightforward question.
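Since both shows tie every question to a dollar value, a simple way to compare performance across difficulty tiers is accuracy bucketed by value. A sketch of that aggregation, where the input layout (value, correct) pairs is our own assumption rather than a specific dataset schema:

```python
from collections import defaultdict


def accuracy_by_value(results):
    """Compute per-tier accuracy from graded answers.

    `results` is an iterable of (question_value, was_correct) pairs,
    e.g. the graded output of one evaluation run.
    """
    totals = defaultdict(lambda: [0, 0])  # value -> [correct, attempted]
    for value, correct in results:
        totals[value][1] += 1
        if correct:
            totals[value][0] += 1
    return {
        value: correct / attempted
        for value, (correct, attempted) in sorted(totals.items())
    }


# Two $100 questions answered correctly, one of two $1,000,000 questions missed:
print(accuracy_by_value([(100, True), (100, True), (1_000_000, False), (1_000_000, True)]))
# {100: 1.0, 1000000: 0.5}
```

A falling curve as the value rises would suggest the model struggles with the harder, later-stage questions; a flat curve would suggest the show's difficulty ladder doesn't translate to the model.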
Advantages of employing questions from these game shows
- Data Quality: A validated test set is essential for LLMs to ensure that their performance is accurately measured and compared to other models, preventing inflated benchmarks and promoting fair evaluation. Since these questions are sourced from a game show, they are generally reliable, allowing us to use them without concern about the authenticity of the question-answer pairs.
- Variety: These questions cover a wide range of topics, from music and science to history and literature, which lets us test knowledge across many domains rather than a single subject such as math.
- Quantity: These shows have been on the air for several decades, amassing a wealth of knowledge that is ripe for use in LLM testing. The sheer volume of curated questions offers a rich resource for evaluating LLM performance.
Disadvantages of Using Game Show Questions for LLM QA Evaluation
- Dataset Curation: A significant challenge in using data from game shows is finding a reliable dataset. Even though one might expect to find plenty on platforms like Kaggle or Google Dataset Search, our initial search didn't turn up much. And even the few datasets we did find often lacked clear information about where they came from, making it hard to trust their accuracy.
- Domain Bias: While game shows cover a broad range of topics, they may still skew toward domains that are more popular or whose information is more easily accessible.
- Outdated Information: Questions can become outdated, especially in contexts like trivia games or quizzes where knowledge is time-sensitive. For instance, a question about the artist with the most awards in 2014 could have a different answer in 2024, reflecting a shift in the relevance of such data over time.
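The curation issues above lend themselves to a few mechanical checks before a dataset is used for evaluation. A sketch of that kind of filtering, where the field names, duplicate rule, and year-based staleness heuristic are illustrative assumptions, not Redblock's actual pipeline:

```python
import re


def validate_rows(rows):
    """Filter a raw trivia dataset: drop rows with missing fields and
    exact duplicates, and flag time-sensitive questions for review.

    Each row is assumed to be a dict with 'question' and 'answer' keys.
    Returns (clean_rows, flagged_rows).
    """
    seen = set()
    clean, flagged = [], []
    for row in rows:
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()
        if not question or not answer:
            continue  # incomplete question-answer pair
        key = (question.lower(), answer.lower())
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        # Questions anchored to a specific year are candidates for going stale.
        if re.search(r"\b(19|20)\d{2}\b", question):
            flagged.append(row)
        else:
            clean.append(row)
    return clean, flagged
```

Checks like these catch the obvious problems; provenance (where a question-answer pair actually came from) still has to be verified by hand or against the original broadcast records.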
Redblock is actively working to address these gaps by curating high-quality datasets, which we plan to release as open source to the community in the coming weeks. We are committed to advancing LLM evaluation: providing meaningful insights, tackling the challenges of testing LLMs on their knowledge, and addressing the shortcomings in the current set of benchmarks.
Bibliography
- Wikipedia, "Jeopardy!" In Wikipedia, The Free Encyclopedia. Retrieved August 13, 2024. https://en.wikipedia.org/wiki/Jeopardy!#Gameplay
- Wikipedia, "Who Wants to Be a Millionaire (American game show)." In Wikipedia, The Free Encyclopedia. Retrieved August 13, 2024. https://en.wikipedia.org/wiki/Who_Wants_to_Be_a_Millionaire_(American_game_show)
- "Jackie Edmiston." Millionaire Wiki, Fandom. Retrieved August 13, 2024. https://millionaire.fandom.com/wiki/Jackie_Edmiston
Thanks to Indus Khaitan, Aviral Srivastava, Basem Rizk, and Raj Khaitan for reading drafts of this.