Large Language Models: the talk of the town
Unless someone has been disconnected from the technology world, they are likely aware of the significance of Generative AI in today’s landscape. With the emergence of new opportunities and unfolding questions about the applications of Large Language Models (LLMs), AI remains a central topic of discussion. Large corporations are increasingly adopting this technological marvel of the past decade to enhance their business practices in customer service, data analysis, predictive modeling, and decision-making processes.
With that in mind, how do these LLMs work? Are they difficult to incorporate? And do they genuinely make a significant impact in their applications? The answer is both yes and no. The effectiveness of using an LLM depends on leveraging its strengths, as not every released model excels in every domain. Their capabilities and limitations are directly tied to the data on which they were primarily trained.
One might wonder, wouldn’t it be easier if we could test or supervise the model while it is learning? Ideally, that’s exactly what we would want. However, it’s impractical to manually fact-check the vast amount of information being scanned, parsed, and processed by an LLM when billions of documents are involved. These models are self-learning agents that do not require human supervision to verify the quality of the information they pick up; instead, they recognize patterns and learn from them each time they encounter new information.
Several other ways exist to assess the quality of an LLM's responses. In this context, we will explore its performance in Question and Answering, one of this technology's primary applications.
Before we dive into the details of standard practice, it is important to get familiar with these definitions:
- Metric: A metric is a standard way to measure something. For example, height can be measured in inches or centimeters, depending on your location.
- Benchmarking: Benchmarking involves comparing a candidate LLM against a set of other LLMs using a standard metric on a well-defined task to assess its performance. For example, we can compare a list of LLMs on their text-translation performance using a well-established machine-translation metric such as BLEU [1]; a minimal example follows below.
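To make these definitions concrete, here is a minimal sketch of computing BLEU for a couple of translations, assuming the `sacrebleu` Python package; the sentences are invented purely for illustration.

```python
# Minimal sketch of corpus-level BLEU, assuming the `sacrebleu` package
# (pip install sacrebleu). The sentences below are illustrative only.
import sacrebleu

# Candidate translations produced by the model under test.
hypotheses = [
    "The cat sat on the mat.",
    "He reads a book every evening.",
]

# Gold reference translations: one reference stream, parallel to the hypotheses.
references = [[
    "The cat is sitting on the mat.",
    "He reads a book every night.",
]]

score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"Corpus BLEU: {score.score:.1f}")  # 0-100 scale; higher means closer to the references
```

Benchmarking is then a matter of running the same scoring procedure over the same test set for every candidate LLM and comparing the resulting numbers.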
How are the LLM outputs judged?
The concept is straightforward: to assess the correctness of a response, we use pairs of questions labeled with corresponding answers (often referred to as ‘gold answers’) within a dataset that contains specific domain knowledge. The LLM is tested by asking it questions from this set; it generates a set of candidate answers, which are then validated.
The newly generated response for each question is compared to the labeled correct answer using a metric to assess its correctness. While it would be ideal for experts with domain-specific knowledge to evaluate the output from an LLM, the cost and time required for human judgment make this approach impractical for widespread use. Fortunately, thanks to the efforts of various academic and AI institutions, less precise but more cost-effective evaluation methods have been developed, which we will explore in later sections of this blog.
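In code, that evaluation loop can be as simple as the sketch below; `ask_llm` and the scorer are hypothetical placeholders that show the shape of the procedure, not any particular vendor’s API.

```python
# Illustrative gold-answer evaluation loop. `ask_llm` and `score` are
# placeholders for a model client and a metric, not a specific API.
from typing import Callable

def evaluate(qa_pairs: list[tuple[str, str]],
             ask_llm: Callable[[str], str],
             score: Callable[[str, str], float]) -> float:
    """Average metric score of the LLM's answers against the gold answers."""
    total = 0.0
    for question, gold_answer in qa_pairs:
        candidate = ask_llm(question)            # generate a candidate answer
        total += score(candidate, gold_answer)   # compare it to the gold answer
    return total / len(qa_pairs)

# Example usage with a trivial exact-match scorer:
exact_match = lambda cand, gold: float(cand.strip().lower() == gold.strip().lower())
```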
Evaluation can sometimes be subjective. While this may seem vague, at Redblock we believe evaluation can be influenced by who performs it. Consider this: what if we used an AI to judge or rate the quality of its outputs? This approach, known as 'LLM as the Judge' evaluation, often leads to the well-known problem of hallucination. In AI, hallucination refers to the phenomenon where the model confidently provides an incorrect answer. Even when challenged, the AI might still consider its false answer valid.
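For illustration, an ‘LLM as the Judge’ setup can look like the rough sketch below, here using the OpenAI Python client; the prompt wording, the 1-to-5 scale, and the choice of `gpt-4o` as the judge are our own assumptions for the example.

```python
# Rough sketch of an LLM-as-the-judge call, assuming the `openai` package (v1+)
# and an OPENAI_API_KEY in the environment. The prompt and rating scale are
# illustrative choices, not a standard.
from openai import OpenAI

client = OpenAI()

def judge(question: str, gold_answer: str, candidate: str) -> str:
    prompt = (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Candidate answer: {candidate}\n"
        "Rate the candidate from 1 (wrong) to 5 (fully correct) and explain briefly."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Of course, the judge model can hallucinate a justification for a wrong grade, which is exactly the risk described above.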
As discussed earlier, human evaluation is the preferred method for assessing an LLM's performance. In this approach, the LLM generates a set of possible outputs for a question within a specific domain. A human judge, typically an expert in that domain, selects the answer that appears closest to the correct one and sounds the most natural. The LLM then adjusts its weights (the parameters used to generate its output) to produce similar answers, thereby improving the quality of its text generation.
More importantly, companies that develop large-scale LLMs often prioritize metrics in which their models outperform direct competitors in the market. This focus has led to the creation of sophisticated datasets which allow us to evaluate LLMs across a broad range of topics, extending beyond the standard metrics mentioned earlier. The goal is to ensure that the evaluation process remains fair and challenging, as increasing the difficulty of questions is one of the most effective ways to identify an LLM's weaknesses.
On the other hand, evaluation also depends on the specific objectives of the tasks that an LLM is designed to address. LLMs have various capabilities; they can translate and mask text, generate summaries, tell stories, and answer questions based on the knowledge they have curated during training. These abilities can be measured differently, as shown in the table below. However, the overarching purpose of evaluation remains consistent: to assess the model’s strengths and weaknesses, particularly its factual reasoning across domains; to identify any biases it may have developed during training; and to compare its performance against other LLMs to determine which reasons more accurately and more like a human, without the errors a human might make. Let’s take a look at these metrics:
| Category | Metric | Description |
| --- | --- | --- |
| Text Translation | BLEU | Compares how similar the translated text is to reference translations by checking for common phrases. |
| Text Translation | METEOR | Checks translation accuracy by matching words, considering synonyms and word forms. |
| Text Translation | TER / Minimum Edit Distance | Counts the number of changes needed to make the translation match the reference. |
| Text Masking | Accuracy | Measures how often the model correctly fills in missing words. |
| Text Masking | Perplexity | Shows how confidently the model predicts missing words; lower values are better. |
| Text Classification | Accuracy | Determines the percentage of correct predictions overall. |
| Text Classification | Precision | Determines the percentage of correct positive predictions out of all positive predictions made. |
| Text Classification | Recall | Determines the percentage of correct positive predictions out of all actual positives. |
| Text Classification | F1 Score | Harmonic mean of precision and recall, balancing the two. |
| Text Classification | AUC-ROC | Measures how well the model separates different classes. |
| Text Generation | Perplexity | Indicates how well the model predicts the next word; lower values are better. |
| Text Generation | ROUGE | Compares the generated text to reference text by looking for similar phrases. |
| Text Generation | Human Evaluation | Involves experts rating the quality of text generated by an LLM. |
| Question & Answering | Exact Match (EM) | Percentage of answers that exactly match the correct answer. |
| Question & Answering | F1 Score | Measures overlap between predicted and correct answers, considering both precision and recall. |
| Question & Answering | Mean Reciprocal Rank (MRR) | Averages the reciprocal rank of the first correct answer across a list of predictions. |
| Question & Answering | BLEU | Compares generated answers to reference answers using phrase similarity. |
Table: Standard Metrics for LLM Evaluation
These are a few metrics that can be used to evaluate the responses of an LLM within various types of applications.
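To make the Question & Answering metrics from the table concrete, here is a minimal sketch of Exact Match and token-level F1 in the spirit of SQuAD-style scoring; the normalization (lowercasing and whitespace tokenization) is deliberately simplified for illustration.

```python
# Minimal sketch of Exact Match (EM) and token-level F1 for QA evaluation.
# Normalization is deliberately simplified for illustration.
from collections import Counter

def normalize(text: str) -> list[str]:
    return text.lower().split()

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)  # overlapping tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))           # 1.0
print(f1_score("the city of Paris", "Paris"))  # 0.4 (partial credit)
```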
What are some of the existing methods to evaluate LLMs?
Question and Answering is a crucial and widely applied task for LLMs. It involves decision-making, logical reasoning for problem-solving, information retrieval from documents or larger text corpora, and fact-checking. To evaluate these essential attributes, we use various datasets that allow us to benchmark the performance of LLMs.
- Popular datasets used for LLM evaluation:
The following datasets are curated from a variety of sources and cover a wide range of question types. Here are the most popularly used datasets for benchmarking [2]:
| Dataset | Description | Purpose | Relevance |
| --- | --- | --- | --- |
| Massive Multitask Language Understanding (MMLU) | Measures general knowledge across 57 subjects, ranging from STEM to social sciences. | To assess the LLM's understanding and reasoning in various subject areas. | Ideal for multifaceted AI systems that require extensive world knowledge and problem-solving ability. |
| AI2 Reasoning Challenge (ARC) | Tests LLMs on grade-school science questions that require deep general knowledge and reasoning abilities. | To evaluate the ability to answer complex science questions that require logical reasoning. | Useful for educational AI applications, automated tutoring systems, and general knowledge assessments. |
| General Language Understanding Evaluation (GLUE) | A collection of various language tasks from multiple datasets designed to measure overall language understanding. | To provide a comprehensive assessment of language understanding abilities in different contexts. | Crucial for applications requiring advanced language processing, such as chatbots and content analysis. |
| HellaSwag | Tests natural language inference by requiring LLMs to complete passages in a way that requires understanding intricate details. | To evaluate the model's ability to generate contextually appropriate text continuations. | Useful in content creation, dialogue systems, and applications requiring advanced text generation capabilities. |
| TriviaQA | A reading comprehension test with questions from sources like Wikipedia that demands contextual analysis. | To assess the ability to sift through context and find accurate answers in complex texts. | Suitable for AI systems in knowledge extraction, research, and detailed content analysis. |
| GSM8K | A set of 8.5K grade-school math problems that require basic to intermediate math operations. | To test LLMs’ ability to work through multi-step math problems. | Useful for assessing AI’s capability in solving fundamental mathematical problems valuable in educational contexts. |
| Big-Bench Hard (BBH) | A subset of BIG-Bench focusing on the most challenging tasks requiring multi-step reasoning. | To challenge LLMs with complex tasks demanding advanced reasoning skills. | Important for evaluating the upper limits of AI capabilities in complex reasoning and problem-solving. |
Table: Relevant LLM Benchmarking Datasets
These are just a few of the many datasets used for LLM benchmarking that have become industry standards for AI leaders developing new models. Each of these datasets is specifically designed to evaluate LLMs on targeted tasks within the broader field of Natural Language Processing.
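As a starting point, most of these benchmarks can be pulled directly from the Hugging Face Hub; the sketch below loads GSM8K with the `datasets` library, using the dataset ID and field names published for that dataset.

```python
# Minimal sketch of loading a benchmark dataset, assuming the Hugging Face
# `datasets` package (pip install datasets).
from datasets import load_dataset

# GSM8K: grade-school math word problems with step-by-step gold answers.
gsm8k = load_dataset("openai/gsm8k", "main", split="test")

example = gsm8k[0]
print(example["question"])  # the math word problem
print(example["answer"])    # the worked gold answer, ending in "#### <number>"
```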
- Benchmarking state-of-the-art LLMs:
Now that we’ve reviewed the available datasets and benchmarks, let us examine where the current LLMs stand against some of the benchmarks mentioned above:
| Benchmark | GPT-4o | Claude-3.5 | Gemini-1.5-Pro |
| --- | --- | --- | --- |
| MMLU | 88.7% | 88.7% | 85.9% |
| Code/HumanEval | 87.8% | 92.0% | 82.6% |
| Math | 52.9% | 71.1% | 67.7% |
| Reasoning/GPQA | 53.6% | 59.4% | 46.2% |
| Big-Bench Hard | 86.8% | 93.1% | 89.2% |
Table: Benchmarks for State-of-the-Art LLMs
Each benchmark in the table above measures LLM performance on a specific kind of task.
Reasoning Tasks:
- MMLU contains questions covering a wide range of subjects, including the humanities, STEM, and the social sciences, and is designed to test a model’s general knowledge and reasoning ability.
- GPT-4o and Claude Sonnet tie for the top MMLU score at 88.7%, as shown in the table above, while Gemini-Pro trails at 85.9%. Although Gemini has made strides with its newer model, it still does not stand up to GPT-4o and Claude Sonnet on this benchmark.
Math and Coding Proficiency:
- HumanEval: Evaluates coding ability with programming tasks where the model must generate Python code that passes specific unit tests (a simplified sketch of such a check follows after this list).
- Math (benchmark dataset): Assesses mathematical problem-solving with questions of varying difficulty, each paired with a corresponding gold answer.
- On the HumanEval benchmark, which assesses coding performance, GPT-4o scores 87.8%, trailing Claude Sonnet at 92.0% but clearly ahead of Gemini-Pro at 82.6%. Once again, GPT-4o and Claude Sonnet are the closest competitors.
- On the Math benchmark, which assesses the LLM’s ability to solve mathematical problems, GPT-4o scores 52.9%, well below Claude Sonnet at 71.1% and also behind Gemini-Pro at 67.7%. Claude clearly excels in this area, without much competition from GPT-4o.
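As promised above, here is a simplified, illustrative sketch of the kind of unit-test check behind HumanEval-style coding benchmarks; it is not the official harness, which sandboxes execution, and the `add` task is invented for the example.

```python
# Simplified sketch of a HumanEval-style check: run model-generated code
# against hand-written unit tests. The official harness sandboxes execution;
# never exec untrusted code like this outside an isolated environment.

candidate_code = '''
def add(a, b):
    return a + b
'''  # stands in for code generated by the LLM

namespace: dict = {}
exec(candidate_code, namespace)  # load the generated function into a namespace

def check(candidate) -> bool:
    try:
        assert candidate(2, 3) == 5
        assert candidate(-1, 1) == 0
        return True
    except AssertionError:
        return False

print("pass" if check(namespace["add"]) else "fail")
```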
Reasoning/GPQA:
- Measures how well models handle complex reasoning and problem-solving tasks; the questions typically require graduate-level understanding.
- GPT-4o scores 53.6% on the GPQA benchmark, which measures reasoning abilities. Claude Sonnet surpasses GPT-4o with a score of 59.4%, and Gemini-Pro lags further behind this time with a score of 46.2%. This indicates that while GPT-4o performs decently, Claude has a slight edge in reasoning tasks, and Gemini continues to show room for improvement.
Big-Bench Hard:
- Challenges models with difficult, multi-step tasks that test advanced reasoning skills, spanning a variety of texts and comprehension questions.
- On the Big-Bench Hard benchmark, GPT-4o performs well with a score of 86.8%, but Claude Sonnet again outperforms it with 93.1%, and Gemini-Pro (89.2%) also edges past GPT-4o, as it did on the Math benchmark. On these complex reasoning tasks, Claude consistently delivers the strongest performance, while GPT-4o remains competitive but struggles to keep up across the board against Claude and Gemini.
Conclusion
Based on the evaluations and benchmarks discussed, it is evident that the LLMs tested show distinct performance differences across various tasks. The benchmarking results indicate that while some models excel in specific areas like reasoning or coding, others fall short in critical aspects.
These benchmarks highlight the importance of selecting the right model based on the task requirements. At Redblock AI, we use benchmarks to keep our analysis precise and actionable when choosing the right LLM. The methodologies employed, including metric-based assessments and benchmarking against industry standards, provide clear insights into the strengths and limitations of each model tested.
Bibliography
- Papers with Code, "BLUE Dataset." Retrieved August 14, 2024, from https://paperswithcode.com/dataset/blue ↩︎
- Beeson, L., "LLM Benchmarks." GitHub, 2023. https://github.com/leobeeson/llm_benchmarks ↩︎
- OpenAI, "GPT-4o Mini: Advancing Cost-Efficient Intelligence." OpenAI, 2024. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ ↩︎
- Hugging Face, "Chapter 7: Transformers and transfer learning." Hugging Face, n.d. https://huggingface.co/learn/nlp-course/chapter7/1?fw=pt ↩︎
- Clefourrier, L., "LLM Evaluation." Hugging Face, 2024. https://huggingface.co/blog/clefourrier/llm-evaluation ↩︎
- DeepMind, "Gemini: Flash." DeepMind, 2024. https://deepmind.google/technologies/gemini/flash/ ↩︎
- AllenAI, "AI2 ARC Dataset." Hugging Face, n.d. https://huggingface.co/datasets/allenai/ai2_arc ↩︎
- NYU MLL, "GLUE Dataset." Hugging Face, n.d. https://huggingface.co/datasets/nyu-mll/glue ↩︎
- Zellers, R., "HellaSwag Data." GitHub, n.d. https://github.com/rowanz/hellaswag/tree/master/data ↩︎
- Joshi, M., "TriviaQA Dataset." Hugging Face, n.d. https://huggingface.co/datasets/mandarjoshi/trivia_qa ↩︎
- OpenAI, "GSM8K Dataset." Hugging Face, n.d. https://huggingface.co/datasets/openai/gsm8k ↩︎
- Lukaemon, "BBH Dataset." Hugging Face, n.d. https://huggingface.co/datasets/lukaemon/bbh ↩︎
Thanks to Indus Khaitan, Aviral Srivastava, Basem Rizk, and Raj Khaitan for reading drafts of this.