Hallucination plagues even frontier LLMs--but how bad is it really for summarizing academic papers? We evaluate Factored Verification, a simple automated method for detecting hallucinations in abstractive summaries. This method sets a new SotA on hallucination detection in the summarization task of the HaluEval benchmark, achieving 76.2% accuracy. We then use this method to estimate how often language models hallucinate when summarizing across multiple academic papers and find 0.62 hallucinations in the average ChatGPT (16k) summary, 0.84 for GPT-4, and 1.55 for Claude 2. We ask models to self-correct using Factored Critiques and find that this lowers the number of hallucinations to 0.49 for ChatGPT, 0.46 for GPT-4, and 0.95 for Claude 2. The hallucinations we find are often subtle, so we advise caution when using models to synthesize academic papers.
Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers
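Factored Verification, an automated method for detecting hallucinations in abstractive summaries, reaches 76.2% accuracy on the summarization task of the HaluEval benchmark. Applied to summaries across multiple academic papers, it indicates that ChatGPT (16k), GPT-4, and Claude 2 hallucinate at different rates, and that self-correction with Factored Critiques lowers those rates.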
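To put the self-correction figures from the abstract on a common scale, the short sketch below computes the relative drop in mean hallucinations per summary for each model. It is only an illustration: the dictionary and the helper name `relative_reduction` are not from the paper, and the numbers are the rounded values quoted in the abstract.

```python
# Mean hallucinations per summary as quoted in the abstract:
# (before self-correction, after self-correction with Factored Critiques).
reported = {
    "ChatGPT (16k)": (0.62, 0.49),
    "GPT-4": (0.84, 0.46),
    "Claude 2": (1.55, 0.95),
}


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which the mean hallucination count dropped (hypothetical helper)."""
    return (before - after) / before


reductions = {
    model: relative_reduction(before, after)
    for model, (before, after) in reported.items()
}

for model, reduction in reductions.items():
    print(f"{model}: {reduction:.0%} fewer hallucinations per summary")
```

On these rounded figures the drop is roughly 21% for ChatGPT (16k), 45% for GPT-4, and 39% for Claude 2, so GPT-4 gains the most from self-correction in relative terms even though Claude 2 shows the largest absolute drop.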
- Year
- 2023
- Venue
- arXiv 2023
- Authors
- 2
- Hosting
- Abstract only
Attribution
- Abstract & full text
- arxiv.org/abs/2310.10627
- TL;DR
- Semantic Scholar