Hallucination Rates for Major AI Models

Another reason to hire a writer and fact-checker for your AI, and to avoid yet another round of public embarrassment.

Vectara tested major language models on 1,000 texts and released the results. The benchmark measures how often an LLM introduces hallucinations when summarizing a document.
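
A minimal sketch of how a summarization-hallucination benchmark like this could be scored. The names here (`model.summarize`, `scorer`) are illustrative placeholders under assumed interfaces, not Vectara's actual evaluation code.

```python
def evaluate(model, documents, scorer, threshold=0.5):
    """Summarize each document and score the three metrics in the table below.

    Assumes `model.summarize(doc)` returns a summary string or None (a refusal),
    and `scorer(doc, summary)` returns a factual-consistency score in [0, 1].
    Returns (accuracy, hallucination_rate, answer_rate) as percentages.
    """
    answered = 0    # documents the model actually summarized
    consistent = 0  # summaries judged factually consistent with the source

    for doc in documents:
        summary = model.summarize(doc)
        if summary is None:          # model declined to answer
            continue
        answered += 1
        if scorer(doc, summary) >= threshold:
            consistent += 1

    accuracy = 100.0 * consistent / answered if answered else 0.0
    return (
        accuracy,                            # % of answered summaries consistent
        100.0 - accuracy,                    # hallucination rate = 100 - accuracy
        100.0 * answered / len(documents),   # answer rate
    )
```

Under this reading, hallucination rate is simply the complement of accuracy over the summaries the model was willing to produce, while answer rate captures how often it produced one at all.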

Updated 11/1/23

| Model | Accuracy | Hallucination Rate | Answer Rate |
| --- | --- | --- | --- |
| GPT 4 | 97.0% | 3.0% | 100.0% |
| GPT 4 Turbo | 97.0% | 3.0% | 100.0% |
| GPT 3.5 Turbo | 96.5% | 3.5% | 99.6% |
| Llama 2 70B | 94.9% | 5.1% | 99.9% |
| Llama 2 7B | 94.4% | 5.6% | 99.6% |
| Llama 2 13B | 94.1% | 5.9% | 99.8% |
| Cohere-Chat | 92.5% | 7.5% | 98.0% |
| Cohere | 91.5% | 8.5% | 99.8% |
| Anthropic Claude 2 | 91.5% | 8.5% | 99.3% |
| Mistral 7B | 90.6% | 9.4% | 98.7% |
| Google Palm | 87.9% | 12.1% | 92.4% |
| Google Palm-Chat | 72.8% | 27.2% | 88.8% |

H/T The Rundown AI