Hallucination Rates for Major AI Models

Another reason to hire a writer and fact-checker for your AI, and to avoid yet another round of public embarrassment.

Vectara tested major language models on 1,000 texts and released the results. The benchmark measures how often an LLM introduces hallucinations when summarizing a document.
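
A minimal sketch of how a summarization-hallucination benchmark like this could be scored. The names here (`model.summarize`, `scorer`) are illustrative placeholders under assumed interfaces, not Vectara's actual evaluation code.

```python
def evaluate(model, documents, scorer, threshold=0.5):
    """Summarize each document and score the three metrics in the table below.

    Assumes `model.summarize(doc)` returns a summary string or None (a refusal),
    and `scorer(doc, summary)` returns a factual-consistency score in [0, 1].
    Returns (accuracy, hallucination_rate, answer_rate) as percentages.
    """
    answered = 0    # documents the model actually summarized
    consistent = 0  # summaries judged factually consistent with the source

    for doc in documents:
        summary = model.summarize(doc)
        if summary is None:          # model declined to answer
            continue
        answered += 1
        if scorer(doc, summary) >= threshold:
            consistent += 1

    accuracy = 100.0 * consistent / answered if answered else 0.0
    return (
        accuracy,                            # % of answered summaries consistent
        100.0 - accuracy,                    # hallucination rate = 100 - accuracy
        100.0 * answered / len(documents),   # answer rate
    )
```

Under this reading, hallucination rate is simply the complement of accuracy over the summaries the model was willing to produce, while answer rate captures how often it produced one at all.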

Updated 11/1/23

| Model | Accuracy | Hallucination Rate | Answer Rate |
| --- | --- | --- | --- |
| GPT 4 | 97.0% | 3.0% | 100.0% |
| GPT 4 Turbo | 97.0% | 3.0% | 100.0% |
| GPT 3.5 Turbo | 96.5% | 3.5% | 99.6% |
| Llama 2 70B | 94.9% | 5.1% | 99.9% |
| Llama 2 7B | 94.4% | 5.6% | 99.6% |
| Llama 2 13B | 94.1% | 5.9% | 99.8% |
| Cohere-Chat | 92.5% | 7.5% | 98.0% |
| Cohere | 91.5% | 8.5% | 99.8% |
| Anthropic Claude 2 | 91.5% | 8.5% | 99.3% |
| Mistral 7B | 90.6% | 9.4% | 98.7% |
| Google Palm | 87.9% | 12.1% | 92.4% |
| Google Palm-Chat | 72.8% | 27.2% | 88.8% |

H/T The Rundown AI