Another reason to hire a writer and fact-checker for your AI and avoid embarrassment like this, this, this, or this.
Vectara tested major language models on 1,000 texts and released the results. The benchmark measures how often an LLM introduces hallucinations when summarizing a document.
Updated 11/1/23
| Model | Accuracy | Hallucination Rate | Answer Rate |
|---|---|---|---|
| GPT 4 | 97.0 % | 3.0 % | 100.0 % |
| GPT 4 Turbo | 97.0 % | 3.0 % | 100.0 % |
| GPT 3.5 Turbo | 96.5 % | 3.5 % | 99.6 % |
| Llama 2 70B | 94.9 % | 5.1 % | 99.9 % |
| Llama 2 7B | 94.4 % | 5.6 % | 99.6 % |
| Llama 2 13B | 94.1 % | 5.9 % | 99.8 % |
| Cohere-Chat | 92.5 % | 7.5 % | 98.0 % |
| Cohere | 91.5 % | 8.5 % | 99.8 % |
| Anthropic Claude 2 | 91.5 % | 8.5 % | 99.3 % |
| Mistral 7B | 90.6 % | 9.4 % | 98.7 % |
| Google Palm | 87.9 % | 12.1 % | 92.4 % |
| Google Palm-Chat | 72.8 % | 27.2 % | 88.8 % |
H/T The Rundown AI
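
For context on how the three columns relate: hallucination rate is simply 100 % minus accuracy, and both are computed only over the documents a model actually agreed to summarize (the answer rate). Here is a minimal sketch of that arithmetic in Python, with a placeholder `is_consistent` judge standing in for Vectara's own hallucination-evaluation model (the judge function is an assumption, not their implementation):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Placeholder judge: returns True if a summary is factually consistent
# with its source document. Vectara uses a trained evaluation model here;
# this callable is purely illustrative.
Judge = Callable[[str, str], bool]

@dataclass
class LeaderboardRow:
    accuracy: float            # % of produced summaries judged consistent
    hallucination_rate: float  # 100 - accuracy
    answer_rate: float         # % of documents the model agreed to summarize

def score_model(documents: List[str],
                summaries: List[Optional[str]],
                is_consistent: Judge) -> LeaderboardRow:
    # A summary of None means the model refused to answer for that document.
    answered = [(doc, summ) for doc, summ in zip(documents, summaries)
                if summ is not None]
    answer_rate = 100.0 * len(answered) / len(documents)

    consistent = sum(1 for doc, summ in answered if is_consistent(doc, summ))
    accuracy = 100.0 * consistent / len(answered) if answered else 0.0

    return LeaderboardRow(accuracy=round(accuracy, 1),
                          hallucination_rate=round(100.0 - accuracy, 1),
                          answer_rate=round(answer_rate, 1))
```

So a model that summarizes 996 of 1,000 documents and stays consistent on 961 of those would score roughly a 99.6 % answer rate, 96.5 % accuracy, and 3.5 % hallucination rate, matching the row format above.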