Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper

The article compares open-source large language models (LLMs) such as Llama 2 with closed LLMs such as OpenAI's gpt-3.5-turbo and gpt-4, focusing on their factuality in summarization tasks. The experiment used a set of 373 news report statements and asked each LLM to decide which candidate summary of each statement was factually correct. Llama-2-70b and gpt-4 were almost on par in factuality, both nearing human performance. However, Llama-2-70b was found to be 30 times cheaper than gpt-4 for equivalent levels of factuality in summarization. The study also revealed issues with smaller models and gpt-3.5-turbo, including trouble following instructions and ordering bias.
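The swap-based check for ordering bias can be sketched roughly as follows. This is a minimal illustration, not the article's evaluation code: the `Example` type, the `choose` callable (a stand-in for whatever prompts the model and parses its reply), and the scoring rule (counting an example as correct only when the model picks the right option in both orderings) are all assumptions made here.

```python
from typing import Callable, Dict, NamedTuple, Sequence


class Example(NamedTuple):
    statement: str   # sentence from the news report
    correct: str     # factually consistent summary
    incorrect: str   # factually inconsistent summary


# Hypothetical interface: choose(statement, option_a, option_b) returns
# "A" or "B"; any other string counts as not following instructions.
Chooser = Callable[[str, str, str], str]


def evaluate(examples: Sequence[Example], choose: Chooser) -> Dict[str, float]:
    """Ask each question twice with the options swapped and summarize."""
    if not examples:
        raise ValueError("need at least one example")

    both_correct = 0   # right answer in both orderings
    invalid = 0        # replies that are neither "A" nor "B"
    picked_a = 0       # valid replies that chose option A
    valid = 0

    for ex in examples:
        first = choose(ex.statement, ex.correct, ex.incorrect)    # correct is A
        second = choose(ex.statement, ex.incorrect, ex.correct)   # correct is B
        for answer in (first, second):
            if answer in ("A", "B"):
                valid += 1
                picked_a += answer == "A"
            else:
                invalid += 1
        if first == "A" and second == "B":
            both_correct += 1

    n = len(examples)
    return {
        "accuracy": both_correct / n,
        "invalid_rate": invalid / (2 * n),
        # Because every example is asked in both orders, an unbiased model
        # should pick "A" about half the time.
        "share_a": picked_a / valid if valid else 0.0,
    }
```

Under this scoring, a model that always answers "A" gets a `share_a` of 1.0 and an accuracy of 0, which separates ordering bias from genuine factual judgment.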

Key takeaways

The study found that Llama-2-70b, an open-source language model, is almost as accurate as GPT-4 in terms of factuality, and significantly better than GPT-3.5-turbo.
The experiment ran into two practical issues: models not following instructions and ordering bias. Larger models followed instructions more reliably, and ordering bias was checked by swapping the order of the options and asking again.
Although Llama 2's tokenizer produces about 19% more tokens than ChatGPT's for the same text, Llama-2-70b was still found to be 30 times cheaper than GPT-4 for equivalent levels of factuality in summarization.
The study suggests using Llama-2-70b or GPT-4 to increase the chances of a factual summarization, and advises against using smaller Llamas or GPT-3.5-turbo.
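The interaction between the 19% tokenization overhead and the price gap can be written as a short calculation. This is only a sketch of the arithmetic: the function name and parameters are hypothetical, and no actual per-token prices are assumed here, since the summary above does not give them.

```python
def cost_ratio(cheap_price_per_token: float,
               expensive_price_per_token: float,
               token_overhead: float = 1.19) -> float:
    """How many times cheaper the first model is for the same text.

    token_overhead is how many tokens the cheaper model's tokenizer emits
    per token of the more expensive model (1.19 = 19% more tokens).
    Both prices must be in the same unit.
    """
    if cheap_price_per_token <= 0 or token_overhead <= 0:
        raise ValueError("price and overhead must be positive")
    # The cheaper model pays its per-token price on a longer token sequence.
    effective_cheap_price = cheap_price_per_token * token_overhead
    return expensive_price_per_token / effective_cheap_price
```

In other words, a 19% longer tokenization shrinks any raw per-token price advantage by a factor of 1.19, so the headline ratio has to be computed on the cost of the same text rather than on per-token prices alone.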

Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper | Anyscale
