Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Today’s paper introduces Phi-3-mini, a highly capable 3.8B parameter language model that can run locally on a phone, yet rivals the performance of much larger models like GPT-3.5 and Mixtral 8x7B on academic benchmarks. The key to this performance lies in the training dataset, a scaled-up version of the one used for the previous Phi-2 model, consisting of heavily filtered web data and synthetic data. The report also introduces Phi-3-small (7B) and Phi-3-medium (14B), which achieve even better performance while remaining relatively small in parameter count. After 4-bit quantization, Phi-3-mini is compact enough to run entirely on a modern phone.
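To make the "runs on a phone" claim concrete, below is a minimal sketch of how one might run a quantized Phi-3-mini locally with Hugging Face transformers. The checkpoint name, quantization settings, and generation parameters are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch: load Phi-3-mini with 4-bit quantization and generate text.
# Model ID and settings are assumed, not prescribed by the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint name

# 4-bit quantization keeps the memory footprint small enough for edge devices.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "Explain why the sky is blue in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```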
Method Overview
Phi-3 uses a transformer decoder architecture. It achieves its impressive capabilities by leveraging a carefully curated training dataset, rather than relying solely on massive scale. The training data consists of two main components:
1. Heavily filtered web data: The authors filtered web data from various open internet sources to include only high-quality content at the appropriate "educational level". This ensures the model learns general knowledge and language understanding.
2. Synthetic data: In addition to web data, the model is trained on synthetic data generated by larger language models. This synthetic data is designed to teach the model logical reasoning skills and various niche capabilities.
The training is done in two sequential phases. Phase 1 focuses on the filtered web data to build the model's general knowledge. Phase 2 then incorporates an even more selective subset of the web data combined with the synthetic data to improve the model's reasoning abilities.
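The sketch below illustrates this two-phase schedule as a simple mixture configuration. The token budgets and sampling weights are hypothetical placeholders; the paper does not publish the exact proportions.

```python
# Illustrative sketch of a two-phase pre-training schedule like the one described above.
# All numbers are hypothetical; only the structure (phase 1: filtered web,
# phase 2: selective web + synthetic) follows the paper's description.
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    token_budget: float   # fraction of total training tokens spent in this phase
    mixture: dict         # data source -> sampling weight within the phase

schedule = [
    Phase(
        name="phase-1: general knowledge and language understanding",
        token_budget=0.6,
        mixture={"filtered_web": 1.0},
    ),
    Phase(
        name="phase-2: reasoning and niche skills",
        token_budget=0.4,
        mixture={"selective_web": 0.4, "synthetic": 0.6},
    ),
]

for phase in schedule:
    print(phase.name, phase.token_budget, phase.mixture)
```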
By carefully calibrating the training data mixture to sit in the "data optimal" regime for the model's size, the authors obtain strong performance from a relatively small number of parameters.
Results
The Phi-3 family of models achieves impressive results on a wide range of academic benchmarks measuring reasoning ability. Notably, Phi-3-mini performs close to models like GPT-3.5 despite being over 10x smaller. For example:
- 69% accuracy on the MMLU benchmark (vs 71% for GPT-3.5)
- 8.38 score on the challenging MT-bench (vs 8.35 for GPT-3.5)
The model also undergoes post-training with supervised instruction finetuning and direct preference optimization (DPO) to improve its chat capabilities, robustness, and safety. Evaluations show it has significantly lower rates of harmful or inappropriate responses.
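As a reference point for the preference-optimization step, here is a minimal sketch of the DPO objective. Variable names, shapes, and the beta value are illustrative; this is not the authors' implementation.

```python
# Minimal sketch of the direct preference optimization (DPO) loss.
# Inputs are per-sequence log-probabilities; all values here are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a tensor of per-sequence log-probabilities, shape (batch,)."""
    # Log-ratio of the trained policy vs. the frozen reference model for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the chosen response over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(loss.item())
```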
Conclusion
Phi-3, and especially Phi-3-mini, demonstrates that through careful data curation and optimization it is possible to create highly capable language models that can run on a phone without sacrificing performance. This opens up exciting possibilities for more efficient and accessible AI assistants. For more information, please consult the full paper.
Congrats to the authors for their work!
Microsoft. "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone." arXiv preprint arXiv:2404.14219 (2024).