Microsoft has published the latest version of its small language model, Phi-3.5. This new version is a big upgrade over the previous generation, beating comparable small models from leading players like Google, OpenAI, Mistral, and Meta on several important benchmarks.
Phi-3.5 comes in 3.8 billion, 4.15 billion, and 41.9 billion parameter versions. All three are available to download for free and can be run using a local tool like Ollama.
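As a sketch of that local workflow, here is how the smallest model can be fetched and queried with Ollama (this assumes Ollama is installed and the model is published under the `phi3.5` tag):

```shell
# Download the 3.8 billion parameter Phi-3.5 model from the Ollama registry
ollama pull phi3.5

# Start an interactive chat session with the model
ollama run phi3.5

# Or pass a single prompt non-interactively
ollama run phi3.5 "Summarise the benefits of small language models."
```

Because the model weights live on the local machine, no prompt data leaves the device once the download is complete.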
It performed particularly well at reasoning, where among the leading small models it was beaten only by GPT-4o mini. It also did well on math benchmarks, comfortably surpassing Llama and Gemini.
Small language models like Phi-3.5 demonstrate efficiency improvements in AI and add credence to OpenAI CEO Sam Altman's goal of creating intelligence too cheap to meter.
What’s new in Phi-3.5
"🔥 New Phi-3.5 models are now on the Open LLM Leaderboard! Phi-3.5-MoE-instruct leads all Microsoft models with a 35.1 average score, ranking 1st in the 3B category and 10th among all chat models. Phi-3.5-mini-instruct scored 27.4 points, taking 3rd place in the 3B category…" (August 22, 2024)
Phi-3.5 also comes in a vision version that can understand images as well as text, and a mixture-of-experts version that splits learning tasks across different sub-networks for more efficient processing.
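To make the mixture-of-experts idea concrete, here is a toy sketch (my own illustration, not Microsoft's implementation) of top-2 gating in Python: a router scores every expert, keeps only the two best, and blends just those outputs.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_scores, top_k=2):
    """Keep only the top_k experts and renormalise their gate weights."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])
    return list(zip(chosen, weights))

# Toy "experts": 16 stand-in sub-networks, here just simple functions of x.
experts = [lambda x, k=k: x * (k + 1) for k in range(16)]

def moe_forward(x, gate_scores, top_k=2):
    """Blend the outputs of only the selected experts; the rest stay idle."""
    return sum(w * experts[i](x) for i, w in route(gate_scores, top_k))
```

Because only two of the 16 experts run for any given input, most of the model's parameters sit idle on each token, which is how a large mixture-of-experts model can be cheaper to run than a dense model of the same total size.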
The mixture-of-experts model beats Gemini 1.5 Flash, the model used in the free version of the Gemini chatbot, on multiple benchmarks, and it has a large 128,000-token context window. While that window is significantly smaller than Gemini's own, it matches those of ChatGPT and Claude.
The main benefit of a very small model like the one I installed is that it could be bundled with an application or even installed on an Internet of Things device such as a smart doorbell. This would allow for facial recognition without sending data to the cloud.
The smallest model was trained on 3.4 trillion tokens of data using 512 Nvidia H100 GPUs over 10 days. The mixture-of-experts model comprises 16 experts of 3.8 billion parameters each; it was trained on 4.9 trillion tokens and took 23 days.
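Those figures imply a rough per-GPU throughput, which a quick back-of-the-envelope calculation makes concrete (using only the numbers quoted above for the smallest model):

```python
# Training-run figures for the smallest (3.8 billion parameter) model
tokens = 3.4e12      # 3.4 trillion training tokens
gpus = 512           # Nvidia H100 GPUs
days = 10            # wall-clock training time

tokens_per_gpu_day = tokens / (gpus * days)
tokens_per_gpu_sec = tokens_per_gpu_day / 86_400  # seconds in a day

print(f"{tokens_per_gpu_day:,.0f} tokens per GPU per day")     # roughly 664 million
print(f"{tokens_per_gpu_sec:,.0f} tokens per GPU per second")  # roughly 7,700
```

That is on the order of thousands of tokens processed per second on each GPU, which gives a sense of why small models are so much cheaper to train than frontier-scale ones.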
How well does Phi-3.5 actually work?
I installed and ran the smaller 3.8 billion parameter version of Phi-3.5 on my laptop and found it less impressive than the benchmarks suggest. Its responses were verbose, the phrasing often left a lot to be desired, and it struggled with some simple tests.
I asked it a classic: “Write a short one-sentence story where the first letter of a word is the same as the last letter of the previous word.” Even after clarification, it failed spectacularly.
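The constraint in that prompt is easy to check mechanically. A small helper (my own, not part of any benchmark) shows exactly what a passing answer has to satisfy:

```python
import string

def is_word_chain(sentence):
    """True if each word begins with the last letter of the previous word."""
    words = [w.strip(string.punctuation).lower() for w in sentence.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    return all(nxt[0] == prev[-1] for prev, nxt in zip(words, words[1:]))

is_word_chain("Ducks swim merrily, yet they yawn.")  # True: s-s, m-m, y-y, t-t, y-y
is_word_chain("The cat sat on the mat.")             # False: "the" ends in e, "cat" starts with c
```

Constraints like this are a useful stress test for small models because they require tracking letter-level structure rather than just producing fluent text.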
I haven’t tried the larger mixture-of-experts model. However, judging by the benchmarks, it solves some of the issues I saw with the version I tried. They suggest its output will be of similar quality to OpenAI’s GPT-4o mini, the model that powers the free tier of ChatGPT.
One area where it seems to outperform GPT-4o mini most clearly is STEM and social-science benchmarks. Its mixture-of-experts architecture lets it maintain efficiency while handling complex AI tasks in different languages.