During Build 2024, Microsoft announced a new version of its small language model, Phi-3, capable of analyzing images and telling users what’s in them.
The new version, Phi-3-vision, is a multimodal model, meaning it can interpret both text and images. It joins a recent wave of multimodal releases, including OpenAI’s GPT-4o and Google’s updates to Gemini.
Phi-3-vision is meant for use on mobile devices, weighing in at 4.2 billion parameters. A model’s parameter count is a rough shorthand for how complex it is and how much it can absorb from its training. Microsoft has been building each Phi release on the one before it: Phi-2 learned from Phi-1 and gained new capabilities, and Phi-3 was trained on Phi-2 in the same way, with further capabilities added on top.
Phi-3-vision can perform general visual reasoning tasks, such as analyzing charts and images. Unlike better-known image models such as OpenAI’s DALL-E, Phi-3-vision can only “read” an image; it cannot generate one.
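To make that concrete, here is a minimal sketch of what asking Phi-3-vision about a chart could look like using Hugging Face transformers. The model ID (microsoft/Phi-3-vision-128k-instruct), the <|image_1|> placeholder in the prompt and the trust_remote_code flag are assumptions based on the public preview release, not details confirmed in the announcement.

```python
# Minimal sketch: asking Phi-3-vision to describe a chart.
# The model ID, the <|image_1|> image placeholder and trust_remote_code
# are assumptions from the preview release, not from this article.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Build a chat-style prompt that references one attached image.
messages = [{"role": "user", "content": "<|image_1|>\nWhat does this chart show?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image = Image.open("sales_chart.png")  # any local chart or photo
inputs = processor(prompt, [image], return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated answer, skipping the prompt tokens.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The important point is the shape of the loop: an image plus a text question goes in, and only text comes out.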
Microsoft has released several of these small AI models. They’re designed to run locally, with no internet connection required, and on a wider range of devices than larger models like Google’s Gemini or OpenAI’s ChatGPT. They also reduce the computing power needed for certain tasks, such as solving math problems, which is what Microsoft’s small Orca-Math model does.
The first iteration of Phi-3 was announced in April, when Microsoft released the tiny Phi-3-mini. In benchmark tests, it held up well against larger models like Meta’s Llama 2. The mini model has just 3.8 billion parameters. Two other models, Phi-3-small and Phi-3-medium, come in at 7 billion and 14 billion parameters, respectively.
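Picking up the earlier point about running locally: here is a minimal sketch of loading Phi-3-mini with Hugging Face transformers and asking it a small math question. The model ID (microsoft/Phi-3-mini-4k-instruct) and the chat template call are assumptions drawn from the public release rather than from this article; once the weights are downloaded, inference itself needs no internet connection.

```python
# Minimal sketch: running Phi-3-mini locally with Hugging Face transformers.
# The model ID and chat template are assumptions based on the public release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", trust_remote_code=True
)

# A small math word problem, the kind of task these compact models target.
messages = [
    {"role": "user", "content": "A train covers 60 km in 45 minutes. What is its average speed in km/h?"}
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the model's reply, skipping the prompt tokens.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```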
Phi-3-vision is available in preview right now. The three other Phi-3 models, Phi-3-mini, Phi-3-small and Phi-3-medium, are accessible via the Azure Machine Learning model catalog and collections. To use them, you’ll need a paid Azure account and an Azure AI Studio hub.
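For the hosted route, the sketch below shows roughly what calling a Phi-3 model deployed from the Azure AI Studio catalog could look like. The endpoint URL, the key and the chat-completions-style payload are placeholders and assumptions; the real values come from your own deployment’s details page in Azure AI Studio.

```python
# Hypothetical sketch: calling a Phi-3 deployment created from the Azure AI
# Studio model catalog. The endpoint URL, key and payload shape below are
# placeholders/assumptions; copy the real values from your deployment.
import requests

ENDPOINT = "https://<your-deployment>.<region>.models.ai.azure.com/chat/completions"  # placeholder
API_KEY = "<your-endpoint-key>"  # placeholder

payload = {
    "messages": [
        {"role": "user", "content": "Summarize what a multimodal model is in one sentence."}
    ],
    "max_tokens": 100,
}

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```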