Fortune
Sage Lazzaro

Multimodal AI puts on quite a show, but it’s still in its infancy

(Credit: Duy Ho for Fortune)

Hello and welcome to Eye on AI.

Google this past week made clear it’s not going to let 2023 end without marking its own leap in AI. The tech giant, which has fallen behind OpenAI despite making the crucial research breakthrough that made ChatGPT possible in the first place, finally unveiled Gemini, its long-rumored “largest and most capable” AI model yet. 

The announcement offers a lot to unpack. Gemini—which comes in three increasingly powerful tiers, Nano, Pro, and Ultra—is already powering Bard and a few features on Pixel 8 Pro smartphones. Tomorrow, Gemini will be made available to Google Cloud customers via the Vertex AI platform, and Google also plans to integrate it into other products and services such as Search, Chrome, and Ads. Google touted numerous benchmark wins over OpenAI's models, but because Gemini Ultra, the most powerful tier and the one positioned to compete with GPT-4, won't actually be available until next year, it's too early to draw any firm conclusions.

One thing that's clear no matter how Gemini stacks up against OpenAI's models, however, is that it offers a window into the next era of LLMs, one where multimodality will be the norm. Google built Gemini to be multimodal from the start, meaning it was trained on and can handle combinations of text, image, video, and code prompts, opening up tons of new use cases and user experiences. Google VP Sissie Hsiao called Gemini's multimodal capabilities the "most visually stunning" of the model's advancements while onstage at Fortune's Brainstorm AI event yesterday (more on that later), and leaders across the industry are pointing to multimodality as the obvious next step for the technology.

“I’m not sure people realize how much multimodal AI will become the default, even for regular chatbot applications,” Robert Nishihara, CEO of Anyscale, the company behind the Ray developer framework that’s powered much of the GenAI boom, told Eye on AI. He added that multimodality is going to become “fundamental to the way we interface with these models.” 

If you’re chatting with your insurance company via an AI chatbot, for example, multimodality would make it possible to incorporate photos and videos of the damage into the conversation. It could also help developers by enabling coding co-pilots to preemptively spot issues in code as they write it. During her interview, Hsiao gave the example of how she recently input photos of a restaurant menu and wine menu into Bard and asked it for help creating the ideal pairing. 

While some multimodal models already exist, those capabilities have typically been stitched together on top of text-based LLMs. Text-only LLMs only became broadly viable in the past year or so, and multimodal models are even harder from a technical perspective. Combining all these modalities into a single model from the get-go is far simpler than piecing them together after the fact, Nishihara said, but it has required a fundamental shift at the architecture level. Whereas convolutional neural networks have long been used to process image and video data, Nishihara credits the shift toward using transformers for that data as well with kicking off the recent progress in multimodal AI.
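
To make that architectural shift a little more concrete, here is a rough, illustrative sketch (not Google's or Anyscale's code, and every name and shape below is an assumption chosen for clarity): a transformer can ingest an image much the way it ingests text, by slicing it into fixed-size patches and projecting each patch into the same embedding space used for word tokens, so one model can consume text, images, and video side by side.

```python
import numpy as np

PATCH = 16        # patch edge length in pixels (illustrative)
EMBED_DIM = 768   # shared embedding width for text and image tokens (illustrative)

def image_to_tokens(image: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Split an (H, W, 3) image into PATCH x PATCH tiles and embed each one
    so the resulting rows can sit alongside ordinary text-token embeddings."""
    h, w, c = image.shape
    patches = (
        image[: h - h % PATCH, : w - w % PATCH]            # drop ragged edges
        .reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)  # grid of tiles
        .swapaxes(1, 2)                                    # group each tile's pixels
        .reshape(-1, PATCH * PATCH * c)                    # one flat row per tile
    )
    return patches @ projection                            # (num_patches, EMBED_DIM)

# Fake inputs: a random 224x224 RGB "image" and a random stand-in for a learned projection.
rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
proj = rng.random((PATCH * PATCH * 3, EMBED_DIM))

image_tokens = image_to_tokens(img, proj)
print(image_tokens.shape)  # (196, 768) -- ready to be interleaved with text tokens
```

In a real natively multimodal model the projection is learned and the image tokens flow through the same transformer layers as the text, which is the "single model from the get-go" approach described above, as opposed to bolting a separate vision system onto a text-only LLM.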

Still, multimodal AI faces several limitations and challenges. One of the most significant is the size of multimodal data such as photos and video, which is orders of magnitude larger than text data. That makes building applications more data-intensive and introduces new infrastructure challenges. It also has a massive impact on cost, as running data-intensive workloads on GPUs can be extremely expensive.

Solutions to these issues will come from the hardware space, according to Nishihara. Pointing to how Google's Cloud Tensor Processing Units (TPUs) perform quite well at processing image data, he said we're going to start to see more interest in a variety of hardware accelerators.

“As we work and experiment with more modalities of data, we’re going to see the hardware ecosystem flourish and alleviate some of the resource challenges the industry is experiencing right now,” he said. “That said, we’re still in the early phases and going through growing pains, so I wouldn’t expect that to be visible in the next six months.”

And with that, here’s the rest of this week’s AI news.

Sage Lazzaro
sage.lazzaro@consultant.fortune.com
sagelazzaro.com
