Better Siri is coming: what Apple’s research says about its AI plans
Image: Cath Virginia / The Verge
Apple hasn’t talked much about AI so far, but it’s been working on stuff. A lot of stuff.
It would be easy to think that Apple is late to the game on AI. Since late 2022, when ChatGPT took the world by storm, most of Apple’s competitors have fallen over themselves to catch up. While Apple has certainly talked about AI and even released some products with AI in mind, it seemed to be dipping a toe in rather than diving in headfirst.
But over the last few months, rumors and reports have suggested that Apple has, in fact, just been biding its time, waiting to make its move. There have been reports in recent weeks that Apple is talking to both OpenAI and Google about powering some of its AI features, and the company has also been working on its own model, called Ajax.
If you look through Apple’s published AI research, a picture starts to develop of how Apple’s approach to AI might come to life. Now, obviously, making product assumptions based on research papers is a deeply inexact science — the line from research to store shelves is winding and full of potholes. But you can at least get a sense of what the company is thinking about — and how its AI features might work when Apple starts to talk about them at its annual developer conference, WWDC, in June.
Smaller, more efficient models
I suspect you and I are hoping for the same thing here: Better Siri. And it looks very much like Better Siri is coming! There’s an assumption in a lot of Apple’s research (and in a lot of the tech industry, the world, and everywhere) that large language models will immediately make virtual assistants better and smarter. For Apple, getting to Better Siri means making those models as fast as possible — and making sure they’re everywhere.
In iOS 18, Apple plans to have all its AI features running on an on-device, fully offline model, Bloomberg recently reported. It’s tough to build a good multipurpose model even when you have a network of data centers and thousands of state-of-the-art GPUs — it’s drastically harder to do it with only the guts inside your smartphone. So Apple’s having to get creative.
In a paper called “LLM in a flash: Efficient Large Language Model Inference with Limited Memory” (all these papers have really boring titles but are really interesting, I promise!), researchers devised a system for storing a model’s data, which would normally live in your device’s RAM, on the SSD instead. “We have demonstrated the ability to run LLMs up to twice the size of available DRAM [on the SSD],” the researchers wrote, “achieving an acceleration in inference speed by 4-5x compared to traditional loading methods in CPU, and 20-25x in GPU.” By taking advantage of the most inexpensive and available storage on your device, they found, the models can run faster and more efficiently.
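To make the idea concrete, here’s a minimal sketch of the general technique — keeping a weight matrix on storage and pulling in only the rows a sparsely activated layer actually needs. This is not Apple’s implementation; it just uses NumPy’s memory-mapping as a stand-in for flash-aware loading, and the file name and shapes are made up:

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a weight matrix too large to keep resident in RAM.
rows, cols = 4096, 64
rng = np.random.default_rng(0)
weights = rng.standard_normal((rows, cols)).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "ffn_weights.npy")
np.save(path, weights)

# Memory-map the file: the OS pages weight rows in from storage
# only when they are touched, so resident memory stays small.
mapped = np.load(path, mmap_mode="r")

def sparse_forward(active_rows, x):
    """Compute only the rows a sparsely activated layer needs,
    reading just those rows from storage."""
    w = np.asarray(mapped[active_rows])  # loads only these rows
    return w @ x

x = np.ones(cols, dtype=np.float32)
out = sparse_forward([3, 17, 2048], x)
print(out.shape)  # (3,)
```

The same principle — treat flash as slow, huge memory and fetch weights on demand — is what lets a model larger than DRAM run at all.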
Apple’s researchers also created a system called EELBERT that can essentially compress an LLM into a much smaller size without making it meaningfully worse. Their compressed take on Google’s BERT model was 15 times smaller — only 1.2 megabytes — and saw only a 4 percent reduction in quality. It did come with some latency tradeoffs, though.
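EELBERT’s core trick is computing embeddings on the fly from hash functions rather than storing a giant lookup table. Here’s a generic hashed-embedding sketch of that idea — the function names, dimensions, and hash scheme are mine, not from the paper:

```python
import hashlib

import numpy as np

def hashed_embedding(token: str, dim: int = 32, n_hashes: int = 4) -> np.ndarray:
    """Derive a token embedding from hash functions instead of a stored
    lookup table: near-zero storage, a little extra compute per token."""
    vec = np.zeros(dim, dtype=np.float32)
    for i in range(n_hashes):
        digest = hashlib.sha256(f"{i}:{token}".encode()).digest()
        idx = int.from_bytes(digest[:4], "little") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[idx] += sign
    return vec / np.sqrt(n_hashes)

a = hashed_embedding("hello")
b = hashed_embedding("hello")
print(np.array_equal(a, b))  # True: deterministic, so nothing needs storing
```

Swapping a multi-hundred-megabyte embedding table for a deterministic function is where the dramatic size reduction comes from — and recomputing hashes per token is where the latency tradeoff comes from.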
In general, Apple is pushing to solve a core tension in the model world: the bigger a model gets, the better and more useful it can be, but also the more unwieldy, power-hungry, and slow it can become. Like so many others, the company is trying to find the right balance between all those things while also looking for a way to have it all.
Siri, but good
A lot of what we talk about when we talk about AI products is virtual assistants — assistants that know things, that can remind us of things, that can answer questions, and that can get stuff done on our behalf. So it’s not exactly shocking that a lot of Apple’s AI research boils down to a single question: what if Siri was really, really, really good?
A group of Apple researchers has been working on a way to use Siri without needing a wake word at all; instead of listening for “Hey Siri” or “Siri,” the device might be able to simply intuit whether you’re talking to it. “This problem is significantly more challenging than voice trigger detection,” the researchers acknowledged, “since there might not be a leading trigger phrase that marks the beginning of a voice command.” That might be why another group of researchers developed a system to more accurately detect wake words. Another paper trained a model to better understand rare words, which assistants often handle poorly.
In both cases, the appeal of an LLM is that it can, in theory, process much more information much more quickly. In the wake-word paper, for instance, the researchers found that by not trying to discard all unnecessary sound but, instead, feeding it all to the model and letting it process what does and doesn’t matter, the wake word worked far more reliably.
Once Siri hears you, Apple’s doing a bunch of work to make sure it understands and communicates better. In one paper, it developed a system called STEER (which stands for Semantic Turn Extension-Expansion Recognition, so we’ll go with STEER) that aims to improve your back-and-forth communication with an assistant by trying to figure out when you’re asking a follow-up question and when you’re asking a new one. Another paper uses LLMs to better understand “ambiguous queries” and figure out what you mean no matter how you say it. “In uncertain circumstances,” they wrote, “intelligent conversational agents may need to take the initiative to reduce their uncertainty by asking good questions proactively, thereby solving problems more effectively.” Another paper aims to help with that, too: researchers used LLMs to make assistants less verbose and more understandable when they’re generating answers.
Image: Apple
AI in health, in image editors, in your Memojis
Whenever Apple does talk publicly about AI, it tends to focus less on raw technological might and more on the day-to-day stuff AI can actually do for you. So, while there’s a lot of focus on Siri — especially as Apple looks to compete with devices like the Humane AI Pin, the Rabbit R1, and Google’s ongoing smashing of Gemini into all of Android — there are plenty of other ways Apple seems to see AI being useful.
One obvious place for Apple to focus is on health: LLMs could, in theory, help wade through the oceans of biometric data collected by your various devices and help you make sense of it all. So, Apple has been researching how to collect and collate all of your motion data, how to use gait recognition and your headphones to identify you, and how to track and understand your heart rate data. Apple also created and released “the largest multi-device multi-location sensor-based human activity dataset” available after collecting data from 50 participants with multiple on-body sensors.
Apple also seems to imagine AI as a creative tool. For one paper, researchers interviewed a bunch of animators, designers, and engineers and built a system called Keyframer that “enable[s] users to iteratively construct and refine generated designs.” Instead of typing in a prompt and getting an image, then typing another prompt to get another image, you start with a prompt but then get a toolkit to tweak and refine parts of the image to your liking. You could imagine this kind of back-and-forth artistic process showing up anywhere from the Memoji creator to some of Apple’s more professional artistic tools.
In another paper, Apple describes a tool called MGIE that lets you edit an image just by describing the edits you want to make. (“Make the sky more blue,” “make my face less weird,” “add some rocks,” that sort of thing.) “Instead of brief but ambiguous guidance, MGIE derives explicit visual-aware intention and leads to reasonable image editing,” the researchers wrote. Its initial experiments weren’t perfect, but they were impressive.
We might even get some AI in Apple Music: for a paper called “Resource-constrained Stereo Singing Voice Cancellation,” researchers explored ways to separate voices from instruments in songs — which could come in handy if Apple wants to give people tools to, say, remix songs the way you can on TikTok or Instagram.
Image: Apple
Over time, I’d bet this is the kind of stuff you’ll see Apple lean into, especially on iOS. Some of it Apple will build into its own apps; some it will offer to third-party developers as APIs. (The recent Journaling Suggestions feature is probably a good guide to how that might work.) Apple has always trumpeted its hardware capabilities, particularly compared to your average Android device; pairing all that horsepower with on-device, privacy-focused AI could be a big differentiator.
But if you want to see the biggest, most ambitious AI thing going at Apple, you need to know about Ferret. Ferret is a multi-modal large language model that can take instructions, focus on something specific you’ve circled or otherwise selected, and understand the world around it. It’s designed for the now-normal AI use case of asking a device about the world around you, but it might also be able to understand what’s on your screen. In the Ferret paper, researchers show that it could help you navigate apps, answer questions about App Store ratings, describe what you’re looking at, and more. This has really exciting implications for accessibility but could also completely change the way you use your phone — and your Vision Pro and / or smart glasses someday.
We’re getting way ahead of ourselves here, but you can imagine how this would work with some of the other stuff Apple is working on. A Siri that can understand what you want, paired with a device that can see and understand everything that’s happening on your display, is a phone that can literally use itself. Apple wouldn’t need deep integrations with everything; it could simply run the apps and tap the right buttons automatically.
Again, all this is just research, and for all of it to work well starting this spring would be a legitimately unheard-of technical achievement. (I mean, you’ve tried chatbots — you know they’re not great.) But I’d bet you anything we’re going to get some big AI announcements at WWDC. Apple CEO Tim Cook even teased as much in February, and basically promised it on this week’s earnings call. And two things are very clear: Apple is very much in the AI race, and it might amount to a total overhaul of the iPhone. Heck, you might even start willingly using Siri! And that would be quite the accomplishment.