How to Identify AI-Generated Speech
You never really know who's talking these days.
Credit: Stacey Zhu
This post is part of Lifehacker’s “Exposing AI” series. We’re exploring six different types of AI-generated media, and highlighting the common quirks, byproducts, and hallmarks that help you tell the difference between artificial and human-created content.
In recent years, AI technologies have made it possible to clone someone else's voice and make that "person" say anything you want. You don't even need to be an expert to do it: A quick Google search, and you can have anyone from President Biden to SpongeBob speak your words. It's fascinating, hilarious, and terrifying.
AI voice technology can be used for good: Apple's Personal Voice feature, for example, lets you create a version of your own voice you can use for text-to-speech, designed for people who are losing the ability to speak for themselves. It's amazing that we have the ability to preserve people's voices, so rather than use a generic TTS voice, their words really sound like their own.
Of course, there's the other side of the coin: the potential for rampant misinformation. When the current tech makes it all too easy to make anyone say anything, how can you trust that what you're listening to online was actually said?
How AI voice generators work
Like text and image generators, AI voice generators are built on models trained on massive data sets, in this case samples of recorded speech. For a sense of scale, OpenAI's Whisper model (a speech-to-text system rather than a voice generator) was trained on 680,000 hours of audio. That's how a model learns to replicate not only the words themselves, but the other elements of speech, such as tone and pace.
Once the model is trained, however, it doesn't need much data to replicate a voice. Give it just five minutes' worth of recordings and you might not be blown away by the results, but some generators can produce a recognizable imitation from even that little audio. Give it more data, and it will replicate the voice more accurately.
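To get a feel for how low the barrier to entry is, here's a minimal sketch of few-shot voice cloning using the open-source Coqui TTS library. The model name is a real Coqui model, but the file paths and the reference clip are placeholders you'd supply yourself:

```python
# A rough sketch of few-shot voice cloning with the open-source Coqui
# TTS library (installable via pip as "TTS"). File paths are placeholders.
from TTS.api import TTS

# XTTS v2 is a multilingual model that can clone a voice from a short
# reference clip, with no per-voice retraining required.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Hey, what's up? Thinking about heading to the movies tonight.",
    speaker_wav="reference_clip.wav",  # a short recording of the target voice
    language="en",
    file_path="cloned_output.wav",
)
```

The point isn't this specific library; it's that cloning a voice is now a few lines of code plus a short audio sample.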
As the tech advances, it's getting harder to spot a forgery on first listen. But most AI voices still share some notable quirks and flaws, and learning to spot them is the key to telling whether a recording is real or fake.
Listen for weird pronunciations and pacing
AI models are pretty good at mimicking the sound of a person's voice, to the point where it's tough to tell the difference at times. However, where they still struggle is in replicating the way we speak.
If in doubt, listen closely to the inflections in the speaker's "voice": An AI bot might pronounce a word incorrectly every now and then, in a way that most people wouldn't. Yes, humans mispronounce things all the time, but be on the lookout for mistakes that offer more of a tell. For example, a word like "collages" might come out as co-lah-jez in one sentence and co-lay-ges in the next. You can hear these exact mistakes from Microsoft's VALL-E 2 model, if you click the first section under Audio Samples and listen to the "Clever cats" example.
The pacing might be affected, as well. While AI is getting better at replicating a normal speaking pace, it also takes weird pauses between some words, or races through others in an unnatural way. An AI model might blow past the space between two sentences, which gives it away immediately. (Even a human who can't stop talking doesn't sound that robotic.) When I tested ElevenLabs' free generator, one of the outputs left no space between my first sentence, "Hey, what's up?" and my second, "Thinking about heading to the movies tonight." To be fair, most attempts did include the pause, but be on the lookout for moments like this when deciding whether a piece of audio is legit.
On the flip side, it may take too long to get to the next word or sentence. While AI is getting better at replicating natural pauses and breaths (yes, some generators will now insert "breaths" before speaking), you'll also hear weird pauses between words, as if the bot thinks that's how humans talk. It'd be one thing if these pauses mimicked someone searching for their next word, but they don't sound like that. They sound robotic.
You can hear these pauses in this deepfake audio of President Biden that someone made during the primary earlier this year. In the call, the fake Biden tries to persuade voters not to show up for the primary, and says, "Voting this Tuesday only enables the Republicans in their quest to elect...Donald Trump...again."
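If you want to go beyond your ears, you can measure those gaps directly. Here's a rough sketch using the librosa audio library; the file name and the silence threshold are illustrative guesses, not standards:

```python
# Rough pacing check: measure the silent gaps between speech segments.
# Long, oddly uniform gaps (or none at all between sentences) can be a
# hint of synthetic audio. The 30 dB threshold is a guess, not a standard.
import librosa

y, sr = librosa.load("suspect_clip.wav", sr=None)

# Split the audio into non-silent intervals (anything 30 dB below the
# peak counts as silence).
intervals = librosa.effects.split(y, top_db=30)

# Gaps are the spans between the end of one interval and the start of the next.
gaps = [
    (start - prev_end) / sr
    for (_, prev_end), (start, _) in zip(intervals[:-1], intervals[1:])
]

for i, gap in enumerate(gaps):
    print(f"pause {i}: {gap:.2f}s")
```

A string of long, evenly spaced pauses won't prove anything on its own, but it's another data point.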
There's minimal emotion and variation in the voice
On a similar note, AI voices tend to fall a bit flat. Many have become convincing, but if you listen closely, there's less variation in tone than you'd expect from most human speakers.
It's funny, too: These models can replicate the sound of someone's voice so accurately, yet often miss the mark when it comes to impersonating the speaker's rhythms and emotions. Check out some of the celebrity examples on PlayHT's generator: If you listen to the Danny DeVito example, it's obviously impersonating DeVito's voice, but you don't get the highs and lows of his particular way of speaking. It feels flat. There's some variance here: The bot saying "Ohh, Danny, you're Italian" sounds realistic enough, but soon after, the sentence "I've been to the Leaning Tower of Pisa" doesn't match it, and the last word of the recording, "sandwich," sounds especially off. The Zach Galifianakis recording further down the page has a similar issue: There are some convincing uses of "um" that make it sound casual, but most of the sample lacks emotion and inflection.
Again, things are advancing fast here. Companies like OpenAI are training their models to be more expressive and reactive in their vocal outputs. GPT-4o's Advanced Voice Mode is probably the closest a company has come yet to an all-around convincing AI voice, especially one capable of real-time "conversations." Even so, there are imperfections you can spot if you're listening closely. In the video below, listen to how the bot says "opposite, adjacent, and hypotenuse" (particularly hypotenuse). Here, GPT-4o pauses, the realistic variance drops out, and the voice becomes a bit more robotic as it figures out how to string together those uncommon words.
Now, it's very subtle: The larger tells are probably the pauses it puts in between words, such as the pause before it says "opposite." In fact, the way it slows down "identify" is probably a tell as well, but it is impressive how normal the model makes it seem.
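"Flatness" is also something you can put a rough number on. Here's a sketch that estimates a clip's pitch contour with librosa and measures how much it varies; the file name is a placeholder, and the metric is a crude heuristic rather than a real detector:

```python
# Rough "flatness" check: estimate the pitch contour and see how much it
# varies. A very small spread can hint at monotone, synthetic speech.
import librosa
import numpy as np

y, sr = librosa.load("suspect_clip.wav", sr=None)

# pyin returns a fundamental-frequency estimate per frame (NaN where unvoiced).
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

voiced_f0 = f0[~np.isnan(f0)]
print(f"median pitch: {np.median(voiced_f0):.1f} Hz")
print(f"pitch spread (std dev): {np.std(voiced_f0):.1f} Hz")
```

Keep in mind that plenty of humans speak flatly too, so treat an unusually small spread as one clue among many.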
Is a celebrity or politician saying something ridiculous or provocative?
Spotting AI voices isn't just about identifying flaws in the outputs, especially when it comes to recordings of "celebrities." AI-generated speech attributed to people with power and influence is usually going to be one of two things: silly or provocative. Perhaps someone on the internet wants to make a video of a celebrity saying something funny; perhaps a bad actor wants to convince you a politician said something that pisses you off.
Most people coming across a video of Trump, Biden, and Obama playing video games together aren't actually going to think it's real: It's an obvious joke. But it's not hard to imagine someone looking to throw a wrench in an election generating a fake recording of a political candidate, playing it over a video, and uploading it to TikTok or Instagram. Elon Musk shared one such video on X, featuring a fake recording of Kamala Harris, without disclosing that it was made using AI.
That's not to excuse content that is real: If a candidate says something that calls their fitness for office into question, it's important to take note. But as we enter what is sure to be a divisive election season, being skeptical of these types of recordings is going to be more critical than ever.
Part of the solution here is to take a look at the source of the audio recording: Who posted it? Was it a media organization, or just some random account on Instagram? If it's real, multiple media organizations will likely pick up on it quickly. If an influencer is sharing something that aligns with their point of view without providing a proper source, take a beat before resharing it yourself.
You can try an AI voice detector (but know the limitations)
There are tools out there that advertise themselves as "AI voice detectors," able to spot whether an audio recording was generated using machine learning. PlayHT has one such detector, while ElevenLabs offers a detector specifically for audio generated with the company's own tools.
As with all AI media detectors, however, take these tools with a grain of salt. AI audio detectors use AI to look for signs of generative audio content, such as absent frequencies, lack of breaths, and a robotic timbre (some of which you can listen for yourself). But these AI models are only going to be effective at identifying what they know: If they come up against audio with variables they haven't been trained on, like poor audio quality or excessive background noise, that can throw them for a loop.
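One of those signs, missing high frequencies, is something you can eyeball yourself: Many generators output audio at a fixed sample rate, which can leave the top of the spectrum conspicuously empty. Here's a minimal sketch with librosa and numpy; the 10 kHz cutoff is my own illustrative assumption, not any standard:

```python
# Rough spectral check: compare energy above vs. below a cutoff frequency.
# Near-zero high-band energy in a supposedly high-quality recording is one
# (weak) clue of synthetic audio. The 10 kHz cutoff is an assumption.
import librosa
import numpy as np

y, sr = librosa.load("suspect_clip.wav", sr=None)

# Magnitude spectrogram and the center frequency of each bin.
S = np.abs(librosa.stft(y))
freqs = librosa.fft_frequencies(sr=sr)

cutoff_hz = 10_000
high = S[freqs >= cutoff_hz].sum()
total = S.sum()

print(f"energy above {cutoff_hz} Hz: {100 * high / total:.2f}% of total")
```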
Another problem? These tools are trained on the generative models that exist today, not the ones coming tomorrow. A detector might catch every example listed in this article, but if someone makes a fake Tim Walz recording tomorrow with a brand-new model, it might slip right through.
NPR tested three AI detection tools earlier this year, and found that two of them, AI or Not and AI Voice Detector, were wrong about half the time. The other, Pindrop Security, correctly identified 81 of the 84 sample clips submitted (roughly 96% accuracy), which is impressive.
If you have a recording you aren't sure about, you can give one of these tools a shot. Just understand the limitations of the programs you're using.