Creating an Audio Deepfake With YouTube

Alan Liu
6 min read · Jan 31, 2021

An Easy Intro to Audio Speech Synthesis

Deepfakes are realistic videos or audio clips generated by deep neural networks. Like any technology, they have the potential for both malice and great kindness: a criminal could use your voice to dupe your family members into sending them money, or a doctor could use it to restore the voice of someone who had lost theirs. Today, with the breadth of data available on YouTube and other video-sharing sites, training models of public personas has become increasingly accessible. It’s up to you to use this responsibly.

How

We’ll be creating personalized speech from written text. This will take a few hours of setup and a night or two of training. However, there’s one step we’ll need to take care of first.

English can be complicated. Why is the ‘c’ in ‘cat’ pronounced like a ‘k’, but the ‘c’ in ‘cell phone’ pronounced like an ‘s’? These inconsistencies make English hard for non-native speakers to learn, and they confuse models in much the same way. Luckily for us, there’s a phonetic alphabet called ARPABET that represents the sounds of standard English, along with a pronunciation dictionary (CMUdict) that maps English words to those sounds. We’ll be using this as an intermediate step to teach the model which sounds we’re looking for.

Adding ARPABET notation
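To make this concrete, here’s a minimal sketch of the word-to-phoneme step using NLTK’s copy of the CMU Pronouncing Dictionary, which is written in ARPABET. The `to_arpabet` helper is purely illustrative (not part of any library); a real pipeline would also handle punctuation, casing, and out-of-vocabulary words more carefully.

```python
# A minimal sketch of the ARPABET conversion step, using NLTK's copy of
# the CMU Pronouncing Dictionary (whose entries are written in ARPABET).
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict")  # one-time download of the dictionary
pronunciations = cmudict.dict()  # word -> list of possible pronunciations

def to_arpabet(text):
    """Map each word to its first listed ARPABET pronunciation (illustrative helper)."""
    phonemes = []
    for word in text.lower().split():
        if word in pronunciations:
            phonemes.extend(pronunciations[word][0])
        else:
            phonemes.append(word)  # fall back to the raw word if unknown
    return phonemes

print(to_arpabet("cat"))         # ['K', 'AE1', 'T']
print(to_arpabet("cell phone"))  # ['S', 'EH1', 'L', 'F', 'OW1', 'N']
```

Notice how the dictionary resolves the ‘c’ ambiguity for us: ‘cat’ starts with the phoneme K, while ‘cell’ starts with S.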

Afterward, we’ll be using Tacotron2 + WaveGlow (2018) as our model architecture to generate the speech. It’s a few years old, but it’s still one of the best publicly available speech synthesis solutions right now.

Tacotron2 predicts a visual representation of speech called a mel-spectrogram from the input text: a blueprint of which frequencies are present at each moment in time. WaveGlow then takes that mel-spectrogram and synthesizes the audio waveform from it.
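To see the two stages in action before any custom training, here’s a rough inference sketch using the pretrained checkpoints NVIDIA publishes on torch.hub. This assumes a CUDA-capable machine, and the entry-point names (`nvidia_tacotron2`, `nvidia_waveglow`, `nvidia_tts_utils`) follow NVIDIA’s published example, which may change over time.

```python
# Rough two-stage inference sketch using NVIDIA's pretrained torch.hub
# checkpoints (assumes a CUDA GPU; entry-point names follow NVIDIA's example).
import torch
from scipy.io.wavfile import write

hub = "NVIDIA/DeepLearningExamples:torchhub"
tacotron2 = torch.hub.load(hub, "nvidia_tacotron2").cuda().eval()
waveglow = torch.hub.load(hub, "nvidia_waveglow").cuda().eval()
utils = torch.hub.load(hub, "nvidia_tts_utils")

# Text -> padded character-sequence tensors.
sequences, lengths = utils.prepare_input_sequence(["Hello, this is a test."])

with torch.no_grad():
    # Stage 1: Tacotron2 predicts the mel-spectrogram from text.
    mel, _, _ = tacotron2.infer(sequences, lengths)
    # Stage 2: WaveGlow synthesizes the waveform from the mel-spectrogram.
    audio = waveglow.infer(mel)

# These checkpoints are trained on 22,050 Hz audio.
write("output.wav", 22050, audio[0].cpu().numpy())
```

This uses the stock pretrained voice; the rest of the article is about swapping in your own data so the same pipeline speaks in a target voice.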

We can train both Tacotron2 and WaveGlow, but training is rather expensive (since it requires GPUs…
