Creating an Audio Deepfake With YouTube

Alan Liu
6 min read · Jan 31, 2021

An Easy Intro to Audio Speech Synthesis

Deepfakes are realistic video or audio generated by a deep neural network. Like any technology, they can be used for both harm and good. A criminal could clone your voice to dupe your family members into sending money, while a doctor could use voice recordings to give a voice back to someone who had lost theirs. Today, with so much data available on YouTube and other video-sharing sites, training a model of a public figure's voice has become increasingly accessible. It’s up to you to use this responsibly.

How

We’ll be creating personalized speech from written text. Expect a few hours of setup and a night or two of training. However, there’s one step we’ll need to take care of first.

English can be complicated. Why is the ‘c’ in ‘cat’ pronounced like a ‘k’, but the ‘c’ in ‘cell phone’ pronounced like an ‘s’? These inconsistencies make English hard for non-native speakers to learn, and they make it hard for models to learn as well. Luckily for us, there’s…
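The ‘cat’ versus ‘cell’ inconsistency can be made concrete with a minimal sketch. It assumes a tiny hand-written lexicon (`TOY_LEXICON`, a hypothetical name) that maps spellings to ARPAbet-style phoneme symbols. The helper functions are illustrative only and are not taken from any particular speech library.

```python
# Toy lexicon written by hand for illustration (an assumption, not a real
# dataset): each spelling maps to a list of ARPAbet-style phoneme symbols.
TOY_LEXICON = {
    "cat": ["K", "AE", "T"],
    "cell": ["S", "EH", "L"],
}


def first_sound(word: str) -> str:
    """Return the first phoneme of a word found in the toy lexicon."""
    return TOY_LEXICON[word.lower()][0]


def same_letter_different_sound(a: str, b: str) -> bool:
    """True when two words start with the same letter but a different sound."""
    return a[0].lower() == b[0].lower() and first_sound(a) != first_sound(b)


if __name__ == "__main__":
    for word in ("cat", "cell"):
        print(word, "->", " ".join(TOY_LEXICON[word]))
    print("Same letter, different sound:",
          same_letter_different_sound("cat", "cell"))
```

Both words start with the letter ‘c’, yet their phoneme sequences start with different symbols. A model that only sees the letters has to learn that mismatch on its own.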

Written by Alan Liu

CEO/Cofounder @ Health Harbor | Formerly Nuro/Facebook/Google | Yale ’18 | alanliu.dev
