Spotify's New AI Voice Translation Tool Powered By OpenAI's Whisper

Spotify makes AI voice clones of podcasters and uses them to speak other languages / Manipulation by Chanuka Nadun Perera

Spotify recently announced its AI-powered voice translation feature. This will be a game-changer for an avid podcaster and daily podcast listener like me.

This tool takes English-language episodes and seamlessly transforms them into various languages, starting with Spanish translations and soon expanding to include French and German translations.

Imagine tuning in to your favorite podcast and experiencing it in a language you understand without losing the essence of the podcaster's voice.

Not only that, it works both ways: it can translate other languages into English as well. This feature will open the door to a whole other universe of podcasts.

Podcasting on Spotify: does it matter?

Spotify, the music streaming giant, aims to reach 1 billion users and $100 billion in annual revenue by 2030.

But what's driving this growth? Part of their strategy involves a significant investment in podcasts and audiobooks, which they expect to yield high-margin returns.

Remarkably, Spotify has already claimed the title of the most-used audio podcast platform globally. Not stopping there, they've also become the No. 1 podcast publisher in the United States, a testament to their influence in the podcasting landscape.

Spotify is free to listen to; anyone with a free Spotify account can enjoy a vast library of podcasts. However, downloading podcasts for offline listening and saving songs and albums requires a Premium account, and that's totally OK.

Spotify already has a collection of over 100 million songs. But that's not all; they also offer more than 5 million podcast titles and 350,000+ audiobooks.

The numbers behind Spotify's success are staggering. They currently have a whopping 551 million users worldwide, with an impressive 220 million of them being premium subscribers spread across 184 regions.

Spotify's podcasting journey began three years ago, and the growth has been extraordinary. They started with a database of 180,000 episodes, and today their platform hosts over 5 million podcasts and 100,000 video podcasts. That's a growth rate of over 1500%. To put this into perspective, in 2021 alone, Spotify added 1.2 million podcasts to its already extensive library.
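
For anyone who wants to check the growth figure, here is a quick back-of-the-envelope sketch. It takes the two numbers above at face value and assumes they are comparable counts (one is quoted as episodes and the other as podcasts, so treat the result as rough). The function name is just for illustration.

```python
def growth_rate_percent(start: float, end: float) -> float:
    """Return the percentage growth from `start` to `end`."""
    if start <= 0:
        raise ValueError("start must be positive")
    return (end - start) / start * 100.0


# Figures quoted in this article; treating them as comparable is an assumption.
start_count = 180_000
current_count = 5_000_000

rate = growth_rate_percent(start_count, current_count)
print(f"{rate:.0f}%")  # prints "2678%"
```

On those numbers the growth comes out at roughly 2,700%, which comfortably clears the "over 1500%" mark.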

Spotify already faces competition from a few of the tech giants, and it is a bit late to the AI speech game. For example, YouTube has rolled out its AI-powered dubbing service, changing how we engage with videos. Meanwhile, Meta, the company behind Facebook, is making waves with its speech-generation model, Voicebox.

Spotify Podcast AI Translation Feature: Pilot Program and Testing Phase

Voice-translated episodes from pilot creators will be available worldwide to Premium and Free users. / Spotify

Currently in the testing phase, this exciting feature is making its debut with the help of some trailblazing content creators, aka pilot creators.

The best part? It's not just for the select few; these translated episodes will be accessible to both Premium and Free users around the world.

Starting with Spanish translations and with French and German translations following suit, this is a game-changer for podcast enthusiasts.

Imagine tuning into the popular Lex Fridman Podcast and being able to enjoy an "Interview with Yuval Noah Harari" in your native language.


Or perhaps catching up with Armchair Expert, where "Kristen Bell, by the grace of god, returns" in a language you're comfortable with.

And don't miss out on The Diary of a CEO with Steven Bartlett, featuring an "Interview with Dr. Mindy Pelz". Now you can experience it all in your preferred language, thanks to the magic of AI-powered voice translations.

Podcasters like Dax Shepard, Monica Padman, Lex Fridman, Bill Simmons, and Steven Bartlett have played a crucial role in bringing this innovation to life.
Here’s one of Spotify’s translations on an episode of Fridman’s show:

The Technology Behind the Magic of Spotify Podcast Translation

OpenAI's remarkable voice transcription tool, Whisper, is at the heart of this groundbreaking feature. This tool not only accurately transcribes English speech but also achieves the remarkable task of translating foreign languages into English.

OpenAI's voice transcription tool, Whisper

Whisper is an advanced automatic speech recognition (ASR) system developed by OpenAI.

Trained on a whopping 680,000 hours of multilingual and multitask supervised data, it's a powerhouse when it comes to understanding spoken language. Whisper has the ability to perform transcription in multiple languages.

At the heart of Whisper lies an encoder-decoder Transformer architecture. The model splits input audio into manageable 30-second segments, converts each one into a log-Mel spectrogram, and passes it through an encoder.
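
To make the 30-second segmenting step a little more concrete, here is a minimal sketch in plain Python. The 16 kHz sample rate and 10 ms frame hop come from the published Whisper paper rather than from Spotify's announcement, and the helper name `split_into_chunks` is made up for this example. It also uses simple fixed, non-overlapping windows, which is a simplification of what the real model does.

```python
from typing import List

SAMPLE_RATE = 16_000      # Hz; assumption taken from the Whisper paper
CHUNK_SECONDS = 30        # segment length described above
HOP_LENGTH = 160          # samples between spectrogram frames (10 ms at 16 kHz)

CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS      # 480,000 samples per segment
FRAMES_PER_CHUNK = CHUNK_SAMPLES // HOP_LENGTH   # 3,000 spectrogram frames


def split_into_chunks(samples: List[float]) -> List[List[float]]:
    """Split raw audio samples into 30-second chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        # Pad the final, shorter chunk with silence so every chunk has equal length.
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


# 70 seconds of silence stands in for a real recording.
audio = [0.0] * (SAMPLE_RATE * 70)
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[0]), FRAMES_PER_CHUNK)
```

Each padded chunk would then be turned into a log-Mel spectrogram before reaching the encoder; that step is left out here to keep the sketch short.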

The decoder is then trained to predict text captions and is equipped with specialized tokens for various tasks like language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
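
Those special tokens are easier to picture with an example. The sketch below builds the kind of token sequence the decoder is given before it starts predicting text. The token strings follow the format used in the open-source Whisper release, but the helper `build_decoder_prompt` is hypothetical and the real tokenizer works with integer IDs rather than strings.

```python
from typing import List


def build_decoder_prompt(language: str, task: str, timestamps: bool = False) -> List[str]:
    """Build an illustrative Whisper-style decoder prompt as a list of token strings."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prompt = [
        "<|startoftranscript|>",
        f"<|{language}|>",   # language of the spoken audio
        f"<|{task}|>",       # same-language transcript, or translation into English
    ]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt


# Spanish audio, translated into English text, without timestamps.
print(build_decoder_prompt("es", "translate"))
```

Swapping `"translate"` for `"transcribe"` asks for a Spanish transcript instead, which is how a single model can cover several tasks.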

While it may not outperform specialized models on benchmarks like LibriSpeech, Whisper truly shines in zero-shot performance, making a remarkable 50% fewer errors than those models across diverse datasets.
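
What does "50% fewer errors" look like in practice? Speech recognition errors are usually counted with word error rate (WER), so here is a small sketch of that metric. Using WER is an assumption on my part, and the sentences are invented purely for illustration; they are not real model outputs.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
model_a = "the quick brown fox jumped over a lazy dog"   # two wrong words
model_b = "the quick brown fox jumps over a lazy dog"    # one wrong word

wer_a = word_error_rate(reference, model_a)
wer_b = word_error_rate(reference, model_b)
print(f"{wer_a:.3f} {wer_b:.3f}")  # prints "0.222 0.111"
```

In this toy case the second transcript makes half as many errors as the first, which is the kind of gap the 50% figure describes.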

Whisper proves to be incredibly versatile in handling various language-related tasks. It can seamlessly transcribe multiple languages, translate spoken content, and accurately detect languages. This makes Whisper an essential and highly valuable tool for a wide range of language-related assignments.

Whisper can even be adapted for audio classification tasks, extending its usefulness beyond transcription and translation.

Compared with Google's Chirp, Whisper is often described as the more accurate option and is frequently the preferred choice, particularly for cost-effective solutions.

And if you need to transcribe larger audio files, consider the Whisper model offered through Azure AI Speech batch transcription, which can handle files up to 1GB, surpassing the 25MB limit of the standard Whisper API.

Whisper is also open source, so you can run it for free and offline for AI audio transcription, with no time limit. So, whether you're dealing with large audio files or need accurate speech recognition and translation, Whisper, which can use Nvidia CUDA GPUs for acceleration, has got you covered.

Spotify's Vice President of Personalization, Ziad Sultan, emphasizes the power of Voice Translation in connecting podcasters with a global audience authentically.

"By matching the creator's own voice, Voice Translation gives listeners around the world the power to discover and be inspired by new podcasters in a more authentic way than ever before."

How does OpenAI fit into this innovative equation? OpenAI, known for its cutting-edge AI advancements, is likely the driving force behind the voice replication aspect of this new feature.


Alongside Whisper, OpenAI has recently unveiled a new text-to-speech model capable of creating human-like audio from just text and a few seconds of sample speech.

However, OpenAI is proceeding cautiously with this new technology because there are already huge concerns about safety and privacy.

So, for now, Spotify limits the tool's availability to a selected group. The company has yet to disclose the full extent of its plans for the availability or expansion of its AI-powered podcast translation tool.