Ethical Hacking News

A New Era in Voice Cloning: Zyphra's Breakthrough AI Model

Zyphra's latest TTS model, Zonos, has the potential to clone your voice with just five seconds of audio, making it one of the most impressive and alarming advancements in voice cloning history. With great power comes great responsibility, but this technology also holds promise for benevolent uses, such as helping those with speech disorders or accessibility needs.

Zyphra's Zonos TTS model can clone your voice with just five seconds of audio.

The model was trained on over 200,000 hours of speech data from various languages.

Zonos has two models: a fully transformer-based architecture and a hybrid model combining transformer and Mamba SSM architectures.

The model weights are available under an Apache 2.0 license on Hugging Face.

The technology has both beneficial and harmful applications, from enhancing accessibility for those with speech disorders to scamming victims.

Palo Alto-based AI startup Zyphra recently announced its latest text-to-speech (TTS) model, Zonos. The technology has the potential to clone your voice with just five seconds of audio, making it one of the most impressive and alarming advancements in voice cloning history.

Zyphra previously released its Zamba family of small language models and optimizations such as tree attention, and has now followed them with its Zonos TTS models. The Zonos models are capable of producing realistic results with less than half a minute of recorded speech, making them a formidable tool for various applications.

Zyphra's Zonos TTS models were trained on over 200,000 hours of speech data, which includes both neutral-toned speech such as audiobook narration and "highly expressive" speech. The majority of this data was in English, but there were also substantial quantities of Chinese, Japanese, French, Spanish, and German.

The result is actually two Zonos models: one that uses a fully transformer-based architecture and a hybrid that combines transformer and Mamba state space model (SSM) architectures. The latter, Zyphra claims, is the first TTS model to use this architecture.

From a practical standpoint, both models behave similarly to other text-to-speech models. However, unlike those developed by ElevenLabs and others, Zyphra has elected to release its model weights on Hugging Face under a permissive Apache 2.0 license.

In an effort to test the capabilities of Zonos, we spun up the demo locally on an Nvidia RTX 6000 Ada Generation graphics card. We then uploaded 20- to 30-second clips of ourselves reading a random passage of text and fed that into the Zonos-v0.1 transformer and hybrid models along with a 50 or so word text prompt, leaving all hyperparameters to their defaults.

Using a 24-second sample clip, we were able to achieve a voice clone good enough to fool close friends and family, at least at first blush. After we revealed that the clip was AI generated, they noted that the pacing and speed of the speech felt a little off, and said they believed they would have caught on that the audio wasn't authentic given a longer clip.
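Since clip length matters here, it may help to check a reference recording before uploading it. The following is a minimal sketch, not part of Zonos itself: it assumes the clip is an uncompressed WAV file, and the function names and the 5- and 30-second bounds are illustrative choices drawn from the figures mentioned in this article.

```python
import wave

# Illustrative bounds taken from the article: about five seconds at minimum,
# and the 20- to 30-second clips used in the test described above.
MIN_SECONDS = 5.0
MAX_SECONDS = 30.0


def clip_duration_seconds(path: str) -> float:
    """Return the length of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str,
                        min_seconds: float = MIN_SECONDS,
                        max_seconds: float = MAX_SECONDS) -> bool:
    """Check that a clip's length falls inside the (assumed) useful range."""
    duration = clip_duration_seconds(path)
    return min_seconds <= duration <= max_seconds
```

Calling is_usable_reference("my_voice.wav") on a 24-second recording would return True under these assumed bounds; compressed formats such as MP3 would need a different reader.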

Zyphra offers a demo environment where you can play with its Zonos models, along with paid API access and subscription plans on their website. However, if you're hesitant to upload your voice to a random startup's servers, getting the model running locally is relatively easy.

To get started, we'll use git to pull down the Zonos repo:

git clone https://github.com/Zyphra/Zonos.git

From there, we'll navigate into the folder and spin up the container using Docker Compose:

cd Zonos
docker compose up

After a few seconds, you should be able to access the Gradio web GUI by navigating to http://localhost:7860. If you're running this remotely, swap localhost for the machine's IP address or hostname.
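If the page doesn't load right away, the container may still be starting. Here is a small sketch that waits for the web GUI's port to accept connections and builds the address to open. The helper names are hypothetical, and the only detail taken from the steps above is port 7860; the timeout values are arbitrary.

```python
import socket
import time


def gradio_url(host: str = "localhost", port: int = 7860) -> str:
    """Build the address of the web GUI for a local or remote host."""
    return f"http://{host}:{port}"


def wait_for_port(host: str = "localhost", port: int = 7860,
                  timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

For a remote machine, pass its IP address or hostname as host to both functions. Note that an open port only shows that something is listening, not that the model has finished loading.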

Once you've got everything dialed in, click on Generate Audio. Depending on your hardware and the length of your input text, this could take anywhere from a few seconds to minutes. Once complete, the clip should begin playing automatically.

However, with great power comes great responsibility. The voice cloning capabilities presented by Zonos are inherently controversial, from where the training data was mined to how they're actually used in practice. Considering just how little sample audio is required to achieve a passable result, it's easy to see how this technology could be abused.

Companies like Audible are exploring text-to-speech AI to expand audiobook production, allowing narrators to create AI-generated voice clones of themselves. Meanwhile, legal challenges surrounding AI voice cloning are already hitting similar businesses.

We can also see this technology being used to scam unsuspecting victims into believing that a loved one is in trouble and just needs a few hundred dollars' worth of gift cards to get out of a bind. It could be used to ruin someone's career by placing an abusive call to their boss in their voice, or to generate fake political messages. The examples are endless.

Having said that, there are also benevolent uses for these kinds of models. From an accessibility standpoint, voice cloning and text-to-speech could help someone who has suffered trauma to their vocal cords, or has conditions affecting speech, get their voice back. In fact, this is one of the reasons that Apple gave to justify the inclusion of voice cloning tech in iOS in late 2023.

The fact that this technology is already widely available — whether on iDevices or through paid services or as open source models — is why we're even comfortable demonstrating how to deploy and run Zonos locally in the first place.

With that said, if you do choose to embrace AI text-to-voice capabilities, we encourage you to do so in the most respectful and responsible way possible.

Related Information:

https://www.theregister.com/2025/02/16/ai_voice_clone/

https://github.com/Zyphra/Zonos

Published: Sun Feb 16 20:18:23 2025 by llama3.2 3B Q4_K_M