Voice "Cloning" is Style Transfer

Kaitlyn Zhou, Federico Bianchi, Martijn Bartelds, Anna Pot, Yongchan Kwon, James Zou

Cornell University  ·  Together AI  ·  Stanford University


Voice cloning is marketed as a faithful copy of someone's voice. Across three widely used systems — ElevenLabs V3, Coqui-XTTS, and Chatterbox — we find that cloning is closer to style transfer: it systematically reshapes voices to sound warmer, more authoritative, more native-English, and more "humanlike" than the originals — and listeners trust the cloned voices more.

Study pipeline: 86 non-native English speakers record audio, three voice cloning models generate clones, 177 annotators rate paired clips.
We recorded 86 non-native English speakers reading the Grandfather Passage, cloned each recording with three TTS systems, and had 177 monolingual English annotators rate paired source and cloned clips on a 1–5 Likert scale across seven dimensions. Source and cloned clips were shuffled within each session so that annotators could not tell which ones were generated.

Listen for yourself

Each pair below contains a source recording from a non-native English speaker and a clone generated from that speaker's voice. Pay attention to accent.

Example 1: Source (human) · Cloned (Chatterbox)
Example 2: Source (human) · Cloned (Chatterbox)
Example 3: Source (human) · Cloned (Chatterbox)

Five things we found

The x-axis is the standard 1–5 Likert response scale. E.g., a +20% shift on customer-service-likeness moves the average from roughly "slightly" to between "slightly" and "moderately".

Scale: 1 = Not at all · 2 = Slightly · 3 = Moderately · 4 = Quite a bit · 5 = Extremely
Dumbbell chart showing cloned voices rated higher than source voices across seven Likert dimensions.
Figure 1. Paired ratings, aggregated across ElevenLabs V3, Coqui-XTTS, and Chatterbox. Every dimension shifts in the same direction: clones sound more authoritative, warmer, more native, more trustworthy, and more humanlike. All differences are significant at the 95% confidence level via permutation test.
Cloned voices are rated more humanlike than the humans they were cloned from — a speech analogue of the hyperrealism seen in AI-generated faces (Miller et al., 2023).
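
For readers who want to run a similar check on paired ratings, here is a minimal sketch of a two-sided sign-flip permutation test. The page only says a permutation test was used, so the sign-flip scheme, the function name `paired_permutation_test`, and the toy ratings are assumptions for illustration, not the study's analysis code or data.

```python
import numpy as np

def paired_permutation_test(source, cloned, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired ratings.

    Returns (observed mean difference, p-value).
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(cloned, dtype=float) - np.asarray(source, dtype=float)
    observed = diffs.mean()
    # Under the null, the source/clone labels within a pair are exchangeable,
    # so the sign of each paired difference can be flipped independently.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    null_means = (signs * diffs).mean(axis=1)
    # Add-one correction keeps the estimated p-value strictly positive.
    extreme = np.sum(np.abs(null_means) >= abs(observed))
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value

# Made-up 1-5 Likert ratings for illustration only (not data from the study).
source_ratings = [2, 3, 2, 1, 3, 2, 2, 3, 1, 2]
cloned_ratings = [3, 4, 3, 2, 3, 3, 4, 4, 2, 3]

diff, p = paired_permutation_test(source_ratings, cloned_ratings)
print(f"mean shift = {diff:+.2f}, p = {p:.4f}")
```

A real analysis would likely also need to account for repeated ratings from the same annotator, which this sketch ignores.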

Why this matters

The well-documented harms of voice cloning — non-consensual impersonation, voice-phishing scams, fraud against family members and businesses — are real and growing, and nothing in this work should be read as discounting them. Our findings sit alongside that record, not in place of it: even when a clone is generated with a speaker's consent for an identity-preserving use, it still doesn't sound like them.

For users who want to preserve their voice

Voice cloning is increasingly used in identity-preserving applications — finishing a recording, dubbing across languages, or restoring speech for someone who has lost theirs. In all of these settings, fidelity is the whole point. If the clone quietly removes a speaker's accent or smooths their cadence into a customer-service register, the technology is failing at its stated job.

For everyone who hears a cloned voice

Listeners trust cloned voices more than the originals, and report being more willing to disclose personal information to them. As synthetic voices show up in customer service, claims processing, and increasingly in scam calls, that asymmetry is a behavioral safety story, not just an aesthetic one.

For cultural diversity

The homogenization isn't random — it points at a specific prototype: fluent, "Standard" English, often Anglophone. The Sankey diagram below shows source accents on the left and the classifier's read of the cloned voice on the right. Coqui-XTTS pulls every speaker into one of the five Anglophone varieties (US, UK, Canadian, Australian, or New Zealand English).

Three side-by-side flow diagrams showing speakers reclassified from non-native to Anglophone English after cloning. Coqui-XTTS reassigns 100% of non-native speakers; Chatterbox and ElevenLabs each reassign 78%.
Figure 2. Source accent (left bar) vs. CommonAccent classification of the cloned audio (right bar). Red ribbons trace speakers whose accent gets reassigned from a non-native variety to an Anglophone one. Coqui-XTTS reassigns 100% of non-native speakers; Chatterbox and ElevenLabs each reassign 78%.
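
The reassignment percentages in Figure 2 are a simple ratio: among non-native source speakers, the share whose clone is classified as an Anglophone variety. A minimal sketch follows; the label strings and the `reassignment_rate` helper are hypothetical and do not reproduce CommonAccent's actual label set or the study's data.

```python
# Hypothetical label strings for the five Anglophone varieties named above.
ANGLOPHONE = {"us", "uk", "canada", "australia", "newzealand"}

def reassignment_rate(source_accents, cloned_labels):
    """Share of non-native source speakers whose clone is labeled Anglophone."""
    pairs = [
        (src, cloned)
        for src, cloned in zip(source_accents, cloned_labels)
        if src not in ANGLOPHONE
    ]
    if not pairs:
        raise ValueError("no non-native source speakers")
    moved = sum(1 for _, cloned in pairs if cloned in ANGLOPHONE)
    return moved / len(pairs)

# Toy labels for illustration only (not data from the study).
source = ["spanish", "hindi", "mandarin", "german", "french"]
cloned = ["us", "us", "mandarin", "uk", "us"]

print(f"{reassignment_rate(source, cloned):.0%}")  # 80%
```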

What happens if you clone a clone?

We took each speaker's audio and ran 50 rounds of iterative cloning with Chatterbox. If cloning were faithful, the recordings should hover near the source. Instead, embeddings drift in a consistent direction, then converge into a single cluster — pitch climbs, similarity to the original collapses, and male and female speakers end up indistinguishable in embedding space.

Figure 3. 50 rounds of iterative cloning with Chatterbox, visualized as a PCA of ECAPA-TDNN speaker embeddings. Each frame is one round of recursive cloning. At round 0 (the source recordings), male (blue) and female (red) speakers form clearly separated clouds. By round 50, the two clouds have collapsed into a single cluster — the model is converging to a fixed point in embedding space.
Three side-by-side PCA scatter plots at rounds 1, 25, and 50. Round 1 shows two clearly separated male and female clusters. Round 25 shows overlap beginning. Round 50 shows a single merged cluster.
Figure 3b. Static snapshots from the animation above — rounds 1, 25, and 50, same PCA basis.
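
The bookkeeping behind this kind of plot can be sketched as follows: apply a cloning step repeatedly, keep the embedding from every round, then project all rounds onto one shared PCA basis. The sketch below is an assumption-heavy toy. `toy_clone_step` simply pulls an embedding toward a hypothetical attractor, so the convergence is built in by construction and is not evidence of anything. In a real pipeline that step would synthesize audio with the cloning model and re-embed it with a speaker encoder such as ECAPA-TDNN.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def iterate_cloning(source_emb, clone_step, rounds=50):
    """Apply clone_step repeatedly; return embeddings for rounds 0..rounds."""
    trajectory = [source_emb]
    for _ in range(rounds):
        trajectory.append(clone_step(trajectory[-1]))
    return np.stack(trajectory)

def pca_2d(points):
    """Project rows onto their top two principal components via SVD."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy stand-in for "clone the audio, then re-embed it" (hypothetical).
rng = np.random.default_rng(0)
dim = 16
attractor = rng.normal(size=dim)

def toy_clone_step(emb, pull=0.1):
    return (1 - pull) * emb + pull * attractor

speakers = rng.normal(size=(6, dim))  # six made-up source embeddings
trajs = np.stack([iterate_cloning(s, toy_clone_step) for s in speakers])

checkpoints = (0, 25, 50)
# Similarity of speaker 0 to their own source embedding at each checkpoint.
sim_to_source = [cosine(trajs[0, 0], trajs[0, r]) for r in checkpoints]
# Average per-dimension spread across speakers at each checkpoint.
spread = [float(trajs[:, r].std(axis=0).mean()) for r in checkpoints]
# One PCA basis fitted on all speakers and all rounds, so rounds are comparable.
coords = pca_2d(trajs.reshape(-1, dim))

print("similarity to source:", [round(s, 3) for s in sim_to_source])
print("speaker spread:", [round(s, 3) for s in spread])
```

Fitting the PCA on all rounds at once, rather than per round, is what lets snapshots from different rounds share the same axes.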

How we did it

We recruited 86 non-native English speakers via Prolific (sex-balanced, ages 19–64, self-reported accent strength 0–10) and asked them to read the Grandfather Passage, a nine-sentence standard text used in speech assessment. We split each recording into sentence-level clips, quality-checked them, and used cross-sentence cloning: the model gets sentence ℓ as reference and is asked to produce sentence ℓ+1, so it must extract generalizable speaker features rather than copy phonetic content.
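
The cross-sentence pairing can be sketched in a few lines. This is an illustration under assumptions: the record fields, file names, and placeholder text are hypothetical, and the sketch simply leaves the final sentence without a target, since the page does not say how it was handled.

```python
def cross_sentence_pairs(clips):
    """Pair the audio of sentence l (reference) with the text of sentence l+1 (target)."""
    return [
        {"reference_audio": clips[i]["audio"], "target_text": clips[i + 1]["text"]}
        for i in range(len(clips) - 1)
    ]

# Placeholder clip records for one speaker; names and text are hypothetical.
clips = [
    {"audio": f"speaker01_s{i}.wav", "text": f"<text of sentence {i}>"}
    for i in range(1, 10)
]

pairs = cross_sentence_pairs(clips)
print(len(pairs))  # 8
print(pairs[0])
```

Because the reference audio never contains the words being synthesized, any resemblance between clone and source has to come from speaker characteristics rather than copied phonetic content.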

Annotators were monolingual US English speakers recruited via Prolific. Each session shuffled 10 source and 10 cloned clips from one model and one speaker-sex, so annotators were blind to which clips were human. We collected 4,000 paired annotations from 177 annotators.
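
A blinded session of this shape could be assembled roughly as below. This is a sketch, not the study's tooling: `build_session`, the clip identifiers, and the choice to sample source and cloned clips independently are all assumptions, and the page does not state whether the 10 source and 10 cloned clips in a session cover the same sentences.

```python
import random

def build_session(source_clips, cloned_clips, n_each=10, seed=None):
    """Sample n_each source and n_each cloned clips and shuffle them together.

    Returns (presented, answer_key): the clips in presentation order, and the
    matching hidden labels that annotators never see.
    """
    rng = random.Random(seed)
    items = [(clip, "source") for clip in rng.sample(source_clips, n_each)]
    items += [(clip, "cloned") for clip in rng.sample(cloned_clips, n_each)]
    rng.shuffle(items)
    presented = [clip for clip, _ in items]
    answer_key = [label for _, label in items]
    return presented, answer_key

# Hypothetical clip identifiers for one model and one speaker sex.
source_pool = [f"src_{i:03d}.wav" for i in range(40)]
cloned_pool = [f"clone_{i:03d}.wav" for i in range(40)]

presented, answer_key = build_session(source_pool, cloned_pool, seed=7)
print(len(presented), answer_key.count("source"), answer_key.count("cloned"))
```

In practice the file names shown to annotators would also need to be anonymized, since the toy identifiers above reveal the label.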

Paper: https://arxiv.org/abs/2605.16578
Code: github.com/kzhou-cloud/voice-cloning-public
Dataset: huggingface.co/datasets/kzhou/voice_cloning_style_transfer

Cite this work

@article{zhou2026voicecloning,
  title  = {Voice "Cloning" is Style Transfer},
  author = {Zhou, Kaitlyn and Bianchi, Federico and Bartelds, Martijn and Pot, Anna and Kwon, Yongchan and Zou, James},
  year   = {2026},
  note   = {Preprint}
}