Abstract: Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotional expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion–semantic guided latent modulation, relation-preserving emotional–semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.
Codec Reconstruction Samples
| | Original | Encodec | Llasa | Ours |
|---|---|---|---|---|
| Sample 1 | | | | |
| Sample 2 | | | | |
| Sample 3 | | | | |
| Sample 4 | | | | |
| Sample 5 | | | | |
| Sample 6 | | | | |
| Sample 7 | | | | |
| Sample 8 | | | | |
| Sample 9 | | | | |
| Sample 10 | | | | |
| Sample 11 | | | | |
| Sample 12 | | | | |
Zero-shot TTS Generation Samples
Sample 1 (Emotion: Angry)
Text: No, I burst the balloon!
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |
Sample 2 (Emotion: Surprise)
Text: The football teams give a tea party.
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |
Sample 3 (Emotion: Happy)
Text: That I owe my thanks to you.
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |
Sample 4 (Emotion: Neutral)
Text: Poor Tom now is dead.
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |
Sample 6 (Emotion: Sad)
Text: Must a name mean something?
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |
Sample 7 (Emotion: Angry)
Text: You are not a runaway, who are you?
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |
Sample 8 (Emotion: Sad)
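Text: 爱上一个人的重要标志就是,遇上任何美景都在遗憾,为何你不在身边. (English: A sure sign of having fallen in love with someone is that at every beautiful sight you feel regret: why are you not by my side?)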
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |
Sample 9 (Emotion: Happy)
Text: 司马鹰扬笑了,笑得有点儿得意. (English: Sima Yingyang smiled, a little smugly.)
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |
Sample 10 (Emotion: Angry)
Text: 受到处罚你可不能怨别人,知道吗,臭小子! (English: If you get punished, you can't blame anyone else, understand, you brat!)
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |
Sample 11 (Emotion: Surprise)
Text: 哈?谁要把它送给我,长大了我就嫁给他! (English: Huh? Whoever gives it to me, I'll marry him when I grow up!)
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |
Sample 12 (Emotion: Neutral)
Text: 米琦,你出来,我有话要跟你说. (English: Miqi, come out, I have something to tell you.)
| | Target | F5-TTS | cosyvoice2 | Ours |
|---|---|---|---|---|
| Samples | | | | |