Abstract: Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotional expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion–semantic guided latent modulation, relation-preserving emotional–semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.

Overview of the proposed emotion-guided neural speech codec. The codec encodes input speech into discrete acoustic representations via residual vector quantization (RVQ) and incorporates emotion- and semantic-aware mechanisms to preserve emotional expressiveness. Specifically, it integrates (ii) emotion-guided latent modulation, which injects affective and semantic cues into acoustic latents prior to quantization, (iii) relation-preserving distillation, which constrains discrete representations to retain relational structure from emotion and semantic spaces, and (iv) emotion-weighted semantic alignment, which aligns quantized tokens with textual semantics while emphasizing emotionally salient regions to maintain semantic fidelity and prosodic naturalness.
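
The residual vector quantization (RVQ) step mentioned in the overview can be illustrated with a minimal sketch. This is a generic toy version of RVQ, not the codec's actual implementation: the function names (`nearest`, `rvq_encode`, `rvq_decode`), the two-stage codebooks, and the 2-dimensional latent are all hypothetical and chosen only to keep the arithmetic easy to follow. Each stage picks the codeword nearest to whatever the previous stages failed to capture, and decoding sums the selected codewords.

```python
def nearest(codebook, vec):
    """Return (index, codeword) of the entry closest to vec in squared L2 distance."""
    idx = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return idx, codebook[idx]


def rvq_encode(latent, codebooks):
    """Quantize stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(latent)
    indices = []
    for codebook in codebooks:
        idx, code = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, code)]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the latent as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse first stage and a finer second stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
]
latent = [1.4, 0.6]

tokens = rvq_encode(latent, codebooks)  # one discrete token per stage
recon = rvq_decode(tokens, codebooks)   # approximate reconstruction of the latent
print(tokens, recon)
```

In this toy setting the first stage captures the coarse position of the latent and the second stage refines it. A real codec learns its codebooks and applies them frame by frame to encoder outputs.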

Codec Reconstruction Samples


Original EnCodec Llasa Ours
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Sample 8
Sample 9
Sample 10
Sample 11
Sample 12

Zero-shot TTS Generation Samples


Sample 1 (Emotion: Angry)
Text: No, I burst the balloon!
Target F5-TTS CosyVoice2 Ours
Samples


Sample 2 (Emotion: Surprise)
Text: The football teams give a tea party.
Target F5-TTS CosyVoice2 Ours
Samples


Sample 3 (Emotion: Happy)
Text: That I owe my thanks to you.
Target F5-TTS CosyVoice2 Ours
Samples


Sample 4 (Emotion: Neutral)
Text: Poor Tom now is dead.
Target F5-TTS CosyVoice2 Ours
Samples


Sample 6 (Emotion: Sad)
Text: Must a name mean something?
Target F5-TTS CosyVoice2 Ours
Samples


Sample 7 (Emotion: Angry)
Text: You are not a runaway, who are you?
Target F5-TTS CosyVoice2 Ours
Samples


Sample 8 (Emotion: Sad)
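Text: 爱上一个人的重要标志就是,遇上任何美景都在遗憾,为何你不在身边. (English: An important sign of having fallen in love with someone is that, at every beautiful sight, you regret that they are not by your side.)
Target F5-TTS CosyVoice2 Ours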
Samples


Sample 9 (Emotion: Happy)
Text: 司马鹰扬笑了,笑得有点儿得意. (English: Sima Yingyang smiled, a little smugly.)
Target F5-TTS CosyVoice2 Ours
Samples


Sample 10 (Emotion: Angry)
Text: 受到处罚你可不能怨别人,知道吗,臭小子! (English: If you get punished, you can't blame anyone else, got it, you brat!)
Target F5-TTS CosyVoice2 Ours
Samples


Sample 11 (Emotion: Surprise)
Text: 哈?谁要把它送给我,长大了我就嫁给他! (English: Huh? Whoever gives it to me, I'll marry him when I grow up!)
Target F5-TTS CosyVoice2 Ours
Samples


Sample 12 (Emotion: Neutral)
Text: 米琦,你出来,我有话要跟你说. (English: Miqi, come out, I have something to say to you.)
Target F5-TTS CosyVoice2 Ours
Samples