For each utterance, we compare our sample with two baseline-generated samples and one target sample from the ESD dataset.


Target : Target samples taken from the ESD dataset.

emospeech : Baseline emospeech model.

cosyvoice2 : Baseline cosyvoice2 model.

Our EASPO : Our proposed EASPO model.

Unlike prior DPO-based methods, EASPO avoids propagating preferences directly across diffusion steps. At each step, EASPO produces a set of candidate samples and chooses a suitable win–lose pair from them to update the diffusion model. Afterward, one candidate is picked at random to serve as the starting point for the next iteration.
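The step-wise procedure described in the caption can be sketched roughly as follows. This is a minimal toy illustration, not the EASPO implementation: the function names (`propose_candidates`, `pick_win_lose`, `stepwise_preference_rollout`), the Gaussian perturbation standing in for a diffusion step, the scoring function, and the rule of taking the best and worst candidates as the win–lose pair are all assumptions made for the example. The page only states that a "suitable" pair is chosen.

```python
import random
from typing import Callable, List, Tuple

State = List[float]  # toy stand-in for the latent at one diffusion step


def propose_candidates(state: State, k: int, rng: random.Random) -> List[State]:
    """Hypothetical sampler: draw k candidate next-step latents from the current one."""
    return [[x + rng.gauss(0.0, 1.0) for x in state] for _ in range(k)]


def pick_win_lose(candidates: List[State],
                  score: Callable[[State], float]) -> Tuple[State, State]:
    """Assumed selection rule: highest-scoring candidate wins, lowest-scoring loses."""
    ranked = sorted(candidates, key=score)
    return ranked[-1], ranked[0]


def stepwise_preference_rollout(init: State, num_steps: int, k: int,
                                score: Callable[[State], float],
                                rng: random.Random):
    """Collect one win-lose pair per step; continue from a randomly chosen candidate."""
    state = init
    pairs = []
    for _ in range(num_steps):
        candidates = propose_candidates(state, k, rng)
        win, lose = pick_win_lose(candidates, score)
        pairs.append((win, lose))  # a preference update on (win, lose) would go here
        state = rng.choice(candidates)  # random candidate seeds the next iteration
    return pairs, state


# Usage with a toy score that prefers latents close to the origin.
toy_score = lambda s: -sum(x * x for x in s)
pairs, final_state = stepwise_preference_rollout(
    [0.0, 0.0], num_steps=3, k=4, score=toy_score, rng=random.Random(0)
)
```

Because the next starting point is drawn at random from the candidates rather than set to the winner, each step's preference pair is formed locally and is not carried forward along the trajectory, which matches the caption's description of avoiding direct preference propagation.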


Sample 1 (Emotion: Angry)
Text: No, I burst the balloon!
Target emospeech cosyvoice2 our EASPO
Samples


Sample 2 (Emotion: Surprise)
Text: The football teams give a tea party.
Target emospeech cosyvoice2 our EASPO
Samples


Sample 3 (Emotion: Happy)
Text: That I owe my thanks to you.
Target emospeech cosyvoice2 our EASPO
Samples


Sample 4 (Emotion: Neutral)
Text: Poor Tom now is dead.
Target emospeech cosyvoice2 our EASPO
Samples


Sample 5 (Emotion: Sad)
Text: Must a name mean something?
Target emospeech cosyvoice2 our EASPO
Samples