StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
- Samples: https://styletts2.github.io/
- Paper: https://arxiv.org/abs/2306.07691
- Repo: https://github.com/yl4579/StyleTTS2
StyleTTS 2 is very appealing: the quality is very high and it's also flexible, supporting multi-speaker models, zero-shot speaker adaptation, expressive speech, and style transfer (speech and style vectors are separated).
It also turns out the inference code is very fast, beating TTS VITS by a big margin (and XTTS by an even bigger one). Note that all of these generate faster than real-time on an RTX 4090, but for StyleTTS 2 I'm seeing up to 95X real-time, while XTTS is barely faster than real-time at about 1.4X.
This write-up was done on the first day after release, and covers only adapting the LJSpeech inference ipynb code to a Python script. The instructions weren't in too bad a state. You can also see this post for a quick comparison of StyleTTS 2 vs TTS VITS vs TTS XTTS output.
Recommended System Pre-requisites
- espeak-ng - you need this
- CUDA - you could probably use CPU or ROCm, but idk
- Mamba - not required, but will make your life a lot easier
Environment setup:
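Nothing fancy - a minimal sketch assuming Mamba, a Debian/Ubuntu box, and Python 3.10; adjust the espeak-ng install and the PyTorch CUDA build for your own system:

```bash
# espeak-ng is a system package (used by phonemizer for G2P)
sudo apt install espeak-ng

# Fresh environment
mamba create -n styletts2 python=3.10
mamba activate styletts2

# Clone the repo and install its Python dependencies
git clone https://github.com/yl4579/StyleTTS2
cd StyleTTS2
pip install -r requirements.txt

# The demo notebook also needs phonemizer if requirements.txt doesn't pull it in
pip install phonemizer
```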
Get models:
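Roughly like this - a sketch assuming the pretrained LJSpeech checkpoint is pulled from the Hugging Face mirror (yl4579/StyleTTS2-LJSpeech); check the repo README for the current download links. The demo notebook looks for the model .pth and config.yml under Models/LJSpeech/:

```bash
# huggingface-cli ships with the huggingface_hub package
pip install -U "huggingface_hub[cli]"

# Download the pretrained LJSpeech checkpoint + config; if the download
# layout differs, move the files so they end up under Models/LJSpeech/
huggingface-cli download yl4579/StyleTTS2-LJSpeech --local-dir .
```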
Inferencing
- Well, basically just use Inference_LJSpeech.ipynb; the latest version should work. If you'd rather work from a plain .py file, see the nbconvert one-liner below.
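nbconvert will dump the notebook cells straight to a script you can hack on (the path assumes the notebook lives in the repo's Demo/ folder):

```bash
pip install nbconvert
# Writes Inference_LJSpeech.py to the directory given by --output-dir
jupyter nbconvert --to script Demo/Inference_LJSpeech.ipynb --output-dir .
```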
My changes mainly consist of adding file output:
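Something along these lines - a sketch that assumes the notebook's inference() helper and its noise tensor are already defined, and that the LJSpeech model outputs a float NumPy waveform at 24 kHz; the text and filename are just placeholders:

```python
import soundfile as sf

# inference(), noise, etc. all come straight from the notebook code;
# for the LJSpeech model the output is a float NumPy array at 24 kHz.
text = "StyleTTS 2 running as a plain Python script."
wav = inference(text, noise, diffusion_steps=5, embedding_scale=1)

# Write a 24 kHz WAV to disk instead of playing it inline with IPython
sf.write("output.wav", wav, 24000)
```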
Oh, and I like to output some extra timing info, e.g.:
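Something like this, wrapping the same call as above and reporting both the real-time factor and the real-time multiple (assumes the 24 kHz output rate):

```python
import time

start = time.time()
wav = inference(text, noise, diffusion_steps=5, embedding_scale=1)
elapsed = time.time() - start

audio_seconds = len(wav) / 24000        # duration of the generated audio
rtf = elapsed / audio_seconds           # real-time factor (lower is faster)
rt_multiple = audio_seconds / elapsed   # how many X faster than real-time

print(f"Generated {audio_seconds:.2f}s of audio in {elapsed:.2f}s")
print(f"RTF: {rtf:.3f} | {rt_multiple:.1f}X real-time")
```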
Personally, I find the RT multiple (how many X faster than real-time) more intuitive than RTF, especially once you get to higher multiples.
To be continued when I have a chance to get to training…