Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods (2) and (3). We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset.
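The spectrogram prediction network is trained to map phoneme sequences to mel-spectrograms, so each utterance in the paired data must first be converted into a log-mel-spectrogram target. As a minimal NumPy-only sketch of that feature extraction (the abstract does not state the analysis parameters; the 22.05 kHz sample rate, 1024-point FFT, 256-sample hop, and 80 mel bands below are common Tacotron-style defaults, assumed here for illustration):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame, window, and FFT the waveform, then project power onto mel bands.
    frames = []
    for start in range(0, len(wav) - n_fft + 1, hop):
        frame = wav[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    power = np.array(frames).T                      # (n_fft//2 + 1, T)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power  # (n_mels, T)
    return np.log(np.maximum(mel, 1e-10))            # log compression

wav = np.random.randn(22050)      # 1 s of noise as a stand-in waveform
M = log_mel_spectrogram(wav)
print(M.shape)                    # (80, 83): 80 mel bands, 83 frames
```

In a full pipeline these log-mel frames serve both as the prediction targets for the spectrogram network and as the conditioning input to the PWG vocoder; production systems typically use an optimized library routine rather than this reference loop.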