Abstract:

This paper integrates a graph-to-sequence model into an end-to-end text-to-speech framework to achieve syntax-aware modelling from the syntactic information of the input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract syntactic hidden information, which is concatenated with the phoneme embedding and fed to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using single-speaker datasets, few-shot data from target speakers, and multi-speaker datasets. Experimental results show better prosodic consistency between the input text and the generated audio, higher scores in the subjective prosody evaluation, and the ability to perform voice conversion. In addition, the efficiency of the model is substantially improved through the design of an AI chip operator, which achieves a 5x speedup over the benchmark model.
Keywords:
Text-to-speech, graph neural network, syntactic modelling, speech synthesis
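
The pipeline described in the abstract (dependency parse, syntactic graph, graph encoding, concatenation with phoneme embeddings) can be illustrated with a small sketch. This is not the paper's implementation: the dependency heads are a hand-written parse of part of the first LJ Speech sample, the one-round mean aggregation is a stand-in for the actual graph encoder, and all function names, toy feature values, and the phoneme-to-word mapping are assumptions made for illustration only.

```python
from typing import List, Tuple

def build_syntactic_graph(heads: List[int]) -> List[Tuple[int, int]]:
    """Turn dependency heads into (head, dependent) edges.

    heads[i] is the index of token i's head, or -1 for the root token.
    """
    return [(h, i) for i, h in enumerate(heads) if h >= 0]

def encode_graph(features: List[List[float]],
                 edges: List[Tuple[int, int]]) -> List[List[float]]:
    """One round of mean aggregation over each node and its neighbours.

    Edges are treated as undirected; this is a toy stand-in for a
    learned graph encoder, not the encoder used in the paper.
    """
    neighbours = [[i] for i in range(len(features))]  # each node sees itself
    for head, dep in edges:
        neighbours[head].append(dep)
        neighbours[dep].append(head)
    dim = len(features[0])
    return [
        [sum(features[j][k] for j in nb) / len(nb) for k in range(dim)]
        for nb in neighbours
    ]

def concat_with_phonemes(phoneme_emb: List[List[float]],
                         syntax_emb: List[List[float]],
                         word_of_phoneme: List[int]) -> List[List[float]]:
    """Append the syntax vector of each phoneme's word to its embedding."""
    return [p + syntax_emb[w] for p, w in zip(phoneme_emb, word_of_phoneme)]

# Hand-written (assumed) parse of "The jury did not believe him":
# tokens: 0 The, 1 jury, 2 did, 3 not, 4 believe (root), 5 him
heads = [1, 4, 4, 4, -1, 4]
edges = build_syntactic_graph(heads)

# Toy one-dimensional word features: just the token index as a float.
word_features = [[float(i)] for i in range(len(heads))]
syntax = encode_graph(word_features, edges)

# Two toy phoneme embeddings, both belonging to word 0 ("The").
phoneme_emb = [[0.1, 0.2], [0.3, 0.4]]
combined = concat_with_phonemes(phoneme_emb, syntax, word_of_phoneme=[0, 0])
```

In a real system the word features and phoneme embeddings would be learned, and the concatenated sequence would be passed to the alignment and flow-based decoding modules; the sketch only shows how word-level syntactic vectors can be broadcast onto a phoneme sequence.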

Contents

Single Speaker (LJ Speech Dataset)


Text

" The jury did not believe him, and the verdict was for the defendants. " " He had repeated this wish only a few days before, during his visit to Tampa, Florida. " " The poorer prisoners were not in abject want, as in other prisons. "
Ground Truth
VITS
FastGraphTTS

Single Speaker (Biaobei Dataset)


Text

" 其他跨国公司的研发中心包括,飞思考尔、思科网讯等。 " " 母熊直立着,一副随时要扑过来的样子。 " " 据悉,领馆面朝一条较窄的马路,夜间很少有行人经过。 "
Ground Truth
VITS
FastGraphTTS

Few-shot (English Dataset)


Source speaker:

Text

" Since these agencies are already obliged constantly to evaluate the activities of such groups, " " would not agree with that particular wording, end quote. " " Soon afterwards Dixon died, showing all the symptoms already described. "
VITS
FastGraphTTS

Few-shot (Mandarin Dataset)


Source speaker:

Text

" 这趟慢车的乘客逐年减少。 " " 这样一来,供应商必然不会铤而走险。 " " 你总能从多视角看待事物,从而解决一些很棘手的问题! "
VITS
FastGraphTTS

Multi-Speaker (VCTK Dataset)


Text

" People come into the Borders for the beauty of the background. " " The rainbow is a division of white light into many beautiful colors. " " the greeks used to imagine that it was a sign from the gods to foretell war or heavy rain. "
Ground Truth
VITS
FastGraphTTS

Multi-Speaker (AISHELL-3 Dataset)


Text

" 确实没有必要将自己的未来全部拴在房子上。 " " 当老人遇到困难或身体不舒适时。 " " 以及可能存在的风险和不确定因素。 "
Ground Truth
VITS
FastGraphTTS

Voice Conversion (VCTK Dataset)

From\To Speaker A Speaker B
Speaker A
Speaker B

Voice Conversion (AISHELL-3 Dataset)

From\To Speaker A Speaker B
Speaker A
Speaker B

Prosody Analysis (LJ Speech Dataset)


Text

" Since these agencies are already obliged constantly to evaluate the activities of such groups, " " At first Mrs. Connally thought that her husband had been killed. " " and one or two men were allowed to mend clothes and make shoes. The rules made by the Secretary of State were hung up in conspicuous parts of the prison; "
As the red dashed boxes in the figures show, VITS sometimes places stress on words at random, unrelated to the actual syntax of the sentence. For example, "such" in the first sentence should not be stressed, yet VITS gives it obvious stress, which is inconsistent with the syntactic information of the text. FastGraphTTS resolves this random-stress problem.
VITS
FastGraphTTS