This paper describes a conditional neural network architecture for Mandarin Chinese polyphone disambiguation. The system consists of a bidirectional recurrent neural network that acts as a sentence encoder to accumulate context correlations, followed by a prediction network that maps the polyphonic character embeddings, together with the conditions, to the corresponding pronunciations. We obtain the word-level condition from a pre-trained word-to-vector lookup table. One goal of polyphone disambiguation is to address the homograph problem in the front-end processing of Mandarin Chinese text-to-speech systems. Our system achieves an accuracy of 94.69% on a publicly available polyphonic character dataset. To further validate our choice of conditional features, we investigate polyphone disambiguation systems with conditions at different levels. The experimental results show that both the sentence-level and the word-level conditional embedding features attain good performance for Mandarin Chinese polyphone disambiguation.

Grapheme-to-phoneme (G2P) conversion is a fundamental front-end procedure in a Chinese Text-to-Speech (TTS) synthesis system, whether it is a traditional HMM-based speech synthesis system or an End-to-End speech synthesis system. G2P typically generates a sequence of phones from a sequence of characters or graphemes. Mandarin Chinese has at least 13000 commonly used characters. However, the number drops considerably, to 1300, when the characters are converted into phonologically allowed syllables, and it is even smaller in a Latin alphabet representation. Phonemes or syllables therefore appear to be suitable units for an effective, better-performing TTS synthesis system. While the G2P system in an English TTS synthesis system aims to produce phoneme sequences for out-of-lexicon words, the target of a G2P system in a Chinese TTS synthesis system is to convert Chinese characters to pinyin (a Latin-alphabet phonetic representation of Mandarin Chinese). Yet a single Chinese character can have several different pronunciations depending on its usage in a sentence. Such characters are called polyphonic characters. Therefore, in addition to the G2P system, a polyphone disambiguation system is developed to choose the correct pronunciation of a polyphonic character from several candidates based on the context. This issue is also considered a homograph problem; it has important applications in speech synthesis and remains unsolved.

In this paper, we introduce a data-driven approach that uses a conditional neural network architecture for polyphone disambiguation. Besides using the polyphonic character embedding feature as the network input, we obtain auxiliary features from the corresponding sentence as a condition for predicting the correct pronunciation. Previous research on polyphonic characters shows that: 1) the use of context is an effective way to disambiguate the pronunciation of Chinese polyphonic characters; 2) most words that contain a polyphonic character can be used to determine the pronunciation of that character. In light of these two characteristics, we first design an encoder module with a recurrent neural network (RNN) structure to extract the sentence-level encoding feature as the context condition. Specifically, we embed each character in the sentence and adopt a bi-directional long short-term memory (BLSTM) structure to accumulate the forward and backward context information as the sentence-level conditional feature. In addition, we use a publicly released, pre-trained word-to-vector dictionary to look up the word-level conditional vector. The prediction network maps the polyphonic character embedding features and the auxiliary features to a unique pronunciation. We investigate three systems with different combinations of conditional features on a publicly available dataset. The results show that either the word-level or the sentence-level conditional feature yields a significant improvement in polyphone disambiguation.

Figure 1: The network architecture of our proposed system

The Recurrent Neural Network (RNN) architecture deals elegantly with sequential problems because it can capture correlations between samples in a sequence. To mitigate the exploding and vanishing gradient problems of RNNs, the long short-term memory (LSTM) structure was proposed; it successfully keeps track of arbitrarily long-term dependencies between the elements of an input sequence.
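The data flow of the architecture described in the text (a bidirectional recurrent sentence encoder, a word-level vector, and a prediction network over candidate pronunciations) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several things are assumptions made only for illustration: plain tanh recurrent cells stand in for the LSTM cells, the sentence-level condition is taken from the encoder states at the position of the polyphonic character, the three features are combined by concatenation, and all sizes, weights, and vectors are random placeholders (the names `EMB`, `HID`, `WORD`, `rnn_pass`, and `predict` are hypothetical).

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

EMB, HID, WORD = 8, 6, 5  # hypothetical embedding, hidden, and word-vector sizes


def rand_matrix(rows, cols):
    """Random placeholder weights; a real system would learn these."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_pass(inputs, w_in, w_rec):
    """Plain tanh recurrence (a stand-in for an LSTM); returns one state per step."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        states.append(hidden)
    return states


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(char_embs, target_idx, word_vec, params, candidates):
    # Sentence-level condition: forward and backward states at the target position
    # (assumption: the source does not say which states are used).
    fwd = rnn_pass(char_embs, params["f_in"], params["f_rec"])
    bwd = rnn_pass(char_embs[::-1], params["b_in"], params["b_rec"])[::-1]
    sentence_cond = fwd[target_idx] + bwd[target_idx]
    # Prediction network input: character embedding plus both conditions,
    # joined by list concatenation (also an assumption).
    features = char_embs[target_idx] + sentence_cond + word_vec
    return dict(zip(candidates, softmax(matvec(params["out"], features))))


# Toy input: the character 行 at index 1 of 银行行长, with two candidate readings.
sentence = ["银", "行", "行", "长"]
table = {ch: [random.uniform(-1, 1) for _ in range(EMB)] for ch in dict.fromkeys(sentence)}
char_embs = [table[ch] for ch in sentence]
word_vec = [random.uniform(-1, 1) for _ in range(WORD)]  # placeholder for a word2vec lookup
candidates = ["hang2", "xing2"]
params = {
    "f_in": rand_matrix(HID, EMB), "f_rec": rand_matrix(HID, HID),
    "b_in": rand_matrix(HID, EMB), "b_rec": rand_matrix(HID, HID),
    "out": rand_matrix(len(candidates), EMB + 2 * HID + WORD),
}
probs = predict(char_embs, 1, word_vec, params, candidates)
print(probs)
```

Because the weights are untrained, the printed probabilities carry no meaning. The sketch only shows how the character embedding, the sentence-level condition, and the word-level condition could be brought together before a softmax over the candidate pronunciations.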