For a given triplet training corpus, a traditional high-quality end-to-end speech translation system relies on pre-training followed by fine-tuning.
However, this process only exploits pairwise data at each stage, and such loose coupling fails to make full use of the correlation within the triplet data. Our work models the joint probability of the transcription and the translation conditioned on the speech input, so that the triplet data can be used directly. On this basis, we propose a new regularization training method based on triangular decomposition agreement to improve the consistency of the dual-path decompositions.
To address this problem, Du Yichao, a master's student at the University of Science and Technology of China, explains his team's solution in this issue of AI Drive, which is also their latest work published at AAAI 2022: end-to-end speech translation with triangular decomposition consistency constraints.
This work was jointly completed by the University of Science and Technology of China, Alibaba DAMO Academy, Rutgers University, and Tencent AI Lab.
The content of this issue is mainly divided into the following parts (the live-stream slides can be obtained from the backend of the "Data Fighting School" public account):
1. Background: the definition of the speech translation task and a review of related work.
2. The proposed method: an end-to-end speech translation method with triangular decomposition consistency constraints.
3. Experimental analysis: performance and related analyses on benchmark datasets.
4. Summary
Our method is called E2E-ST-TDA (end-to-end speech translation with triangular decomposition agreement). In terms of model structure, it consists of an encoder and a decoder. In the encoder, the speech signal is first downsampled by two one-dimensional convolution layers, and the encoder output is then obtained with a Transformer encoder. The decoder is the core of our method and works in two steps. The first step is dual-path decoding at the target side, so that all the triplet data are covered by a single model during training. The second step is to bring the output representations of the two paths closer together through regularization.
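As a rough illustration of this encoder structure, the following PyTorch sketch (not the authors' code; the feature dimension, kernel sizes, and layer counts are assumptions) downsamples the input features with two strided 1-D convolutions and then applies a Transformer encoder:

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Minimal sketch of the encoder described above (illustrative sizes)."""
    def __init__(self, input_dim=80, d_model=256, n_layers=6, n_heads=4):
        super().__init__()
        # Two 1-D convolutions, each with stride 2, give 4x temporal downsampling.
        self.conv = nn.Sequential(
            nn.Conv1d(input_dim, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, feats):  # feats: (batch, time, input_dim)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)  # (batch, time/4, d_model)
        return self.transformer(x)  # encoder output consumed by the decoder
```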
First, we describe the concrete flow of dual-path decoding. After receiving the encoder output, the decoder jointly models the transcription and the translation at the target side, that is, it outputs a joint sequence of the transcription text and the translation text. Different decoding paths are distinguished by language identifiers, for example by placing the corresponding language identifier before each sub-sequence, as sketched below.
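The following sketch shows one way the two joint target sequences could be assembled; the tag tokens here are illustrative assumptions, not necessarily the identifiers used in the paper:

```python
def build_joint_targets(transcript_ids, translation_ids, src_tag, tgt_tag, eos):
    """transcript_ids / translation_ids: token ids for the transcription z and translation y."""
    # ASR-MT path: transcribe first, then translate  ->  p(z|x) * p(y|x,z)
    asr_mt = [src_tag] + transcript_ids + [tgt_tag] + translation_ids + [eos]
    # ST-BT path: translate first, then back-transcribe  ->  p(y|x) * p(z|x,y)
    st_bt = [tgt_tag] + translation_ids + [src_tag] + transcript_ids + [eos]
    return asr_mt, st_bt
```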
According to the chain rule of probability, the two decompositions mentioned above should be consistent in theory. In actual training, however, the two paths are optimized independently, and they differ in learning difficulty and prior knowledge, so the equality may not hold in practice. To solve this problem, we introduce two regularization terms over the probability outputs to reduce the mismatch between the sequences generated by the two paths. Specifically, taking the token "dog" as an example, the KL divergence between the probability distribution of "dog" output by the ASR-MT path and that output by the ST-BT path is used as a regularization term, so that the mismatch between the two paths is reduced. This process can be formalized as follows.
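In notation, with x the speech input, z the transcription, and y the translation, the two decompositions and a bidirectional KL regularizer of the kind described above can be written as (the exact formulation in the paper may differ in detail):

```latex
% Two chain-rule decompositions of the same joint probability
P(\mathbf{y},\mathbf{z}\mid\mathbf{x})
  = \underbrace{P(\mathbf{z}\mid\mathbf{x})\,P(\mathbf{y}\mid\mathbf{x},\mathbf{z})}_{\text{ASR-MT path}}
  = \underbrace{P(\mathbf{y}\mid\mathbf{x})\,P(\mathbf{z}\mid\mathbf{x},\mathbf{y})}_{\text{ST-BT path}}

% Bidirectional KL regularizer over corresponding target tokens (e.g. "dog")
\mathcal{R} = \mathrm{KL}\big(p_{\text{ASR-MT}} \,\Vert\, p_{\text{ST-BT}}\big)
            + \mathrm{KL}\big(p_{\text{ST-BT}} \,\Vert\, p_{\text{ASR-MT}}\big)
```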
Finally, the objective illustrated above is used to train the model. At inference time, for the ST task we decode along the ST-BT path, while for the recognition task we decode along the ASR-MT path.
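Before moving to the experiments, here is a minimal PyTorch sketch of such a combined objective. It is illustrative only: it assumes the logits of the two paths have already been gathered at corresponding token positions, and the function name and KL weight are made up for this example.

```python
import torch.nn.functional as F

def training_loss(logits_asr_mt, logits_st_bt, targets_asr_mt, targets_st_bt,
                  pad_id, kl_weight=1.0):
    # Cross-entropy over both joint sequences keeps all triplet data in one model.
    ce = (F.cross_entropy(logits_asr_mt.transpose(1, 2), targets_asr_mt, ignore_index=pad_id)
          + F.cross_entropy(logits_st_bt.transpose(1, 2), targets_st_bt, ignore_index=pad_id))
    # Bidirectional KL between per-token distributions of corresponding tokens
    # (e.g. "dog" in both joint sequences), assumed aligned beforehand.
    p = F.log_softmax(logits_asr_mt, dim=-1)
    q = F.log_softmax(logits_st_bt, dim=-1)
    kl = (F.kl_div(p, q, log_target=True, reduction="batchmean")
          + F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return ce + kl_weight * kl
```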
In the experimental part, we use MuST-C, the largest open-source speech translation dataset, to evaluate the proposed method. Its audio comes from TED Talks, and it contains triplet data from English to eight European languages. The figure below shows the detailed statistics and the comparison methods. For the model settings, we use a small-scale and a medium-scale configuration. We use the BLEU score to evaluate translation quality and WER to evaluate ASR performance.
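For reference, BLEU and WER are typically computed with off-the-shelf tools such as sacrebleu and jiwer; the snippet below is a sketch of that setup, not the paper's evaluation scripts:

```python
import sacrebleu
import jiwer

def evaluate(st_hyps, st_refs, asr_hyps, asr_refs):
    bleu = sacrebleu.corpus_bleu(st_hyps, [st_refs]).score  # translation quality
    wer = jiwer.wer(asr_refs, asr_hyps)                      # recognition error rate
    return bleu, wer
```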
The following table shows the results of the various methods on the MuST-C test sets. We can observe that our method E2E-ST-TDA achieves the best results across the board.
Meanwhile, ASR performance improves by 1.5/1.9 over the baseline model. This shows that, by mining the correlation within the triplet data, our method can improve ASR and ST performance at the same time.
In addition, we also ran experiments in a more realistic scenario with larger data. For audio data, we add the 960-hour LibriSpeech ASR data; for text data, we add the WMT14 En-De/En-Fr data. Two conclusions are drawn from these experiments: E2E-ST-TDA can be effectively extended to large-scale data scenarios and achieves SOTA performance, and large-scale data can effectively improve translation performance.
Next is the ablation study, which compares word-level and sequence-level knowledge distillation (WordKD/SeqKD) variants as well as removing the KL regularization terms. The results show that the regularization terms effectively help the model reduce the mismatch between the two paths.
To further verify whether the parameter scale affects the performance gain, we vary the embedding dimension over {256, 512, 768, 1024} on MuST-C En-De. The detailed results are shown in the figure below. As the embedding dimension increases, the trend of the performance gain stays consistent with the performance curve of the base model, which shows that our model is fairly robust.
In this study, we propose a new regularization training method based on triangular decomposition agreement, which improves overall translation performance by exploiting the correlation within the triplet data. Two regularization terms are added to eliminate the mismatch between the dual paths. Experiments on benchmark datasets verify the effectiveness of the method.
"Data Fighting School" hopes to help readers improve their practical skills and build an engaging big data community with real data and real industry cases.