DOI: 10.31038/MGJ.2021423
Abstract
Recent research in speech recognition has shown that end-to-end frameworks have greater potential than traditional ones. To address the problem of unstable decoding performance in end-to-end speech recognition, a hybrid end-to-end model combining connectionist temporal classification (CTC) and multi-head attention is proposed. The CTC criterion is introduced to constrain 2D-attention, and the implicit constraint of CTC on the 2D-attention distribution is realized by adjusting the weight ratio of the two loss functions. On the 178-hour open-source Aishell dataset, a word error rate of 7.237% is achieved. Experimental results show that the proposed end-to-end model achieves a higher recognition rate than general end-to-end models and offers a measure of progress on Mandarin recognition.
Keywords
Speech recognition, 2-Dimensional multi-head attention, Connectionist temporal classification
Introduction
Speech recognition technology is one of the important research directions in artificial intelligence and other emerging fields. Its main function is to convert a speech signal directly into the corresponding text. Yu Dong et al. proposed combining deep neural networks with hidden Markov models, which achieved better recognition than the GMM-HMM system in continuous speech recognition tasks [1-3]. Subsequently, deep learning algorithms based on Recurrent Neural Networks (RNN) [4,5] and Convolutional Neural Networks (CNN) [6-11] gradually became mainstream in speech recognition and achieved very good results in practical tasks. Recent studies have shown that end-to-end speech recognition frameworks have greater potential than traditional frameworks. The first is Connectionist Temporal Classification (CTC) [12], which learns the sequence mapping directly within an end-to-end model; there is no need to label the alignment between the input and output sequences in the training data in advance, so the end-to-end model can achieve better results in sequence learning tasks such as speech recognition. The second is the encoder-decoder model based on the attention mechanism. Transformer [13] is a common attention-based model, and many researchers are currently trying to apply it to the ASR field. Linhao Dong et al. [14] introduced the attention mechanism in both the time and frequency domains through 2D-attention, which converged at a small training cost and achieved good results. Abdelrahman Mohamed et al. [15] used representations extracted by a convolutional network to replace the previous absolute positional encoding, making the feature length as close as possible to the target output length, which saves computation and alleviates the mismatch between the lengths of the feature sequence and the target sequence. Although the effect is not as good as the RNN model [16], the word error rate is the lowest among methods without a language model. Shigeki Karita et al. [17] made a thorough comparison between RNN and Transformer across multiple languages, and Transformer showed advantages in every task. Yang Wei et al. [18] showed that a hybrid CTC/attention architecture offers advantages in recognizing accented Mandarin. In this paper, a hybrid end-to-end architecture combining the Transformer model and CTC is proposed. Joint training and joint decoding are adopted, the 2D-attention mechanism is introduced from both the time-domain and frequency-domain perspectives, and the training process on the Aishell dataset is studied with a shallow encoder-decoder network.
Hybrid CTC/Transformer Model
The overall structure of the hybrid CTC/Transformer model is shown in Figure 1. In the hybrid architecture, CTC and multi-head attention are used jointly during training and decoding, and CTC is used to constrain attention and further improve the recognition rate.
Figure 1: An encoder-decoder architecture based on Transformer and CTC.
In an end-to-end speech recognition task, the goal is to compute, through a single network, the probability of every output label sequence $y = (y_1, \dots, y_U)$ given an input $x = (x_1, \dots, x_T)$, where usually $U \le T$ and $y_u \in L$ for a finite character set $L$. The final output is the label sequence with the highest probability:

$$y^{*} = \arg\max_{y} P(y \mid x) \tag{1}$$
Connectionist Temporal Classification
The structure of Connectionist Temporal Classification is shown in Figure 2. During training, an intermediate sequence $\pi = (\pi_1, \dots, \pi_T)$ is produced, in which duplicate labels are allowed and a blank label $\langle - \rangle$ is introduced as a separator, i.e. $\pi_t \in L \cup \{blank\}$. For example, for $y = (\text{wo}, \text{ai}, \text{ni}, \text{zhong}, \text{guo})$, a valid intermediate sequence is $\pi = (-, \text{wo}, -, -, \text{ai}, \text{ai}, -, \text{ni}, \text{ni}, -, \text{zhong}, \text{guo}, -)$, and
Figure 2: Connectionist Temporal Classification.
๐ฆ’ ใ โ ,wo, โ ,ai, โ ,ni, โ ,zhong, โ ,guo, โ , this is equivalent to construct a many-to-one mapping ๐ต:๐ฟ’ โ ๐ฟโค๐ , The ๐ฟโค๐ is a possible ๐ output set in the middle of the sequence, and then get the probability of final output tag:
๐ฅย ย ใ โ๐โ ๐ตโ1i๐ฆ’s ๐i๐g๐ฅs ๏ผ2๏ผ
Where, ๐ represents a mapping between the input sequence and the corresponding output label; ๐๐ก represents label ๐๐ก at time ๐ก corresponding probability. All tags sequence in calculation through all the time, because of the need to ๐๐ iteration, ๐ said tag number, the total amount of calculation is too big. HMM algorithm can be used for reference here to improve the calculation speed:
๐๐๐ก๐๐ฆ ๐ฅ ใ โ๐๐ฆ’๐ขใ 1๐ผย ย ๐ฝ๐กi๐ขs๐ก๐๐กย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย (3)
Among them, the ๐ผ๐ก ๐ข ๐ฝ๐กi๐ขs respectively are forward probability and posterior probability of the -th label at the momentย ย ย to. Finally, from the intermediate sequenceย ย ย ย to the output sequence, CTC will first recognize the repeated characters between the delimiters and delete them, and then remove the delimiter <โ>.
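As a concrete illustration of the many-to-one mapping $B$, the following Python sketch (not part of the original paper) collapses an intermediate sequence by first merging repeated labels and then removing the blank delimiter, using the example sequence given above:

```python
# Minimal sketch of the CTC collapse mapping B: merge repeats, then drop blanks.
# The blank symbol "-" and the example sequence follow the paper's illustration.

def ctc_collapse(pi, blank="-"):
    """Map an intermediate sequence pi to the output label sequence."""
    collapsed = []
    prev = None
    for label in pi:
        if label != prev:          # merge repeated characters between delimiters
            collapsed.append(label)
        prev = label
    return [label for label in collapsed if label != blank]  # remove the blank label

pi = ["-", "wo", "-", "-", "ai", "ai", "-", "ni", "ni", "-", "zhong", "guo", "-"]
print(ctc_collapse(pi))  # ['wo', 'ai', 'ni', 'zhong', 'guo']
```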
Transformer Model with 2D-Attention
The Transformer model used in this paper adopts a multi-layer encoder-decoder architecture based on multi-head attention. Each encoder layer is composed of a 2-dimensional multi-head attention layer, a fully connected network, layer normalization, and a residual connection. Each decoder layer is composed of a 2-dimensional multi-head attention layer that only uses information up to the current moment (masking subsequent positions), an attention layer over the encoder output, a fully connected feed-forward network, layer normalization, and a residual connection. The multi-head attention mechanism first obtains the three matrices $Q$, $K$, $V$ by linear transformations of the input sequence:

$$Q = W^{Q}X, \quad K = W^{K}X, \quad V = W^{V}X \tag{4}$$
Then the similarity between the matrices $Q$ and $K$ is calculated by dot product:

$$f(Q, K) = QK^{T} \tag{5}$$
In the decoder, only the features at and before the current time step may be used, so the similarity to information at subsequent time steps must be masked. This is usually done by introducing a mask matrix whose lower triangle is all 0 and whose upper triangle is all negative infinity; adding it to the matrix computed by Equation (5) effectively yields a lower triangular matrix, since the positions holding negative infinity become 0 after Softmax. Finally, the Softmax function normalizes the scores, and a weighted average is computed according to the attention distribution:
๐ด๐ก๐ก๐๐๐กi๐๐ ๐,, ใ ๐ ๐ฦ๐ก๐๐๐ฅ ฦ ๐,๐พย ย ย ย ย ๐ย ย ย ย ย ย ย ย ย ย ย ย (6)
The multi-head attention mechanism actually runs several independent attention computations and combines them as an ensemble. On the one hand this allows information to be learned from multiple perspectives; on the other hand it helps prevent overfitting. Each head is computed as above, and the results are concatenated and passed through a linear transformation to give the output:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h)\,W^{O} \tag{7}$$
2D โ Attention
The Attention structure in Transformer only models the position correlation in the time domain. However, human beings rely on both time domain and frequency domain changes when listening to speech, so the 2D-Attention structure is applied here, as shown in Figure 3, that is, the position correlation in both time domain and frequency domain is modeled. It helps to enhance the invariance of the model in time domain and frequency domain.
Figure 3: The structure of 2D-Attention.
Its calculation formula is as follows:
$$
\begin{aligned}
2D\text{-}\mathrm{Attention}(I) &= W^{o} * \mathrm{Concat}\big(fhead_1, \dots, fhead_n,\; thead_1, \dots, thead_n\big) \\
\text{where } fhead_i &= \mathrm{Attention}\big(W_i^{Q} * I^{T},\; W_i^{K} * I^{T},\; W_i^{V} * I^{T}\big) \\
thead_i &= \mathrm{Attention}\big(W_i^{Q} * I,\; W_i^{K} * I,\; W_i^{V} * I\big)
\end{aligned} \tag{8}
$$
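As a rough sketch of Equation (8), the following PyTorch code attends over the time axis for half of the heads and over the frequency axis for the other half; the per-head projection matrices, their shapes, and the final output projection are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v

def two_d_attention(I, time_proj, freq_proj, W_o):
    # I: (t, f) feature map; time_proj / freq_proj: lists of (Wq, Wk, Wv) per head
    theads = [attention(I @ Wq, I @ Wk, I @ Wv)            # attend over the time axis
              for (Wq, Wk, Wv) in time_proj]
    fheads = [attention(I.T @ Wq, I.T @ Wk, I.T @ Wv).T    # attend over the frequency axis
              for (Wq, Wk, Wv) in freq_proj]
    return torch.cat(fheads + theads, dim=-1) @ W_o        # concatenate heads and project

t, f, n_heads = 100, 40, 4
I = torch.randn(t, f)
time_proj = [tuple(torch.randn(f, f) for _ in range(3)) for _ in range(n_heads)]
freq_proj = [tuple(torch.randn(t, t) for _ in range(3)) for _ in range(n_heads)]
W_o = torch.randn(2 * n_heads * f, f)
out = two_d_attention(I, time_proj, freq_proj, W_o)        # (t, f)
```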
After calculating the 2-dimensional multi-head attention mechanism, a feed-forward neural network is required at each layer, including a fully connected layer and a linear layer. The activation function of the fully connected layer is ReLU:
๐น๐น๐ ๐ฅ ย ย ใ ๐๐๐ฅ 0,W1 + ๐1 W2 + ๐2ย ย ย ย ย ย ย ย ย ย ย ย ย ย (9)
To prevent the gradient from vanishing, a residual connection mechanism is introduced that passes the input from the lower layer directly to the upper layer without going through the network, which slows the loss of information and improves training stability:
๐ฅ + ๐๐ข๐๐ต๐๐๐๐i๐ฟ๐๐ฆ๐๐๐๐๐๐i๐ฅssย ย ย ย ย ย ย ย (10)
To sum up, the Transformer model with 2D-attention is shown in Figure 4. The speech features first pass through two convolution operations, which on the one hand improve the model's ability to learn time-domain information, and on the other hand reduce the time dimension of the features to close to the length of the target output, saving computation and alleviating the mismatch between the lengths of the feature sequence and the target sequence.
Figure 4: Overall network architecture of Transformer model with 2D-Attention.
The loss function is constructed according to the principle of maximum likelihood estimation:
๐ฟ๐๐ก๐ก ใ โ ๐๐๐๐ ๐ฆ1,2,โฆ,๐ฆ๐’ ๐ฅ1,๐ฅ2,โฆ,๐ฅ๐
ใ โ๐’ย ย ย ๐i๐ฆย ย g,,โฆ,,sย ย ย ย ย ย ย ย ย ย (11)
๐ก’ ใ 1ย ย ย ย ย ๐ก’ย ย ย ย ย ย ย ย ย ย ย ย ย 1ย ย ย ย 2
๐ก’ โ1
The final loss function consists of a linear combination of CTC and Transformer’s losses:
๐ฟ ใ ๐๐ฟ๐ถ๐๐ถ + ๐พ๐ฟ๐๐ก๐กย ย ย ย ย ย ย ย ย ย ย (12)
FBANK Feature Extraction
The process of FBANK feature extraction is shown in Figure 5. In all experiments, audio data are sampled at 16 kHz and 40-dimensional FBANK feature vectors are used, with a frame length of 25 ms and a frame shift of 10 ms.
Figure 5: Flowchart of FBANK feature extraction.
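As a sketch of this front end, the Kaldi-compatible FBANK routine in torchaudio can produce 40-dimensional features with a 25 ms frame length and 10 ms shift; the file name below is only an example.

```python
import torchaudio

waveform, sample_rate = torchaudio.load("example.wav")   # assume 16 kHz audio
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=40,        # 40-dimensional FBANK features
    frame_length=25.0,      # 25 ms frames
    frame_shift=10.0,       # 10 ms frame shift
    sample_frequency=sample_rate,
)
print(fbank.shape)          # (num_frames, 40)
```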
Speech Feature Enhancement
In computer vision, there is a feature enhancement method called "Cutout" [19]. Inspired by this method, this paper adopts time and frequency masking mechanisms that mask a block of consecutive time steps and of Mel frequency channels, respectively. This feature enhancement helps to avoid overfitting. The specific methods are as follows:
๏ผ1๏ผย ย ย Frequency shielding: Shielding ฦ consecutive MEL frequency channels:
[ฦ0,ฦ0 + ฦs, replace them with 0. Where,ย ย ย ย ย ย ย is from zero to custom frequency shielding parametersย ย ย ย ย ย ย ย ย ย ย ย ย randomly chosen from a uniform distribution, and ฦ0 is selected fromย ย ย ย ย ย ย ย ย ย ย ย ย ย ย 0, โ ฦ randomly, ๐ฃ is the number of MEL frequency channel.
๏ผ2๏ผย ย ย Time shielding: Time step [๐ก0,๐ก0 + ๐กs for shielding, use 0 for replacement, includingย ย ย ย ย ย ย ย ย ย ย ย from zero to a custom time block parameters randomly chosen from a uniform distribution, ๐ก0 from [0,๐ โ ๐กs randomly selected. The speech characteristics of the original spectra and after time and frequency shield the spectrogram characteristic of language contrast as shown in Figure 6, to achieve the purpose of to strengthen characteristics.
Figure 6: Feature enhancement contrast chart.
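The following sketch implements the two masking steps above on an FBANK feature matrix; the maximum mask widths F and T are illustrative hyperparameters, not values from the paper.

```python
import random
import torch

def mask_features(fbank, F=8, T=40):
    x = fbank.clone()                      # (num_frames, num_mel_bins)
    tau, v = x.shape
    f = random.randint(0, F)               # mask width drawn uniformly from [0, F]
    f0 = random.randint(0, max(v - f, 0))
    x[:, f0:f0 + f] = 0                    # frequency masking
    t = random.randint(0, T)               # mask width drawn uniformly from [0, T]
    t0 = random.randint(0, max(tau - t, 0))
    x[t0:t0 + t, :] = 0                    # time masking
    return x

augmented = mask_features(torch.randn(300, 40))
```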
Label Smoothing
The learning direction of neural network is usually to maximize the gap between correct labels and wrong labels. However, when the training data is relatively small, it is difficult for the network to accurately represent all the sample characteristics, which will lead to overfitting. Label smoothing solves this problem by means of regularization. By introducing a noise mechanism, it alleviates the problem that the weight proportion of the real sample label category is too large in the calculation of loss function, and then plays a role in inhibiting overfitting. The true probability distribution after adding label smoothing becomes:
$$
q_i = \begin{cases} 1, & i = y \\ 0, & i \neq y \end{cases}
\qquad\longrightarrow\qquad
q_i = \begin{cases} 1 - \varepsilon, & i = y \\ \dfrac{\varepsilon}{K - 1}, & i \neq y \end{cases}
$$
where $K$ is the total number of classes and $\varepsilon$ is a small hyperparameter.
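A small sketch of the smoothed distribution and its use in a cross-entropy-style loss (variable names and the random logits are illustrative):

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_classes, eps=0.1):
    # true class gets 1 - eps; the remaining eps is spread over the other K - 1 classes
    q = torch.full((labels.size(0), num_classes), eps / (num_classes - 1))
    q.scatter_(1, labels.unsqueeze(1), 1.0 - eps)
    return q

labels = torch.tensor([2, 0, 3])
q = smoothed_targets(labels, num_classes=5)
loss = -(q * F.log_softmax(torch.randn(3, 5), dim=-1)).sum(dim=-1).mean()
```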
Experiment
Experimental Model
The end-to-end model adopted in this paper is a hybrid model combining connectionist temporal classification with a Transformer based on 2-dimensional multi-head attention. It is compared with an end-to-end model based on RNN-T and with models based on multi-head attention alone. The experiments were carried out under the PyTorch framework on an RTX 3080 GPU.
Data
This paper uses the open-source Aishell dataset released by Beijing Shell Shell Technology, which contains about 178 hours of speech. The dataset contains recordings from about 400 speakers with different accents from different regions. Recording was done in a relatively quiet indoor environment using three types of devices in parallel: a high-fidelity microphone (44.1 kHz, 16-bit), iOS mobile devices (16 kHz, 16-bit), and Android mobile devices (16 kHz, 16-bit); the high-fidelity recordings were then downsampled to 16 kHz.
Network Structure and Parameters
The network in this paper uses four layers of multi-head attention; the attention dimension of each layer is 256, the input feature dimension of the feed-forward fully connected layer is 256, and its hidden dimension is 2048. The joint training weight λ is 0.1, the dropout rate is 0.1, and the label smoothing parameter is 0.1. Training runs for 200 epochs.
Evaluation Index
In the evaluation of experimental results, the word error rate (WER) is used as the evaluation index. To make the recognized word sequence identical to the reference word sequence, certain words must be inserted (I), substituted (S), or deleted (D); the word error rate is the total number of these insertions, substitutions, and deletions divided by the total number of words $N$ in the reference sequence:

$$\mathrm{WER} = 100 \times \frac{I + S + D}{N}\,\%$$
Experimental Results and Analysis
Table 1 shows the comparison between 2D-attention models without CTC and 2D-attention models with CTC. Compared with the plain models, adding the CTC constraint improves performance by 6.52%-10.98%; compared with the RNN-T end-to-end model, the word error rate is reduced by 4.26 percentage points, a relative improvement of 37.07%.
Table 1: Comparison of model word error rate.
Model               Test-WER/%
RNN-T               11.50
4Enc+3Dec           9.320
4Enc+4Dec           9.165
4Enc+4Dec+0.1CTC    8.567
6Enc+3Dec           8.130
6Enc+3Dec+0.1CTC    7.237
Figure 7 shows the comparison of the loss functions of the two models (2D-attention model without CTC and 2D-attention model with CTC). Compared with the ordinary model, the loss of the constrained model with CTC can reach a smaller value.
Figure 7: Contrast chart of loss changes.
Conclusion
In this paper, we propose a hybrid Transformer architecture using CTC and 2-dimensional multi-head attention for Mandarin speech recognition. Compared with traditional speech recognition systems, it does not require the acoustic model and the language model to be trained separately; only a single model needs to be trained, which learns speech information from both the time-domain and frequency-domain perspectives and achieves a recognition rate comparable to advanced models. We find that the Transformer model performs better than the RNN-T end-to-end model, and that increasing the depth of the encoder allows the information contained in the speech to be learned better, significantly improving Mandarin recognition, while increasing the depth of the decoder has little effect on overall performance. At the same time, introducing CTC improves the sequence alignment and allows the model to achieve its best results; however, recognition of some specialized vocabulary is still inaccurate. Follow-up research will address this by adding a language model, and will also work on improving the slow training speed of very deep networks.
Acknowledgement
This work was supported by the Philosophical and Social Sciences Research Project of Hubei Education Department (19Y049) and the Starting Research Foundation for the Ph.D. of Hubei University of Technology (BSQD 2019054), Hubei Province, China.
References
- Yu D, Deng L, Yu K, et al. (2016) Analytical Deep Learning: Practice of Speech Recognition. Beijing: Publishing House of Electronics Industry.
- Dahl GE, Yu D, Deng L, Acero A (2012) Context-Dependent Pre-Trained Deep Neural Networks for Large Vocabulary Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing 20: 30-42.
- Hinton G, Deng L, Yu D, Dahl G, Mohamed A, et al. (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine 29: 82-97.
- Hochreiter S, Schmidhuber J (1997) Long Short-Term Memory. Neural Computation 9: 1735-1780.
- Zhang Y, Chen G, Yu D, Yao K, Khudanpur S, et al. (2016) Highway Long Short-Term Memory RNNs for Distant Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing, March 20-25, Shanghai, China. Piscataway: IEEE Press.
- Lecun Y, Bengio Y (1995) Convolutional Networks for Images, Speech and Time-series. Cambridge: MIT Press.
- Abdel-Hamid O, Mohamed AR, Jiang H, Penn G (2012) Applying Convolutional Neural Network Concepts to Hybrid NN-HMM Model for Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing, March 2012, Kyoto, Japan. Piscataway: IEEE Press: 4277-4280.
- Abdel-Hamid O, Mohamed AR, Jiang H, Deng L, Penn G, et al. (2014) Convolutional Neural Networks for Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22: 1533-1545.
- Abdel-Hamid O, Deng L, Yu D (2013) Exploring Convolutional Neural Network Structures and Optimization Techniques for Speech Recognition. Interspeech 58: 1173-1175.
- Sainath TN, Mohamed AR, Kingsbury B, Ramabhadran B (2013) Deep Convolutional Neural Networks for LVCSR. IEEE International Conference on Acoustics, Speech and Signal Processing, May 26-30, Vancouver, BC, Canada. Piscataway: IEEE Press: 8614-8618.
- Sainath TN, Vinyals O, Senior A, Sak H (2015) Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks. IEEE International Conference on Acoustics, Speech and Signal Processing, April 19-24, Brisbane, QLD, Australia. Piscataway: IEEE Press: 4580-4584.
- Graves A, Fernández S, Gomez F, Schmidhuber J (2006) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of the 23rd International Conference on Machine Learning: 369-376.
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, et al. (2017) Attention Is All You Need. Advances in Neural Information Processing Systems: 6000-6010.
- Dong L, Xu S, Xu B (2018) Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Mohamed A, Okhonko D, Zettlemoyer L (2019) Transformers with Convolutional Context for ASR. arXiv: 1904.11660.
- Han KJ, Chandrashekaran A, Kim J, Lane I (2017) The Capio 2017 Conversational Speech Recognition System. arXiv: 1801.00059.
- Karita S, Wang X, Watanabe S, Yoshimura T, Zhang W, et al. (2019) A Comparative Study on Transformer vs RNN in Speech Applications. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
- Yang W, Hu Y (2021) End-to-End Accented Putonghua Recognition Based on Hybrid CTC/Attention. Application Research of Computers 38: 755-759.
- Pham NQ, Nguyen TS, Niehues J, et al. (2019) Very Deep Self-Attention Networks for End-to-End Speech Recognition. arXiv: 1904.13377.
- Cubuk ED, Zoph B, Mané D, Vasudevan V, Le QV (2018) AutoAugment: Learning Augmentation Policies from Data. arXiv: 1805.09501.