
Twin Regularization for Online Speech Recognition

ArXiv Paper Accepted at Interspeech 2018

"You can't connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future." - Steve Jobs

Deep learning has recently revolutionized speech recognition, helping achieve unprecedented performance levels. Nevertheless, the road towards more natural human-machine speech interaction is still long and full of scientific challenges. One important issue is the performance drop observed when going from offline to online speech recognition. So, what's the difference between online and offline speech recognition?


Offline speech recognition is used, for instance, to transcribe YouTube videos. In this case, we have access to the full speech sequence and we don't have any real-time constraints (i.e., we can employ a very computationally demanding transcriber, we can use multi-pass speech recognition, etc.). Online speech recognition, instead, is often used when interaction with a user is needed (e.g., think of systems like Siri, Amazon Alexa, Google Home, smart TVs, etc.). In this case, real-time/low-latency constraints arise, making online recognition significantly more challenging. To minimize latency, the decoding step (which is normally very computationally demanding) should start while the speech signal is still being recorded. This means that we only have access to past elements of the speech sequence, not to future ones. We can thus only employ unidirectional recurrent neural networks (RNNs), not bidirectional RNNs.


The philosophy behind our recent paper to mitigate this issue is the following: "If we don't have access to the future, why don't we try to predict it?".



Fortunately, neural networks are quite effective at such predictions, and we can try to roughly estimate the future elements of the speech sequence using only the past ones. There are various ways to do this. In our recent paper, we explored the use of Twin Regularization. Our technique encourages the hidden representations of a unidirectional recurrent network to embed some useful information about the future. To do so, we add a regularization term that forces forward hidden states to be as close as possible to cotemporal backward ones, computed by a "twin" neural network running backwards in time (see the image below).
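The idea can be sketched in a few lines of NumPy. This is a minimal toy illustration with vanilla tanh RNNs and hypothetical names (the paper applies the technique to full GRU/LSTM acoustic models, with the penalty scaled by a hyperparameter and added to the main speech recognition loss):

```python
import numpy as np

def rnn_states(x, W_in, W_rec, reverse=False):
    """Run a vanilla tanh RNN over a (T, D) sequence and return
    the (T, H) matrix of hidden states. With reverse=True, the
    'twin' network processes the sequence backwards in time."""
    T = x.shape[0]
    H = W_rec.shape[0]
    steps = range(T - 1, -1, -1) if reverse else range(T)
    h = np.zeros(H)
    states = np.zeros((T, H))
    for t in steps:
        h = np.tanh(x[t] @ W_in + h @ W_rec)
        states[t] = h
    return states

def twin_penalty(h_fwd, h_bwd):
    """Mean squared distance between cotemporal forward and
    backward hidden states: (1/T) * sum_t ||h_fwd[t] - h_bwd[t]||^2."""
    return np.mean(np.sum((h_fwd - h_bwd) ** 2, axis=1))

rng = np.random.default_rng(0)
T, D, H = 50, 13, 32  # frames, feature dim, hidden dim (toy sizes)
x = rng.standard_normal((T, D))

# Separate toy weights for the forward model and its backward twin.
W_in_f = 0.1 * rng.standard_normal((D, H))
W_rec_f = 0.1 * rng.standard_normal((H, H))
W_in_b = 0.1 * rng.standard_normal((D, H))
W_rec_b = 0.1 * rng.standard_normal((H, H))

h_fwd = rnn_states(x, W_in_f, W_rec_f)                # online (unidirectional) model
h_bwd = rnn_states(x, W_in_b, W_rec_b, reverse=True)  # twin, running backwards in time
loss_reg = twin_penalty(h_fwd, h_bwd)  # added (suitably weighted) to the training loss
```

The backward twin is only needed during training; at test time it is discarded and the forward network runs alone, which is why the method adds no extra computation at inference.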

The experiments, conducted on a number of datasets, recurrent architectures, input features, and acoustic conditions, have shown the effectiveness of this approach. One important advantage is that our method does not introduce any additional computation at test time compared with standard unidirectional recurrent networks.


Some Related References:


[1] M. Ravanelli, D. Serdyuk, Y. Bengio, "Twin Regularization for online speech recognition", accepted at Interspeech 2018.


[2] D. Serdyuk, N. R. Ke, A. Sordoni, A. Trischler, C. Pal, and Y. Bengio, "Twin networks: Matching the future for sequence generation," in Proc. of ICLR, 2018.


[3] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Light Gated Recurrent Units for Speech Recognition", in IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.


[4] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Improving speech recognition by revising gated recurrent units", in Proceedings of Interspeech 2017.


[5] M. Ravanelli, "Deep Learning for Distant Speech Recognition", PhD Thesis, University of Trento, 2017.


