
Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR ∗

Junkun Chen 1 Mingbo Ma 2 Renjie Zheng 2 Liang Huang 1,2

1 Oregon State University, Corvallis, OR, USA    2 Baidu Research, Sunnyvale, CA, USA

[email protected], [email protected]

Abstract

Simultaneous speech-to-text translation is widely useful in many scenarios. The conventional cascaded approach uses a pipeline of streaming ASR followed by simultaneous MT, but suffers from error propagation and extra latency. To alleviate these issues, recent efforts attempt to directly translate the source speech into target text simultaneously, but this is much harder due to the combination of two separate tasks. We instead propose a new paradigm with the advantages of both cascaded and end-to-end approaches. The key idea is to use two separate, but synchronized, decoders on streaming ASR and direct speech-to-text translation (ST), respectively, and the intermediate results of ASR guide the decoding policy of (but are not fed as input to) ST. At training time, we use multitask learning to jointly learn these two tasks with a shared encoder. En-to-De and En-to-Es experiments on the MuST-C dataset demonstrate that our proposed technique achieves substantially better translation quality at similar levels of latency.

1 Introduction

Simultaneous speech-to-text translation incrementally translates source-language speech into target-language text, and is widely useful in many cross-lingual communication scenarios such as international travel and multinational conferences. The conventional approach to this problem is a cascaded one (Arivazhagan et al., 2020; Xiong et al., 2019; Zheng et al., 2020b), involving a pipeline of two steps. First, a streaming automatic speech recognition (ASR) module transcribes the input speech on the fly (Moritz et al., 2020; Wang et al., 2020), and then a simultaneous text-to-text translation module translates the partial transcription into target-language text (Oda et al., 2014; Dalvi et al., 2018; Ma et al., 2019; Zheng et al., 2019a,b, 2020a; Arivazhagan et al., 2019).

∗ See our translation examples and demos at https://littlechencc.github.io/SimulST-demo/simulST-demo.html.

Figure 1: Comparison between (a) the cascaded pipeline, (b) direct simultaneous ST, and (c) our ASR-assisted simultaneous ST. In (a), streaming ASR keeps revising some tail words for better accuracy, causing extra delays for MT. Method (b) directly translates the source speech without using ASR. Our work (c) uses the intermediate results of the streaming ASR module to guide the decoding policy of (but does not feed them as input to) the speech translation module. Extra delays between ASR and MT are reduced in the direct translation systems (b–c).

However, the cascaded approach inevitably suffers from two limitations: (a) error propagation, where streaming ASR's mistakes confuse the translation module (which is trained on clean text), a problem that worsens in noisy environments and with accented speech; and (b) extra latency, where the translation module has to wait until streaming ASR's output stabilizes, since ASR by default can repeatedly revise its output (see Fig. 1).

To overcome the above issues, some recent efforts (Ren et al., 2020; Ma et al., 2020b,a) attempt to directly translate the source speech into target text simultaneously by adapting the text-based wait-k strategy (Ma et al., 2019). However, unlike simultaneous translation, whose input is already segmented into words or subwords, in speech translation the key challenge is to figure out the number of valid tokens within a given source speech segment in order to apply the wait-k policy.


Ma et al. (2020b,a) simply assume a fixed number of words within a certain number of speech frames, which does not account for various aspects of speech such as different speaking rates, durations, pauses, and silences, all of which are common in realistic speech. Ren et al. (2020) design an extra Connectionist Temporal Classification (CTC)-based speech segmenter to detect word boundaries in speech. However, the CTC-based segmenter inherits the shortcoming of CTC, which only makes local predictions, thus limiting its segmentation accuracy. On the other hand, to alleviate error propagation, Ren et al. (2020) employ several different knowledge distillation techniques to learn the attentions of ASR and MT jointly. These knowledge distillation techniques are complicated to train and offer only an indirect solution to the error propagation problem.

We instead present a simple but effective solution (see Fig. 2) that employs two separate, but synchronized, decoders, one for streaming ASR and the other for end-to-end speech-to-text translation (E2E-ST). Our key idea is to use the intermediate results of streaming ASR to guide the decoding policy of, but not feed them as input to, the E2E-ST decoder. We look at the beam of streaming ASR to decide the number of tokens within the given source speech segment. It is then straightforward for the E2E-ST decoder to apply the wait-k policy and decide whether to commit a target word or to wait for more speech frames. At training time, we jointly train the ASR and E2E-ST tasks with a shared speech encoder in a multi-task learning (MTL) fashion to further improve translation accuracy. We also note that having streaming ASR as an auxiliary output is extremely useful in real application scenarios, where the user often wants to see both the transcription and the translation. En-to-De and En-to-Es experiments on the MuST-C dataset demonstrate that our proposed technique achieves substantially better translation quality at similar levels of latency.

2 Preliminaries

We formalize the full-sentence tasks (ASR, MT and ST) using the sequence-to-sequence framework, and the streaming tasks (simultaneous MT and streaming ASR) using the test-time wait-k method.

Full-Sentence Tasks: ASR, NMT and ST  The encoder first encodes the entire source input into a sequence of hidden states; in NMT, the input is a sequence of words, x = (x_1, x_2, ..., x_m), while in ASR and ST, we use s to denote the input speech frames. A decoder sequentially predicts target-language tokens y = (y_1, y_2, ..., y_n) in NMT and ST, or the transcription z in ASR, conditioned on all encoder hidden states and previously committed tokens. For example, the NMT model and its parameters θ_full^MT are defined as:

p_full(y | x; θ_full^MT) = ∏_{t=1}^{|y|} p(y_t | x, y_{<t}; θ_full^MT)

θ_full^MT = argmax_{θ_full^MT} ∏_{(x, y*) ∈ D} p_full(y* | x; θ_full^MT)

Similarly, we can obtain the definitions for ASR (p_full(z | s; θ_full^ASR)) and ST (p_full(y | s; θ_full^ST)). Our model is learned from scratch in this work, but it can be improved with pre-training methods (Zheng et al., 2021; Chen et al., 2020).

Simultaneous MT and Streaming ASR  In streaming decoding scenarios, we have to predict target tokens conditioned on the partial source input that is available. For example, the test-time wait-k method of Ma et al. (2019) predicts each target token y_t after reading the source tokens x_{≤t+k}, using a full-sentence NMT model:

y_t = argmax_{y_t} p_wait-k(y_t | x_{≤t+k}, y_{<t}; θ_full^MT)    (1)

Intuitively, wait-k commits one new target word upon receiving each new source word, after an initial wait of k source words. Similarly, in the case of streaming ASR, we can define z_t with growing speech chunks s_i that are fed in gradually.
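To make the policy concrete, here is a minimal Python sketch of test-time wait-k decoding around a full-sentence model. The `predict_next(src_prefix, tgt_prefix)` interface is a hypothetical placeholder for one greedy step of the underlying NMT model, not the authors' actual API.

```python
def test_time_wait_k(source_tokens, predict_next, k, eos="</s>", max_len=200):
    """Greedy test-time wait-k decoding (Eq. 1): after an initial wait of k
    source tokens, commit one target token per newly read source token."""
    src, tgt = [], []
    for x in source_tokens:              # source words arrive one at a time
        src.append(x)
        if len(src) < k:                 # still inside the initial wait
            continue
        y = predict_next(src, tgt)       # one greedy step on the source prefix
        if y == eos:
            return tgt
        tgt.append(y)
    while len(tgt) < max_len:            # source finished: decode the tail
        y = predict_next(src, tgt)
        if y == eos:
            break
        tgt.append(y)
    return tgt
```

Because the model itself is a full-sentence model, only the decoding loop changes, which is what makes test-time wait-k cheap to apply.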

3 Direct Simultaneous Translation with Synchronized Streaming ASR

In text-to-text simultaneous translation, the input stream is already segmented. However, when we deal with speech frames as source inputs, it is not easy to determine the number of valid tokens within a given speech segment. Therefore, to better guide the translation policy, it is essential to detect the number of valid tokens accurately and with low latency. Different from the sophisticated speech segmenter designed in Ren et al. (2020), we propose a simple but effective method that runs a synchronized streaming ASR decoder and uses its beam to determine the number of words within a given speech segment. Note that we only use streaming ASR for source word counting; the translation decoder does not condition on any of ASR's output.

Figure 2: Decoding for synchronized streaming ASR and E2E-ST. Speech signals are fed into the encoder chunk by chunk. For each incoming speech chunk, we look at the current streaming ASR beam (B) to decide the translation policy. See details in Algorithm 1.

3.1 Streaming ASR-Guided Simultaneous ST

As shown in Fig. 2, at inference time the speech signals are fed into the ST encoder as a series of fixed-size chunks s[1:i] = [s_1, ..., s_i], where w = |s_i| can be chosen from 32, 48, or 64 frames of spectrogram. Due to the CNN layers in the encoder, there is a downsampling rate r (e.g., we use r = 4) from spectrogram frames to encoder hidden states. For example, when we receive a chunk of 32 frames, the encoder generates 8 more hidden states. In conventional streaming ASR, the number of beam search steps equals the number of hidden states.
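The chunk/step arithmetic can be summarized in a few lines of Python. This is only an illustration of the bookkeeping under the values above (w spectrogram frames per chunk, downsampling rate r), with helper names of our own; τ(j) is the quantity defined formally below.

```python
import math

W, R = 32, 4   # chunk size in spectrogram frames; CNN downsampling rate

def new_states_per_chunk(w=W, r=R):
    """Each incoming chunk of w frames yields w // r new encoder hidden
    states, i.e. w // r additional streaming-ASR beam-search steps."""
    return w // r

def tau(j, w=W, r=R):
    """tau(j) = ceil(j * r / w): how many speech chunks must have arrived
    before ASR beam-search step j can be taken."""
    return math.ceil(j * r / w)

assert new_states_per_chunk(32, 4) == 8   # the example in the text
assert tau(8, 32, 4) == 1                 # steps 1..8 only need the first chunk
assert tau(9, 32, 4) == 2                 # step 9 needs the second chunk
```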

We denote by B_j the beam at time step j, which is an ordered list of size b; it expands to the next beam B_{j+1} of the same size:

B_0 = [⟨<s>, p_full^ASR(<s> | s_0; θ)⟩]

B_j = top_b(next(B_{j−1}, j))

next(B, j) = { ⟨z ∘ z_j, p · p_full^ASR(z_j | s_{≤τ(j)}, z; θ)⟩ | ⟨z, p⟩ ∈ B, z_j ∈ V }

where top_b(·) returns the top b candidates, and next(B, j) expands the candidates from the previous step to the next step. Each candidate is a pair ⟨z, p⟩, where z is the current prefix and p is the accumulated probability from the joint score of an external language model, CTC, and the ASR probability p_full^ASR. We denote the number of observable speech chunks at step j as τ(j) = ⌈j · r / w⌉; conversely, for each new speech chunk, the ASR beam search advances by w/r steps.
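A minimal Python sketch of one such beam-search step follows. The `score_next(prefix, chunks)` function is a hypothetical stand-in for the joint LM/CTC/ASR scoring of candidate next tokens (returning log-probabilities), not the authors' implementation.

```python
import heapq

def next_candidates(beam, score_next, chunks):
    """next(B, j): extend every hypothesis <z, logp> in the beam by one token.
    score_next(prefix, chunks) is assumed to return {token: log_prob}, combining
    the external LM, CTC, and ASR scores over the observable speech chunks."""
    expanded = []
    for prefix, logp in beam:
        for token, token_logp in score_next(prefix, chunks).items():
            expanded.append((prefix + [token], logp + token_logp))
    return expanded

def top_b(candidates, b):
    """top_b(.): keep the b highest-scoring hypotheses."""
    return heapq.nlargest(b, candidates, key=lambda cand: cand[1])

def advance_beam(beam, score_next, chunks, b=5):
    """One streaming-ASR step: B_j = top_b(next(B_{j-1}, j))."""
    return top_b(next_candidates(beam, score_next, chunks), b)
```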

Figure 3: An example of streaming ASR beam search with beam size 3. LCP is shaded in red (φ_LCP(B_7) = 3); SH is highlighted in bold (φ_SH(B_7) = 5). We use • to represent empty outputs in some steps caused by CTC.

Algorithm 1 Streaming ASR-guided Simultaneous ST

1: Input: speech chunks s[1:T]; k; φ_π(B_j); streaming decoding models p_full^ST and p_full^ASR
2: Initialize: ASR and ST indices j = t = 0; B = B_0
3: for i = 1 ∼ T do                        ▷ feed speech chunks
4:   repeat w/r steps                       ▷ do ASR beam search for w/r steps
5:     B ← top_b(next(B, j)); j++           ▷ ASR beam search
6:   while φ_π(B) − k > t do                ▷ new tokens?
7:     y_{t+1} ← p_wait-k^ST(y_{t+1} | s[1:i+1], y_{≤t}; θ_full^ST)
8:     yield y_{t+1}; t++                   ▷ commit translation to user
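As a rough executable analogue of Algorithm 1, the sketch below wires the pieces together. Here `phi` is one of the counting policies φ_π defined next, `advance_beam` performs a single streaming-ASR beam-search step (e.g., the earlier sketch with the scorer bound in), and `translate_next` stands in for one greedy wait-k ST step conditioned on all speech received so far; all three interfaces are our own simplifications, not the authors' code.

```python
def asr_guided_simultaneous_st(chunks, k, phi, advance_beam, translate_next,
                               init_beam, w=32, r=4, eos="</s>"):
    """Sketch of Algorithm 1: streaming ASR-guided simultaneous ST.
    Flushing the remaining translation after the final chunk is left
    implicit, as in Algorithm 1."""
    beam, received, target = init_beam, [], []
    for chunk in chunks:                          # line 3: feed speech chunks
        received.append(chunk)
        for _ in range(w // r):                   # lines 4-5: advance ASR beam w/r steps
            beam = advance_beam(beam, received)
        while phi(beam) - k > len(target):        # line 6: enough new source tokens?
            y = translate_next(received, target)  # line 7: one wait-k ST step
            if y == eos:
                return
            target.append(y)
            yield y                               # line 8: commit translation to user
```

Here phi(beam) plays the role of φ_π(B_j), so switching between the LCP and SH policies changes only the counting function, not the decoding loop.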

Note that CTC often commits empty tokens ε due to empty speech frames, and the lengths of different hypotheses within the streaming ASR beam can be quite different from each other. To take every hypothesis into consideration, we design two policies to decide the number of valid tokens.

• Longest Common Prefix (LCP) uses the length of the longest shared prefix in the streaming ASR beam as the number of valid tokens within the given speech. This is the most conservative strategy, with latency similar to cascaded methods.

• Shortest Hypothesis (SH) uses the length of the shortest hypothesis in the current streaming ASR beam as the number of valid tokens.

More formally, let φ_π(B) denote the number of valid tokens in the beam B under policy π:

φ_LCP(B) = max{ i | ∃ z′ s.t. ∀ ⟨z, c⟩ ∈ B, z_{≤i} = z′ }

φ_SH(B) = min{ |z| : ⟨z, c⟩ ∈ B }

For example, in Fig. 3, φ_LCP(B_7) = 3 and φ_SH(B_7) = 5. Also note that φ_LCP(B) ≤ φ_SH(B) for any beam B, and that both policies are monotonic, i.e., φ_π(B_j) ≤ φ_π(B_{j+1}) for π ∈ {LCP, SH} and all j.
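Both counting policies are easy to implement. The sketch below assumes each beam entry is a (token_list, score) pair as in the earlier sketches (our own representation, not the paper's); the toy beam at the end is made up for illustration.

```python
def phi_lcp(beam):
    """Longest Common Prefix: length of the longest prefix shared by all hypotheses."""
    prefixes = [tokens for tokens, _ in beam]
    length = 0
    for i in range(min(len(p) for p in prefixes)):
        if all(p[i] == prefixes[0][i] for p in prefixes):
            length += 1
        else:
            break
    return length

def phi_sh(beam):
    """Shortest Hypothesis: length of the shortest hypothesis in the beam."""
    return min(len(tokens) for tokens, _ in beam)

# Toy beam of three hypotheses: phi_lcp(beam) <= phi_sh(beam) always holds.
beam = [(["can", "I", "be", "honest"], -1.2),
        (["can", "I", "be", "on", "this"], -1.5),
        (["can", "I", "be", "on"], -1.9)]
assert phi_lcp(beam) == 3 and phi_sh(beam) == 4
```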

Note that we always feed all observable speech segments into ST for translation, and the streaming ASR-generated transcription is not used for translation. Thus LCP may have latency similar to cascaded methods, but its translation accuracy is much better, because more source-side information is revealed to the translation decoder.

As shown in Algorithm 1, during simultaneous ST we monitor the value of φ_π(B_j) while speech chunks are gradually fed into the system. Whenever φ_π(B) − k > t, where t is the number of already translated tokens, the ST decoder is triggered to generate one new token as follows:

y_t = argmax_{y_t} p_wait-k(y_t | s[1:τ(j)], y_{<t}; θ_full^ST)    (2)

Figure 4: We use the full-sentence MTL framework to jointly learn ASR and ST with a shared encoder (vanilla E2E-ST model vs. E2E-ST with ASR MTL).


3.2 Joint Training between ST and ASR

Different from existing simultaneous translation solutions (Ren et al., 2020; Ma et al., 2020b,a), which adapt the vanilla E2E-ST architecture (shown by the gray line in Fig. 4), we instead use a simple MTL architecture that performs joint full-sentence training of ST and ASR:

θ_full^ST, θ_full^ASR = argmax_{θ_full^ST, θ_full^ASR} ∏_{(s, y*, z*) ∈ D} p_full^ST(y* | s; θ_full^ST) · p_full^ASR(z* | s; θ_full^ASR)

For ASR training, we use the hybrid CTC/attention framework (Watanabe et al., 2017). Note that we train the ASR and ST MTL in full-sentence fashion for simplicity and training efficiency, and only apply the wait-k decoding policy at inference time. Also, θ_full^ST and θ_full^ASR share the same speech encoder.
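For concreteness, here is a minimal PyTorch-style sketch of the joint objective with a shared speech encoder and two attention decoders. The module interfaces and loss combination are our own illustrative simplifications (in particular, the CTC branch of the hybrid CTC/attention ASR training is omitted), not the authors' exact configuration.

```python
import torch.nn as nn

class MultitaskSTASR(nn.Module):
    """Shared speech encoder with separate ST and ASR decoders (cf. Fig. 4).
    `encoder`, `st_decoder`, and `asr_decoder` are assumed to be standard
    Transformer modules returning per-token logits under teacher forcing."""

    def __init__(self, encoder, st_decoder, asr_decoder, pad_id=0):
        super().__init__()
        self.encoder = encoder
        self.st_decoder = st_decoder
        self.asr_decoder = asr_decoder
        self.ce = nn.CrossEntropyLoss(ignore_index=pad_id)

    def forward(self, speech, target, transcript):
        h = self.encoder(speech)                              # shared speech representations
        st_logits = self.st_decoder(target[:, :-1], h)        # predict y* (translation)
        asr_logits = self.asr_decoder(transcript[:, :-1], h)  # predict z* (transcription)
        # Maximizing the product of the two likelihoods over D is equivalent to
        # minimizing the sum of the two negative log-likelihoods.
        st_loss = self.ce(st_logits.reshape(-1, st_logits.size(-1)),
                          target[:, 1:].reshape(-1))
        asr_loss = self.ce(asr_logits.reshape(-1, asr_logits.size(-1)),
                           transcript[:, 1:].reshape(-1))
        return st_loss + asr_loss
```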

4 Experiments

We conduct experiments on English-to-German (En→De) and English-to-Spanish (En→Es) translation on MuST-C (Di Gangi et al., 2019). We employ the Transformer (Vaswani et al., 2017) as the basic architecture and an LSTM (Hochreiter and Schmidhuber, 1997) for the LM. For streaming ASR decoding we use a beam size of 5. Translation decoding is greedy due to incremental commitment.

Raw audio is processed with Kaldi (Povey et al., 2011) to extract 80-dimensional log-Mel filterbanks stacked with 3-dimensional pitch features, using a 10 ms step size and a 25 ms window size. Text is processed by SentencePiece (Kudo and Richardson, 2018) with a joint vocabulary size of 8K. We take the Transformer (Vaswani et al., 2017) as our base architecture, preceded by 2 layers of 2D convolution with kernel size 3 and stride 2.
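A rough preprocessing sketch along these lines, using torchaudio's Kaldi-compatible filterbank extraction and SentencePiece; the file paths are placeholders, and the 3-dimensional pitch features the paper stacks on top of the filterbanks are not computed here.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi
import sentencepiece as spm

# 80-dim log-Mel filterbanks with a 25 ms window and 10 ms shift (Kaldi-compatible).
waveform, sample_rate = torchaudio.load("example.wav")          # placeholder path
fbank = kaldi.fbank(waveform,
                    num_mel_bins=80,
                    frame_length=25.0,
                    frame_shift=10.0,
                    sample_frequency=sample_rate)
print(fbank.shape)                                              # (num_frames, 80)

# Joint 8K subword vocabulary with SentencePiece.
spm.SentencePieceTrainer.train(input="train.en-de.txt",         # placeholder corpus
                               model_prefix="joint8k",
                               vocab_size=8000)
sp = spm.SentencePieceProcessor(model_file="joint8k.model")
print(sp.encode("can I be honest", out_type=str))
```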

Figure 5: Translation quality vs. latency (En-to-De dev set, BLEU vs. AL; En-to-Es test set, BLEU vs. AP). The dots on each curve represent different wait-k policies with k = 1, 3, 5, 7 from left to right. Baseline* results are from Ma et al. (2020b). k = inf is full-sentence decoding for ASR and translation. test-k denotes test-time wait-k. We use a chunk size of 48.

The Transformer model has 12 encoder layers and 6 decoder layers. Each layer has 4 attention heads with a hidden size of 256. Our streaming ASR decoding method follows Moritz et al. (2020), and we employ a 10-frame look-ahead for all experiments. For the LM, we use a 2-layer stacked LSTM (Hochreiter and Schmidhuber, 1997) with 1024-dimensional hidden states, and set the embedding size to 1024. LMs are trained on the English transcriptions from the corresponding language pair in the MuST-C corpus. For the cascaded model, we train ASR and MT models on the MuST-C dataset separately; they have the same Transformer architecture as our ST model. Our experiments are run on 8 1080Ti GPUs, and we report case-sensitive detokenized BLEU.

Translation quality against latency  To compare clearly with related work, we evaluate latency with AL as defined in Ma et al. (2020b) and AP as defined in Ren et al. (2020). As shown in Fig. 5, the En→De results are reported on the dev set to be consistent with Ma et al. (2020b). Compared with the baseline models, our method achieves much better translation quality at similar latency. To further validate the effectiveness of our method, we compare it with Ren et al. (2020) on En→Es translation.
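For readers unfamiliar with the latency metrics, the token-level Average Lagging (AL) of Ma et al. (2019) is sketched below; the speech-level AL used in Fig. 5 and Fig. 7 follows the same structure but measures delay in time rather than in tokens, so this is only an illustrative approximation.

```python
def average_lagging(g, src_len, tgt_len):
    """Token-level AL (Ma et al., 2019): g[t-1] is the number of source tokens
    that had been read when target token t was committed (t is 1-indexed)."""
    r = tgt_len / src_len                     # target-to-source length ratio
    # tau: index of the first target token emitted after the full source was read
    tau = next(t for t, read in enumerate(g, start=1) if read >= src_len)
    return sum(g[t - 1] - (t - 1) / r for t in range(1, tau + 1)) / tau

# Toy example: a wait-3 schedule on a 6-token source with 6 target tokens
# lags the speaker by exactly 3 tokens.
print(average_lagging([3, 4, 5, 6, 6, 6], src_len=6, tgt_len=6))   # 3.0
```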

Chunk index:      1 2 3 4 5 6 end
Gold transcript:  can I be honest SIL I don 't love that question SIL
Gold translation: Darf ich ehrlich sein ? Ich mag diese Frage nicht .
Streaming ASR:    can I be on this I don 't love that question
simul-MT wait-3:  Kann ich da sein ? “ Ich liebe diese Frage nicht .
SH wait-3:        Kann ich ehrlich sein ? Ich liebe diese Frage nicht .
LCP wait-3:       Kann ich ehrlich sein ? Ich liebe diese Frage nicht .

Figure 6: An example from the dev set of En→De translation. In the cascaded approach (streaming ASR + simul-MT wait-3), the ASR error ("on this" for "honest") is propagated to the MT module, causing the wrong translation ("da"). Our methods give accurate translations ("ehrlich") with better latency (especially for the SH policy, where the output of "diese Frage" is synchronous with hearing "that question"). "SIL" denotes silence in speech.

Figure 7: Translation quality against latency on the test sets (En-to-De and En-to-Es; BLEU vs. AL). Each curve represents decoding with the wait-k policy, k = 1, 3, 5, 7 from left to right. The dashed lines and hollow markers indicate the latency considering the computational time. The chunk size is 48.

Their method does not evaluate the plausibility of the detected tokens, so it has a more aggressive decoding policy, which results in lower latency. However, our method can still achieve better results with slightly lower latency. Besides that, our model is trained in full-sentence mode and only decodes with wait-k at inference time, which makes it very efficient to train. Our test-time wait-k achieves quality similar to their genuine (i.e., retrained) wait-k models, which are very slow to train. When we compare with their test-time wait-k, our model significantly outperforms theirs.

We further evaluate our method on the test sets of En→De and En→Es translation. As shown in Fig. 7, compared with the cascaded model, our model achieves notable gains in both latency and translation quality. To verify the online usability of our model, we also show computation-aware latency.

Model       |        En→De        |        En→Es
            | w=32   w=48   w=64  | w=32   w=48   w=64
LCP         | 17.31  17.54  17.95 | 21.94  21.92  22.36
 − LM       | 14.60  15.66  15.91 | 18.54  19.15  19.95
 − LM & AD  | 13.76  14.82  15.26 | 17.42  18.06  19.32
SH          | 16.04  15.82  15.87 | 20.45  20.18  19.84
 − LM       | 13.76  14.01  13.84 | 17.31  17.21  17.78
 − LM & AD  | 10.44  11.25  11.65 | 13.61  14.27  14.62

Table 1: BLEU scores of wait-1 decoding with different chunk sizes and ASR scoring functions. AD denotes the ASR decoder; LM denotes the language model.

Because our chunk window is 480 ms and the latency caused by computation is smaller than this window size, we can finish decoding the previous speech chunk before the next speech chunk needs to be processed, so our model can be used online effectively.

Fig. 6 demonstrates that our method can effectively avoid error propagation and obtain better latency compared to the cascaded model.

Effect of chunk size and joint decision  Table 1 shows that the results are relatively stable across chunk sizes, which gives flexibility in balancing response frequency against computational capacity. We also explore the effectiveness of joint ASR scoring and observe that the translation quality drops considerably without the LM. Without the LM and the ASR decoder, our token-counting approach becomes similar to the speech segmentation in Ren et al. (2020), which implies that their model struggles to segment the source speech accurately, leading to unreliable translation decisions for ST.

5 Conclusion

We proposed a simple but effective ASR-assisted simultaneous E2E-ST framework. The streaming ASR module can guide (but does not give direct input to) the wait-k policy for simultaneous translation. Our method improves ST accuracy at similar latency.

Acknowledgments

This work is supported in part by NSF IIS-1817231 and IIS-2009071.

References

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In Meeting of the Association for Computational Linguistics.

Naveen Arivazhagan, Colin Cherry, Isabelle Te, Wolfgang Macherey, Pallavi Baljekar, and George Foster. 2020. Re-translation strategies for long form, simultaneous, spoken language translation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7919–7923. IEEE.

Junkun Chen, Mingbo Ma, Renjie Zheng, and Liang Huang. 2020. MAM: Masked acoustic modeling for end-to-end speech-to-text translation. arXiv preprint arXiv:2010.11445.

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers).

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In NAACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium. Association for Computational Linguistics.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Xutai Ma, Juan Pino, and Philipp Koehn. 2020a. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing.

Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, P. Koehn, and J. Pino. 2020b. Streaming simultaneous speech translation with augmented memory transformer. ArXiv, abs/2011.00033.

Niko Moritz, Takaaki Hori, and Jonathan Le. 2020. Streaming automatic speech recognition with the transformer model. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6074–6078. IEEE.

Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Nagendra Goel, Mirko Hannemann, Yanmin Qian, Petr Schwarz, and Georg Stemmer. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop.

Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2020. SimulSpeech: End-to-end simultaneous speech to text translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30.

Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, and Ming Zhou. 2020. Low latency end-to-end streaming speech recognition with a scout network. arXiv preprint arXiv:2003.10369.

S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition.

Hao Xiong, Ruiqing Zhang, Chuanqiang Zhang, Zhongjun He, Hua Wu, and Haifeng Wang. 2019. Dutongchuan: Context-aware translation model for simultaneous interpreting. arXiv preprint arXiv:1907.12984.

Baigong Zheng, Kaibo Liu, Renjie Zheng, Mingbo Ma, Hairong Liu, and Liang Huang. 2020a. Simultaneous translation policies: From fixed to adaptive. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019a. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019b. Simultaneous translation with flexible policy via restricted imitation learning. In ACL.

Renjie Zheng, Junkun Chen, Mingbo Ma, and Liang Huang. 2021. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation. In Proceedings of the 38th International Conference on Machine Learning.

Renjie Zheng, Mingbo Ma, Baigong Zheng, Kaibo Liu, Jiahong Yuan, Kenneth Church, and Liang Huang. 2020b. Fluent and low-latency simultaneous speech-to-speech translation with self-adaptive training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 3928–3937.

Renjie Zheng, Mingbo Ma, Baigong Zheng, Kaibo Liu,Jiahong Yuan, Kenneth Church, and Liang Huang.2020b. Fluent and low-latency simultaneous speech-to-speech translation with self-adaptive training. InProceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing: Findings,pages 3928–3937.