Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA

1. Overview
• Goal: learn a linear mapping between the speech and text embedding spaces

2. Learning Embeddings
Text Embedding Space
• Train Word2Vec [Mikolov et al., 2013] on the text corpus (see the training sketch after the references)
• Unsupervised learning of distributed word representations that model word semantics
Speech Embedding Space
• Train Speech2Vec [Chung and Glass, 2018] on the speech corpus (see the architecture sketch after the references)
• The corpus is pre-processed by an off-the-shelf speech segmentation algorithm so that utterances are segmented into audio segments corresponding to spoken words
• Speech version of Word2Vec: unsupervised semantic representations of audio segments

3. Embedding Spaces Alignment
• Both embedding spaces are learned from corpora under the distributional hypothesis (e.g., via skip-grams) → they are approximately isomorphic
• Construct a synthetic mapping dictionary to learn a linear mapping matrix W between the two embedding spaces
Adversarial Training
• Train a discriminator to distinguish mapped speech embeddings from text embeddings, while W is trained to make the two indistinguishable (see the sketch after the references)
Refinement (Orthogonal Procrustes Problem)
• Use the W learned from the adversarial training step as an initial proxy to build a synthetic parallel dictionary
• Consider only the most frequent words
• Solve for the refined orthogonal W in closed form (see the sketch after the references)
Method Comparison
[Figure in the original poster: Method Comparison]

4. Experimental Settings
Datasets
[Table in the original poster: Datasets]
Details of Training
• Speech2Vec trained with skip-grams, window = 3
• Encoder: single-layer bidirectional LSTM
• Decoder: single-layer unidirectional LSTM
• SGD with a fixed learning rate of 0.001
• Word2Vec: fastText implementation
• Both embedding dimensions = 50
• Discriminator for adversarial training: 2 layers, 512 neurons, ReLU

5. Results
Task I | Spoken Word Recognition
• Accuracy decreases as the level of supervision decreases
• The unsupervised alignment approach is almost as effective as its supervised counterpart (A vs. A*)
• Word segmentation is a critical step
• Applies to different corpora settings
Task II | Spoken Word Synonyms Retrieval
• The output actually contains both synonyms and different lexical forms of the audio segment
• Synonyms should therefore also be considered valid retrieval results
Task III | Spoken Word Translation
• More supervision yields better performance
• Translation using the same corpus outperforms translation using different corpora

References
• Chung and Glass. Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech. INTERSPEECH, 2018.
• Lample et al. Word translation without parallel data. ICLR, 2018.
• Mikolov et al. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
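Sketch: training the text embedding space (Section 2). This is a minimal illustration assuming gensim's Word2Vec as a stand-in for the fastText implementation named in Section 4; the hyperparameters (skip-gram, window = 3, dimension = 50) are the ones listed on the poster, and the toy corpus is purely illustrative.

```python
# Minimal sketch of training the text embedding space.
# Assumption: gensim's Word2Vec stands in for the fastText implementation.
from gensim.models import Word2Vec

# `sentences` would be the tokenized transcripts of the text corpus.
sentences = [["the", "quick", "brown", "fox"], ["hello", "world"]]  # toy data

model = Word2Vec(
    sentences=sentences,
    vector_size=50,  # embedding dimension = 50, matching the poster
    window=3,        # context window = 3, matching the skip-gram setting
    sg=1,            # skip-gram rather than CBOW
    min_count=1,     # keep all words in this toy example
)
text_vec = model.wv["hello"]  # a 50-dim text embedding
```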
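Sketch: the Speech2Vec encoder-decoder (Section 4). The layer configuration (single-layer bidirectional LSTM encoder, single-layer unidirectional LSTM decoder, 50-dim embeddings, SGD at 0.001) follows the poster; the 13-dim MFCC input features and the use of the concatenated final encoder states as the word embedding are assumptions based on the Speech2Vec paper, and the decoder input scheme is a simplification.

```python
# Sketch of the Speech2Vec seq2seq model described in Section 4.
# Assumptions: 13-dim MFCC frames; final encoder states form the embedding.
import torch
import torch.nn as nn

class Speech2VecSketch(nn.Module):
    def __init__(self, n_mfcc=13, emb_dim=50):
        super().__init__()
        # Encoder: single-layer bidirectional LSTM; each direction outputs
        # emb_dim // 2 so the concatenated state is 50-dim.
        self.encoder = nn.LSTM(n_mfcc, emb_dim // 2, num_layers=1,
                               bidirectional=True, batch_first=True)
        # Decoder: single-layer unidirectional LSTM reconstructing the MFCC
        # frames of a context (skip-gram) word from the embedding.
        self.decoder = nn.LSTM(emb_dim, emb_dim, num_layers=1, batch_first=True)
        self.out = nn.Linear(emb_dim, n_mfcc)

    def embed(self, segment):                   # segment: (batch, frames, n_mfcc)
        _, (h, _) = self.encoder(segment)       # h: (2, batch, emb_dim // 2)
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 50) word embedding

    def forward(self, segment, target_len):
        z = self.embed(segment)
        # Feed the embedding at every decoder step (a common simplification).
        dec_in = z.unsqueeze(1).repeat(1, target_len, 1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)                # reconstructed MFCC frames

# Training would minimize a reconstruction loss (e.g., MSE) with SGD at a
# fixed learning rate of 0.001, per Section 4.
```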
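Sketch: adversarial alignment of the two spaces (Section 3), in the style of Lample et al. [2018]: the discriminator tries to tell mapped speech embeddings W·x from text embeddings y, while W is trained to fool it. The discriminator size (2 layers, 512 ReLU units) is from Section 4; the optimizers, learning rates, and sampling details are illustrative assumptions, and the label-smoothing tricks of Lample et al. are omitted.

```python
# Sketch of adversarial training of the linear map W (Section 3).
# Optimizer choices and learning rates are illustrative assumptions.
import torch
import torch.nn as nn

dim = 50
W = nn.Linear(dim, dim, bias=False)       # the linear mapping matrix
D = nn.Sequential(                        # discriminator: 2 x 512, ReLU
    nn.Linear(dim, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 1),
)
bce = nn.BCEWithLogitsLoss()
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(D.parameters(), lr=0.1)

def adversarial_step(x_speech, y_text):
    """x_speech, y_text: (batch, dim) embeddings sampled from each space."""
    # 1) Discriminator step: label mapped speech 0, real text 1.
    opt_d.zero_grad()
    pred = torch.cat([D(W(x_speech).detach()), D(y_text)])
    label = torch.cat([torch.zeros(len(x_speech), 1),
                       torch.ones(len(y_text), 1)])
    bce(pred, label).backward()
    opt_d.step()
    # 2) Mapping step: train W so mapped speech is classified as text.
    opt_w.zero_grad()
    bce(D(W(x_speech)), torch.ones(len(x_speech), 1)).backward()
    opt_w.step()
```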
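Sketch: the refinement step (Section 3). Given the synthetic parallel dictionary built from the most frequent words under the adversarially learned W, with rows of X (speech embeddings) aligned to rows of Y (text embeddings), the orthogonal Procrustes problem has a closed-form SVD solution. The function name and the NumPy formulation are mine; the closed form itself is standard.

```python
# Sketch of the Procrustes refinement (Section 3): find the orthogonal W
# minimizing ||X W - Y||_F over the synthetic parallel dictionary.
import numpy as np

def procrustes(X, Y):
    """X, Y: (n_pairs, dim) row-aligned embedding matrices."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt          # orthogonal mapping W, so that X @ W ≈ Y

# Usage: pair each frequent speech embedding with its nearest text neighbor
# under the adversarially learned W, then refine:
# W_refined = procrustes(X_dict, Y_dict)
```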