PDAugment: Data Augmentation by Pitch and Duration Adjustments for Automatic Lyrics Transcription

Chen Zhang1, Jiaxing Yu1, LuChin Chang1, Xu Tan2, Jiawei Chen3, Tao Qin2, Kejun Zhang1

1Zhejiang University, China  2Microsoft Research Asia  3South China University of Technology

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Automatic lyrics transcription (ALT), which can be regarded as automatic speech recognition (ASR) on singing voice, is an interesting and practical topic in academia and industry. ALT has not been well developed mainly due to the dearth of paired singing voice and lyrics datasets for model training. Considering that there is a large amount of ASR training data, a straightforward method is to leverage ASR data to enhance ALT training. However, the improvement is marginal when training the ALT system directly with ASR data, because of the gap between singing voice and standard speech data, which is rooted in the music-specific acoustic characteristics of singing voice. In this paper, we propose PDAugment, a data augmentation method that adjusts the pitch and duration of speech at syllable level under the guidance of music scores to help ALT training. Specifically, we adjust the pitch and duration of each syllable in natural speech to those of the corresponding note extracted from music scores, so as to narrow the gap between natural speech and singing voice. Experiments on the DSing30 dataset and the Dali corpus show that the ALT system equipped with our PDAugment outperforms previous state-of-the-art systems by 5.9% and 18.1% WERs respectively, demonstrating the effectiveness of PDAugment for ALT.

Content Areas – automatic lyrics transcription; data augmentation; automatic speech recognition; singing voice.

1 Introduction

Automatic lyrics transcription (ALT), which recognizes lyrics from singing voice, is useful in many applications, such as lyrics-to-music alignment, query-by-singing, karaoke performance evaluation, keyword spotting, and so on. ALT on singing voice can be regarded as a counterpart of automatic speech recognition (ASR) on natural speech. Although ASR has witnessed rapid progress and brought convenience to people's daily lives in recent years (Graves et al. 2006; Graves 2012; Chan et al. 2016; Park et al. 2019; Li et al. 2020; Xu et al. 2020), there is no ALT system with the same level of accuracy and robustness as current ASR systems. The main challenge in developing a robust ALT system is the scarcity of paired singing voice and lyrics datasets available for ALT model training. To make matters worse, ALT is a more challenging task than ASR: the same content accompanied by different melodies produces different pitches and durations, which leads to the sparsity of training data and further aggravates the lack of data. A straightforward remedy is to use speech data to enhance the ALT training data, but the performance gain is small because there are significant differences between speech and singing voice. For example, singing voice has music-specific acoustic characteristics (Kruspe and Fraunhofer 2016; Mesaros and Virtanen 2009) (details can be seen in Section 2.2): large variation of syllable duration and highly flexible pitch contours are very common in singing but rarely seen in speech (Tsai, Tuan, and Lee 2018).

Previous work has made some attempts at using speech data to improve the performance of ALT systems. Fujihara et al. (2006) pretrained a recognizer with speech data and then built a language model containing the vowel sequences in the lyrics to introduce knowledge from the music domain; however, this only took advantage of the semantic information from lyrics and did not consider the acoustic properties of singing voice. Mesaros and Virtanen (2009) adapted a pre-trained GMM-HMM based speech recognizer to the singing voice domain with a speaker adaptation technique, but it only shifted the means and variances of the GMM components based on global statistics without considering local information, resulting in very limited improvement. Other work (Kruspe and Fraunhofer 2015; Basak et al. 2021) tried to artificially generate “song-like” data from speech for model training. Kruspe and Fraunhofer (2015) applied time stretching and pitch shifting to natural speech in a random manner, which enriches the distribution of pitch and duration in the “songified” speech data to a certain extent. Nonetheless, the adjustments are random, so there is still a gap between the patterns of the “songified” speech data and those of real singing voice. Compared to Kruspe and Fraunhofer (2015), Basak et al. (2021) further took advantage of real singing voice data, transferring natural speech to the singing voice domain under the guidance of real opera data. But it only took the pitch contours into account, ignoring duration, another key characteristic. Besides, it directly replaced the pitch contours with those of the real opera data without considering the alignment of notes and syllables, which may result in low quality of the synthesized audio.

In this paper, we propose PDAugment, a syllable-level data augmentation method that adjusts pitch and duration under the guidance of music scores to generate more consistent training data for ALT training. In order to narrow the gap between the adjusted speech and singing voice, we adjust the speech at syllable level to make it more in line with the characteristics of singing voice. We try to make the speech closely fit the music score, so as to achieve the effect of “singing out” the speech content. PDAugment adjusts natural speech data in the following steps: 1) extracts note information from music scores to obtain the pitch and duration patterns of music; 2) aligns the speech and notes at syllable level; 3) adjusts the pitch and duration of syllables in natural speech to match those of the aligned notes. By doing so, PDAugment can add music-specific acoustic characteristics into the adjusted speech in a reasonable manner, and narrow the gap between adjusted speech and singing voice.

Our contributions can be summarized as follows:

• We develop PDAugment, a data augmentation method that enhances the training data for ALT by adjusting the pitch and duration of natural speech at syllable level with note information extracted from real music scores.

• We conduct experiments on two singing voice datasets: the DSing30 dataset and the Dali corpus. The adjusted LibriSpeech corpus is combined with the singing voice corpus for ALT training. On the DSing30 dataset, the ALT system with PDAugment outperforms the previous state-of-the-art system and Random Aug by 5.9% and 7.8% WERs respectively. On the Dali corpus, PDAugment outperforms them by 24.9% and 18.1% WERs respectively. Compared to adding ASR data directly into the training data, our PDAugment achieves 10.7% and 21.7% WER reductions on the two datasets.

• We analyze the adjusted speech with statistics and visualization, and find that PDAugment significantly narrows the gap between speech and singing voice. At the same time, the adjusted speech keeps relatively good quality (audio samples can be found in the supplementary materials).

2 Background

In this section, we introduce the background of this work, including an overview of previous work related to automatic lyrics transcription and the differences between speech and singing voice.

2.1 Automatic Lyrics Transcription

The lyrics of a song provide the textual information of singing voice and are as important as the melody in contributing to listeners' emotional perception (Ali and Peynircioglu 2006). Automatic lyrics transcription (ALT) aims to recognize lyrics from singing voice. In automatic music information retrieval and music analysis, lyrics transcription plays a role as important as melody extraction (Hosoya et al. 2005). However, ALT is a more challenging task than ASR: unlike speech, in singing voice the same content aligned with different melodies will have different pitches and durations, which results in the sparsity of training data. It is therefore more difficult to build an ALT system than an ASR system without enough training data.

Some work took advantage of the characteristics of music itself. Gupta, Li, and Wang (2018) extended the length of pronounced vowels in output sequences by increasing the probability of a frame sharing the same phoneme as a preceding vowel frame, because there are many long vowels in singing voice. Kruspe and Fraunhofer (2016) boosted the ALT system by using a newly generated alignment (Mesaros and Virtanen 2008) of singing and lyrics. Gupta, Yılmaz, and Li (2020) tried to use the background music as extra information to improve recognition accuracy. However, these methods only designed hard constraints or added extra information according to knowledge from the music domain, and still did not solve the problem of data scarcity.

Considering the lack of singing voice databases, some work aimed at providing relatively large singing voice datasets: Dabike and Barker (2019) collected the DSing dataset from real-world user recordings. Demirel, Ahlback, and Dixon (2020) built a cascaded pipeline with convolutional time-delay neural networks with self-attention on the DSing30 dataset and provided a baseline for the ALT task. Other work leveraged natural speech data for ALT training, starting from pre-trained automatic speech recognition models and then making adaptations to improve performance on singing voice: Fujihara et al. (2006) built a language model containing the vowel sequences in the lyrics, but only used the semantic information from lyrics and ignored the acoustic properties of singing voice. Mesaros and Virtanen (2009) used a speaker adaptation technique by shifting the GMM components with global statistics only, without considering local information.

However, singing voice has some music-specific acoustic characteristics that are absent in speech, which limits performance when training the ALT system directly with natural speech data. Some work tried to synthesize “song-like” data from natural speech to make up for this gap: Kruspe and Fraunhofer (2015) generated “songified” speech data by time stretching, pitch shifting, and adding vibrato. However, the degrees of these adjustments are randomly selected within a range, without using the patterns of real music. Basak et al. (2021) made use of the F0 contours in real opera data and converted speech to singing voice through style transfer. Specifically, they decomposed the F0 contours from the real opera data, obtained the spectral envelope and the aperiodicity parameter from the natural speech, and then used these parameters to synthesize a singing voice version of the original speech. Nonetheless, in real singing voice the notes and syllables are usually aligned, but Basak et al. (2021) did not perform any alignment of the F0 contours of singing voice with the speech signal. This misalignment may lead to pitch changes within a consonant phoneme (in normal circumstances, the pitch only changes between two phonemes or within vowels), which further causes distortion of the synthesized audio and limits the performance of the ALT system. Besides, they only adjusted the F0 contours, which is not enough to narrow the gap between speech and singing voice.

In this paper, we propose PDAugment, which improves the above adjustment methods by using real music scores and syllable-level alignment to adjust the pitch and duration of natural speech, so as to alleviate the problem of insufficient singing voice data.

2.2 Speech vs. Singing Voice

In a sense, singing voice can be considered a special form of speech, but there are still many discrepancies between them (Loscos, Cano, and Bonada 1999; Mesaros 2013; Kruspe and Fraunhofer 2014). These discrepancies make it inappropriate to transcribe lyrics from singing voice using a speech recognition model trained directly on ASR data. To demonstrate the discrepancies, we randomly select 10K sentences from LibriSpeech (Panayotov et al. 2015) as the speech corpus and from Dali (Meseguer-Brocal, Cohen-Hadria, and Peeters 2019) as the singing voice corpus, and compute statistics on them. Natural speech and singing voice mainly differ in the following aspects, and the analysis results are listed in Table 1.

Pitch. We extract the pitch contours from singing voice and speech, and compare the range and smoothness of the pitch. We use the semitone as the unit of pitch.

Pitch Range. Generally speaking, the range of pitch in singing voice is larger than that in speech. Loscos, Cano, and Bonada (1999) pointed out that the frequency range in singing voice can be much larger than in speech. For each sentence, we calculate the pitch range (the maximum pitch value minus the minimum pitch value in that sentence). After averaging the pitch range over all 10K sentences in the corpus, the average values are listed as Pitch Range in Table 1.

Pitch Smoothness. The pitch of each frame in a certain syllable (when it corresponds to a note) in singing voice remains almost constant, whereas in speech the pitch changes freely across the audio frames of a syllable. We refer to this characteristic of maintaining local stability within a syllable as Pitch Smoothness. Specifically, we calculate the pitch difference between every two adjacent frames in a sentence and average it across the entire corpus of 10K sentences. The smaller the value of Pitch Smoothness, the smoother the pitch contour. The results can be seen under Pitch Smoothness in Table 1.

Duration. We also analyze and compare the range and stability of syllable duration in singing voice and speech. The duration of each syllable varies a lot with the melody in singing voice, while in speech it depends on the pronunciation habits of the particular speaker.

Duration Range. For each sentence, we calculate the difference between the duration of the longest syllable and the shortest syllable as the duration range. The average values of the duration ranges over the entire corpus are shown as Duration Range in Table 1.

Duration Variance. We calculate the variance of the syllable durations in each sentence and average the variances of all sentences in the whole corpus. The results are listed as Duration Variance in Table 1 to reflect the flexibility of duration in singing voice.
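To make these four measures concrete, here is a minimal sketch of how they can be computed for one sentence (our own hypothetical helper; it assumes per-frame pitch in semitones and per-syllable durations in seconds have already been extracted, and uses the absolute adjacent-frame difference for smoothness):

```python
import numpy as np

def sentence_stats(frame_pitch_semitones, syllable_durations_s):
    """Compute the four per-sentence statistics reported in Table 1."""
    pitch = np.asarray(frame_pitch_semitones, dtype=float)  # voiced-frame pitch (semitones)
    dur = np.asarray(syllable_durations_s, dtype=float)     # syllable durations (seconds)

    pitch_range = pitch.max() - pitch.min()            # max minus min pitch in the sentence
    pitch_smoothness = np.abs(np.diff(pitch)).mean()   # mean adjacent-frame pitch difference
    duration_range = dur.max() - dur.min()             # longest minus shortest syllable
    duration_variance = dur.var()                      # variance of syllable durations
    return pitch_range, pitch_smoothness, duration_range, duration_variance

# Corpus-level numbers are obtained by averaging each statistic over all 10K sentences.
```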

Property                  Speech    Singing Voice
Pitch Range (semitone)    12.71     14.61
Pitch Smoothness          0.93      0.84
Duration Range (s)        0.44      2.40
Duration Variance         0.01      0.11

Table 1: The differences in acoustic properties between speech and singing voice.

Besides the differences in the characteristics mentioned above, singers sometimes add vibrato on long vowels or make artistic modifications to the pronunciation of some words to make them sound more melodious, though this results in a loss of intelligibility. Considering that some characteristics are hard to quantify, in this work we start with pitch and duration to build a prototype and propose PDAugment to augment ALT training data by adjusting the pitch and duration of speech at syllable level according to music scores.

3 Method

In this section, we introduce the details of our proposed PDAugment: a data augmentation method using music score-guided, syllable-level pitch and duration adjustments for automatic lyrics transcription. We first describe the overall pipeline of the ALT system and then introduce the design of each component of PDAugment.

Figure 1: The overall pipeline of the ALT system equipped with PDAugment.

3.1 Pipeline Overview

For automatic lyrics transcription, we follow the practice of existing automatic speech recognition systems (Watanabe et al. 2018) and choose a Conformer encoder (Gulati et al. 2020) and a Transformer decoder (Vaswani et al. 2017) as our basic model architecture. Different from standard ASR systems, we add a PDAugment module in front of the encoder, as shown in Figure 1, to apply syllable-level adjustments to the pitch and duration of the input natural speech according to the information of the aligned notes. When the input of the ALT system is singing voice, the PDAugment module is simply disabled. When the input is speech, the PDAugment module takes the note information extracted from music scores as extra input to adjust the pitch and duration of the speech, and the adjusted speech is added to the training data to enhance the ALT model. The loss function of the ALT model consists of the decoder loss L_dec and the CTC loss (on top of the encoder) L_ctc (Wang et al. 2020): L = (1 - λ) L_dec + λ L_ctc, where λ is a hyperparameter to trade off the two loss terms. Considering that lyrics may contain music-specific expressions which are rarely seen in natural speech, the probability distributions of lyrics and standard text are quite different. We therefore train the language model with in-domain lyrics data and then fuse it with the ALT model in the beam search of the decoding stage.
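As a minimal sketch of how the two loss terms can be combined (a PyTorch-style illustration with assumed tensor shapes, not the authors' actual implementation; padding and sos/eos handling are omitted, and label_smoothing requires PyTorch 1.10+):

```python
import torch.nn.functional as F

def alt_loss(decoder_logits, ctc_logits, targets, input_lens, target_lens, lam=0.3):
    """Joint loss L = (1 - lambda) * L_dec + lambda * L_ctc for ALT training.

    decoder_logits: (batch, target_len, vocab) attention-decoder outputs.
    ctc_logits:     (input_len, batch, vocab) encoder outputs for the CTC branch.
    targets:        (batch, target_len) token indices.
    """
    # Decoder cross-entropy loss with label smoothing (Section 4.3 uses weight 0.1).
    l_dec = F.cross_entropy(decoder_logits.transpose(1, 2), targets, label_smoothing=0.1)
    # CTC loss on top of the encoder.
    l_ctc = F.ctc_loss(ctc_logits.log_softmax(dim=-1), targets, input_lens, target_lens)
    return (1 - lam) * l_dec + lam * l_ctc
```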

We try to make the speech fit the patterns of singing voice naturally and to achieve the effect of “singing out” the speech with PDAugment, so we adjust the pitch and duration of speech at syllable level according to those of the corresponding notes in the music scores instead of applying random adjustments. To do so, the PDAugment module consists of three key components: 1) a speech-note aligner, which generates the syllable-level alignment that decides the corresponding note of each syllable for the subsequent adjusters; 2) a pitch adjuster, which adjusts the pitch of each syllable in speech according to that of the aligned notes; and 3) a duration adjuster, which adjusts the duration of each syllable in speech to be in line with the duration of the corresponding notes. We introduce each part in the following subsections.

3.2 Speech-Note Aligner

According to linguistic and musical knowledge, in singing voice a syllable can be viewed as the smallest textual unit corresponding to a note. PDAugment adjusts the pitch and duration of natural speech at syllable level under the guidance of note information obtained from the music scores. In order to apply the syllable-level adjustments, we propose the speech-note aligner, which aligns the speech and the notes (in the melody) at syllable level.

The textual content serves as the bridge for aligning the notes of the melody with the speech. Specifically, our speech-note aligner aligns the speech with the notes (in the melody) in the following steps:

1) In order to obtain the syllable-level alignment of text and speech, we first convert the text to phonemes with an open-source tool (https://github.com/bootphon/phonemizer) and then align the text with the speech audio at phoneme level with the Montreal Forced Aligner (MFA) (McAuliffe et al. 2017; https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner). Next, we group phonemes into syllables according to linguistic rules (Kearns 2020) and obtain the syllable-level alignment of text and speech.

2) For the syllable-to-note mappings, we set one syllable to correspond to one note by default, because in most cases of singing voice one syllable is aligned with one note. Only when the ratio of the time length of the syllable in speech to that of the note in the melody exceeds predefined thresholds (we set 0.5 as the lower bound and 2 as the upper bound in practice) do we generate one-to-many or many-to-one mappings, to prevent audio distortion after the adjustments (see the sketch after this list).

3) We aggregate the syllable-level alignment of text and speech with the syllable-to-note mappings to generate the syllable-level alignment of speech and notes (in the melody) as the input of the pitch and duration adjusters.
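The mapping rule of step 2 can be sketched as follows (a simplified, hypothetical helper under the stated 0.5/2 thresholds; the real aligner also resolves many-to-one groupings, which we only hint at here):

```python
LOWER, UPPER = 0.5, 2.0  # duration-ratio thresholds from step 2

def map_syllable_to_notes(syllable_dur, notes):
    """Return the list of (pitch, duration) notes assigned to one syllable.

    syllable_dur: duration of the syllable in speech (seconds).
    notes: remaining melody notes as (pitch, duration_seconds) tuples.
    """
    ratio = syllable_dur / notes[0][1]
    if LOWER <= ratio <= UPPER:
        return [notes[0]]                      # default one-to-one mapping
    if ratio > UPPER:
        # Syllable much longer than the note: accumulate notes (one-to-many)
        # until the ratio falls back under the upper bound.
        mapped, total = [], 0.0
        for note in notes:
            mapped.append(note)
            total += note[1]
            if syllable_dur / total <= UPPER:
                break
        return mapped
    # ratio < LOWER: the syllable is much shorter than the note, so several
    # consecutive syllables share this note (many-to-one); that grouping is
    # resolved over the whole sentence rather than per syllable.
    return [notes[0]]
```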

3.3 Pitch Adjuster

The pitch adjuster adjusts the pitch of the input speech at syllable level according to the aligned notes. Specifically, we use WORLD (Morise, Yokomori, and Ozawa 2016), a fast and high-quality vocoder-based speech synthesis system (we use the Python wrapper at https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder), to implement the adjustment. WORLD parameterizes speech into three components: fundamental frequency (F0), aperiodicity, and spectral envelope, and can reconstruct the speech from the estimated parameters alone. We use WORLD to estimate the three parameters of natural speech and only adjust the F0 contour according to that of the corresponding notes. Then we synthesize speech with the adjusted F0 and the original aperiodicity and spectral envelope. Figure 2 shows the F0 contours before and after the pitch adjuster.

Figure 2: The change of the F0 contour after the pitch adjuster. The content of this example is “opening his door”.

The pitch adjuster calculates the pitch difference between the speech and the notes using the syllable-level alignment and adjusts the F0 contour of the speech accordingly. Some details are as follows: 1) Considering that the quality of the synthesized speech drops sharply when the range of adjustment is too large, we need to keep it within a reasonable threshold. Specifically, we calculate the average pitch of the speech and of the corresponding melody respectively. When the average pitch of the speech differs too much from that of the corresponding melody (i.e., exceeding a threshold, which is 5 in our experiments), we shift the pitch of the entire note sequence to bring the difference within the threshold and use the shifted notes for adjustment; otherwise, we keep the pitch of the original notes unchanged. 2) To maintain smooth transitions in the synthesized speech and prevent the speech from being interrupted, we perform pitch interpolation for the frames between two syllables. 3) When a syllable is mapped to multiple notes, we segment the speech of this syllable in proportion to the durations of the notes and adjust the pitch of each segment according to the corresponding note.
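A condensed sketch of this procedure with the pyworld wrapper mentioned above (the helper and its inputs are our own illustration: syllable frame boundaries and note pitches are assumed to come from the speech-note aligner, and the inter-syllable pitch interpolation of detail 2 is omitted):

```python
import numpy as np
import pyworld as pw

def adjust_pitch(wav, sr, syllable_frames, note_pitches, max_shift=5):
    """Replace the F0 of each syllable with the pitch of its aligned note.

    wav: float64 waveform; syllable_frames: list of (start, end) F0-frame indices;
    note_pitches: MIDI note number aligned to each syllable.
    """
    f0, t = pw.harvest(wav, sr)                 # F0 contour
    sp = pw.cheaptrick(wav, f0, t, sr)          # spectral envelope
    ap = pw.d4c(wav, f0, t, sr)                 # aperiodicity

    # Detail 1: shift the whole note sequence if its average pitch is too far
    # from the average pitch of the speech (threshold of 5 semitones assumed).
    voiced = f0 > 0
    speech_avg = 69 + 12 * np.log2(f0[voiced].mean() / 440.0)   # Hz -> MIDI
    diff = float(np.mean(note_pitches)) - speech_avg
    shift = (np.clip(diff, -max_shift, max_shift) - diff) if abs(diff) > max_shift else 0.0
    notes = np.asarray(note_pitches, dtype=float) + shift

    new_f0 = f0.copy()
    for (start, end), note in zip(syllable_frames, notes):
        seg = slice(start, end)
        new_f0[seg][f0[seg] > 0] = 440.0 * 2 ** ((note - 69) / 12)  # MIDI -> Hz

    # Resynthesize with the adjusted F0 and the original envelope/aperiodicity.
    return pw.synthesize(new_f0, sp, ap, sr)
```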

3.4 Duration Adjuster

The duration adjuster changes the duration of the input speech to align with the duration of the corresponding notes. As shown in Figure 3, instead of scaling the whole syllable, we only scale the length of the vowels and keep the length of the consonants unchanged, because the duration of consonants in singing voice is not significantly longer than that in speech, while long vowels are common in singing voice (Kruspe and Fraunhofer 2015). There are one-to-many and many-to-one mappings in the syllable-level alignment. When multiple syllables are mapped to one note, we calculate the total length of the syllables and adjust the lengths of all vowels in these syllables in proportion. When multiple notes are mapped to one syllable, we adjust the length of the vowel so that the total length of this syllable equals the total length of these notes.
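A small sketch of this vowel-only scaling rule (a hypothetical helper that only computes the per-phoneme target durations; stretching the waveform itself is a separate step):

```python
def scale_syllable(phonemes, target_dur):
    """Assign new phoneme durations so the syllable matches its aligned note(s).

    phonemes: list of (name, duration_seconds, is_vowel) for one syllable.
    target_dur: total duration (seconds) of the aligned note(s).
    Consonant durations are kept; only vowels are stretched or compressed.
    """
    consonant_dur = sum(d for _, d, v in phonemes if not v)
    vowel_dur = sum(d for _, d, v in phonemes if v)
    scale = (target_dur - consonant_dur) / vowel_dur if vowel_dur > 0 else 1.0
    return [(name, d * scale if v else d, v) for name, d, v in phonemes]

# Example: "his" = h ih z aligned to a note lasting 0.8 s; only "ih" is stretched
# so that 0.05 + new_ih + 0.06 == 0.8.
# scale_syllable([("h", 0.05, False), ("ih", 0.10, True), ("z", 0.06, False)], 0.8)
```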

Figure 3: The change of duration after the duration adjuster. The content of this example is “his door”. The left block shows the case of lengthening the speech and the right block shows the case of shortening it.

4 Experimental Settings

In this section, we describe the experimental settings, including the datasets, model configuration, and the details of training, inference, and evaluation.

4.1 Datasets

Singing Voice Datasets. We conduct experiments on two singing voice datasets to verify the effectiveness of our PDAugment: the DSing30 dataset (Dabike and Barker 2019) and the Dali corpus (Meseguer-Brocal, Cohen-Hadria, and Peeters 2019). The DSing30 dataset consists of about 4K monophonic karaoke recordings of English pop songs with nearly 80K utterances, performed by 3,205 singers. We use the partition provided by Dabike and Barker (2019) to allow a fair comparison with them. The Dali corpus is another large dataset of synchronized audio, lyrics, and notes. It consists of 1,200 English polyphonic songs with a total duration of 70 hours. Following Basak et al. (2021), we use the sentence-level annotation of the dataset provided by Meseguer-Brocal, Cohen-Hadria, and Peeters (2019) (https://github.com/gabolsgabs/DALI) and divide the dataset into training, development, and test sets with a proportion of 8:1:1, with no singer overlapping across partitions. We convert all the singing voice waveforms in our experiments into mel-spectrograms following Zhang et al. (2020b), with a frame size of 25 ms and a hop size of 10 ms.
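A minimal sketch of this feature extraction, assuming 16 kHz audio and librosa (the number of mel bins and the log compression are our assumptions; the paper only specifies the frame and hop sizes):

```python
import librosa
import numpy as np

def wav_to_mel(path, sr=16000, n_mels=80):
    """Load audio and compute a log-mel spectrogram with 25 ms frames and a 10 ms hop."""
    wav, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav,
        sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms frame size -> 400 samples at 16 kHz
        hop_length=int(0.010 * sr),  # 10 ms hop size -> 160 samples
        n_mels=n_mels,
    )
    return np.log(mel + 1e-6)        # log compression with a small offset for stability
```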

Natural Speech Dataset. Following the common practice in previous ASR work (Amodei et al. 2016; Gulati et al. 2020; Zhang et al. 2020c), we choose the widely used LibriSpeech (Panayotov et al. 2015) corpus as the natural speech dataset in our experiments. The LibriSpeech corpus contains 960 hours of speech sampled at 16 kHz from 1,129 female speakers and 1,210 male speakers. We use the official training partition described in Panayotov et al. (2015). As with the singing voice, we convert the speech into mel-spectrograms with the same settings.

Music Score Dataset. In this work, we choose the FreeMidi dataset (https://freemidi.org) and only use its pop songs (https://freemidi.org/genre-pop), because almost all of the songs in our singing voice datasets are pop music. The pop music subset of FreeMidi has about 4,000 MIDI files, which are used to provide note information for the PDAugment module. We use miditoolkit (https://github.com/YatingMusic/miditoolkit) to extract the note information from the MIDI files and then feed it into the PDAugment module along with the natural speech.
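A sketch of this extraction step with miditoolkit (the drum-track filtering and the tick-to-second conversion shown here are our assumptions about a reasonable setup, not the authors' exact code):

```python
from miditoolkit.midi.parser import MidiFile

def extract_notes(midi_path):
    """Read a MIDI file and return (pitch, onset_seconds, duration_seconds) tuples."""
    midi = MidiFile(midi_path)
    tick2sec = midi.get_tick_to_time_mapping()   # maps absolute tick -> seconds
    notes = []
    for inst in midi.instruments:
        if inst.is_drum:                          # skip percussion tracks
            continue
        for note in inst.notes:
            onset = tick2sec[note.start]
            duration = tick2sec[note.end] - onset
            notes.append((note.pitch, onset, duration))
    notes.sort(key=lambda n: n[1])                # sort by onset time
    return notes
```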

Lyrics Corpus. In order to construct a language model with more in-domain knowledge, we collected a large amount of lyrics data to build our lyrics corpus for language model training. Besides the training text of the DSing30 dataset and the Dali corpus, we collect lyrics of English pop songs from the Web. We crawl about 46M lines of lyrics and obtain nearly 17M sentences after removing duplicates. We attach a subset of the collected lyrics corpus in the supplementary material.

4.2 Model Configuration

ALT Model. We choose a Conformer encoder (Gulati et al. 2020) and a Transformer decoder (Vaswani et al. 2017) as the basic model architecture in our experiments, since the effectiveness of this structure has been proven in ASR. We stack N = 12 Conformer blocks in the encoder and N = 6 Transformer blocks in the decoder. The hidden size of both the Conformer blocks and the Transformer blocks is set to 512, and the filter size of the feed-forward layer is set to 2048. The number of attention heads is set to 8 and the dropout rate is set to 0.1.

Language Model. Our language model is based on a Transformer encoder with 16 layers, 8 attention heads, a filter size of 2048, and an embedding size of 128. The language model is pre-trained separately and then integrated with the ALT model in the decoding stage.



4.3 Training Details

During ALT training, after the PDAugment module, we apply SpecAugment (Park et al. 2019) and speed perturbation. We use SpecAugment with a frequency mask of 0 to 30 bins, a time mask of 0 to 40 frames, and a time warp window of 5. The speed perturbation factors are set to 0.9 and 1.1. We train the ALT model for 35 epochs on 2 GeForce RTX 3090 GPUs with a batch size of 175K frames. Following Zhang et al. (2020a), we use the Adam optimizer (Kingma and Ba 2014) and set β1, β2, and ε to 0.9, 0.98, and 10^-9 respectively. We apply label smoothing with weight 0.1 when calculating L_dec and set the λ described in Section 3.1 to 0.3. We train the language model for 25 epochs on 2 GeForce RTX 3090 GPUs with the Adam optimizer.
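The masking part of this recipe can be sketched with torchaudio transforms (time warping and speed perturbation are omitted here; the mask parameters mirror the numbers above):

```python
import torch
import torchaudio.transforms as T

# Frequency mask of up to 30 mel bins and time mask of up to 40 frames, as in Section 4.3;
# the time warp (window of 5) is not covered by these two transforms.
freq_mask = T.FrequencyMasking(freq_mask_param=30)
time_mask = T.TimeMasking(time_mask_param=40)

def spec_augment(mel):
    """Apply SpecAugment-style masking to a (channel, n_mels, frames) spectrogram."""
    return time_mask(freq_mask(mel))

mel = torch.randn(1, 80, 500)   # dummy log-mel spectrogram for illustration
augmented = spec_augment(mel)
```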

The code for the basic model architecture is implemented based on the ESPnet toolkit (Watanabe et al. 2018; https://github.com/espnet/espnet). We attach the code of the PDAugment module in the supplementary materials.

4.4 Inference and Evaluation

During inference, we fuse the ALT model with the language model pre-trained on the lyrics corpus. Following previous ALT work (Demirel, Ahlback, and Dixon 2020; Basak et al. 2021), we use the word error rate (WER) as the metric for evaluating the accuracy of lyrics transcription.

5 Results and Analyses

In this section, we first report the main experimental results, then conduct ablation studies to verify the effectiveness of each component of PDAugment, and finally analyze the adjusted speech with statistics and visualization.

5.1 Main Results

In this subsection, we report the experimental results of the ALT system equipped with PDAugment on the two singing voice datasets. We compare our results with several basic settings as baselines: 1) Naive ALT, the ALT model trained with only the singing voice dataset; 2) ASR Augmented, the ALT model trained with the combination of the singing voice dataset and ASR data directly; 3) Random Aug (Kruspe and Fraunhofer 2015), the ALT model trained with the combination of the singing voice dataset and randomly adjusted ASR data, where the pitch is adjusted by -6 to 6 semitones at random and the duration ratio of speech before and after adjustment is randomly selected from 0.5 to 1.2. All of 1), 2), and 3) use the same model architecture as PDAugment. Besides the above three baselines, we compare our PDAugment with the previous systems which reported the best results on each of the two datasets. For the DSing30 dataset, we compare our PDAugment with Demirel, Ahlback, and Dixon (2020) using an RNNLM (https://github.com/emirdemirel/ALTA); the results are shown in Table 2. For the Dali corpus, we compare with Basak et al. (2021) and report the results in Table 3.
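For reference, the adjustments of the Random Aug baseline can be sketched roughly as follows (assuming librosa for pitch shifting and time stretching, and assuming the duration ratio means duration-after over duration-before; the original implementation may differ):

```python
import random
import librosa

def random_aug(wav, sr=16000):
    """Randomly "songify" speech: pitch shift of -6..6 semitones and a
    duration ratio of 0.5..1.2, as in the Random Aug baseline."""
    n_steps = random.uniform(-6, 6)    # pitch shift in semitones
    ratio = random.uniform(0.5, 1.2)   # assumed ratio = duration(after) / duration(before)
    wav = librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)
    # librosa's rate is a speed factor, i.e. the inverse of the duration ratio.
    return librosa.effects.time_stretch(wav, rate=1.0 / ratio)
```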

Method                                      Dev     Test
Naive ALT                                   28.2    27.4
ASR Augmented                               20.8    20.5
Random Aug (Kruspe and Fraunhofer 2015)     17.9    17.6
Demirel, Ahlback, and Dixon (2020)          17.7    15.7
PDAugment                                   10.1     9.8

Table 2: The WERs (%) on the DSing30 dataset.

Method                                      Dev     Test
Naive ALT                                   80.9    86.3
ASR Augmented                               75.5    75.7
Random Aug (Kruspe and Fraunhofer 2015)     69.8    72.1
Basak et al. (2021)                         75.2    78.9
PDAugment                                   53.4    54.0

Table 3: The WERs (%) on the Dali corpus.

As can be seen, Naive ALT performs poorly and yields high WERs on both the DSing30 dataset and the Dali corpus, which demonstrates the difficulty of the ALT task. After adding ASR data for ALT training, the performance of the ASR Augmented setting on both datasets improves slightly compared to Naive ALT, but the WERs remain relatively high, which indicates the limitation of using ASR training data directly.

When applying the adjustments, a natural question is: if we adjust the pitch and duration over random ranges without note information from music scores, how well does the ALT system perform? The results of Random Aug answer this question. As the results show, Random Aug slightly improves the performance compared with ASR Augmented, demonstrating that increasing the variability of pitch and duration in natural speech helps ALT training, as Kruspe and Fraunhofer (2015) also claimed. PDAugment is significantly better than Random Aug, which indicates that the adjusted speech helps ALT training more when guided by music scores.

Besides, PDAugment greatly outperforms the previous state of the art on both datasets. Demirel, Ahlback, and Dixon (2020) in Table 2 performs worse than PDAugment because it does not take advantage of the massive ASR training data. Compared with Basak et al. (2021) in Table 3, which replaced the F0 contours of speech directly, PDAugment narrows the gap between natural speech and singing voice in a more reasonable manner and achieves the lowest WERs among all the above methods. The results on both datasets show the effectiveness of PDAugment for the ALT task and reflect the benefit of adding music-specific acoustic characteristics to natural speech.

5.2 Ablation Studies

We conduct further experimental analyses to explore our PDAugment in depth and verify the necessity of several design details. More ablation studies (different language models) can be found in the supplementary materials. The ablation studies in this section are carried out on the DSing30 dataset.


(a) Original speech sample. (b) After Pitch Adjuster. (c) After Duration Adjuster. (d) After PDAugment.

Figure 4: Spectrograms of a speech example after the pitch adjuster and/or the duration adjuster.

Augmentation Types. In this subsection, we explore the effects of different kinds of augmentation (adjusting only pitch, only duration, and both pitch and duration) on the performance of the ALT system. We generate three types of adjusted speech by enabling different adjusters of the PDAugment module and conduct the experiments on DSing30. The results are shown in Table 4.

Setting                          DSing30 Dev    DSing30 Test
PDAugment                        10.1           9.8
- Pitch Adjuster                 13.6           13.4
- Duration Adjuster              13.8           13.8
- Pitch & Duration Adjusters     20.8           20.5

Table 4: The WERs (%) of different types of augmentation on the DSing30 dataset. All settings are trained on DSing30 plus the original or adjusted LibriSpeech data.

As we can see, when we enable the whole PDAugment module, the ALT system achieves the best performance, indicating the effectiveness of the pitch and duration adjusters. When we disable the pitch adjuster, the WER on DSing30 is 3.6% higher than with PDAugment. Similarly, when we disable the duration adjuster, the WER is 4.0% higher than with PDAugment. And if neither pitch nor duration is adjusted, which means using the speech data directly for ALT training, the WER is the worst among all settings. The results demonstrate that both the pitch and duration adjusters are necessary and help improve the recognition accuracy of the ALT system.

5.3 Adjusted Speech Analyses

Statistics of Adjusted Speech. Following Section 2.2, we analyze the acoustic properties of the original natural speech and the speech adjusted by PDAugment, and list the results in Table 5.

Property                  Original Speech    Adjusted Speech
Pitch Range (semitone)    12.71              14.19
Pitch Smoothness          0.93               0.69
Duration Range (s)        0.44               0.59
Duration Variance         0.01               0.05

Table 5: The differences in acoustic properties between original speech and adjusted speech.

Combining the information in Table 1 and Table 5, we can clearly see that the distribution of the acoustic properties (pitch and duration) after PDAugment is closer to singing voice than that of the original speech, which indicates that our PDAugment changes the pitch and duration patterns of the original speech and effectively narrows the gap between natural speech and singing voice. To avoid distortion of the adjusted speech, we limit the adjustment degree to a reasonable range, so the statistics of the adjusted speech cannot completely match those of singing voice. Nonetheless, the adjusted speech is still good enough for the ALT model to capture some music-specific characteristics.

Visualization of Adjusted Speech. In order to visually demonstrate the effect of the PDAugment module, we plot the spectrograms of speech to compare the acoustic characteristics before and after the different types of adjustments.

An example of our PDAugment output is shown in Figure 4. In detail, the spectrogram of the original natural speech sample is shown in Figure 4a, and the spectrograms of speech with only pitch adjusted, only duration adjusted, and both pitch and duration adjusted are shown in Figures 4b, 4c, and 4d respectively. It is clear that the adjusted speech has different acoustic properties from the natural speech. More example spectrograms and audio samples can be found in the supplementary material.

6 Conclusion

In this paper, we proposed PDAugment, a data augmentation method that adjusts pitch and duration to make better use of natural speech for ALT training. PDAugment transfers natural speech into the singing voice domain by adjusting pitch and duration at syllable level under the guidance of music scores. The PDAugment module consists of a speech-note aligner to align the speech with the notes, and two adjusters to adjust pitch and duration respectively. Experiments on two singing voice datasets show that PDAugment significantly reduces the WERs of the ALT task. Our analyses of different types of augmentation further verify the effectiveness of PDAugment. In the future, we will consider narrowing the gap between natural speech and singing voice in more aspects, such as vibrato, and will try to add music-specific constraints in the decoding stage.


References

Ali, S. O.; and Peynircioglu, Z. F. 2006. Songs and emotions: are lyrics and melodies equal partners? Psychology of Music, 34(4): 511–534.

Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. 2016. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, 173–182. PMLR.

Basak, S.; Agarwal, S.; Ganapathy, S.; and Takahashi, N. 2021. End-to-end Lyrics Recognition with Voice to Singing Style Transfer. arXiv preprint arXiv:2102.08575.

Chan, W.; Jaitly, N.; Le, Q.; and Vinyals, O. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4960–4964. IEEE.

Dabike, G. R.; and Barker, J. 2019. Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System. In INTERSPEECH, 579–583.

Demirel, E.; Ahlback, S.; and Dixon, S. 2020. Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention. In 2020 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE.

Fujihara, H.; Goto, M.; Ogata, J.; Komatani, K.; Ogata, T.; and Okuno, H. G. 2006. Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals. In Eighth IEEE International Symposium on Multimedia (ISM'06), 257–264. IEEE.

Graves, A. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

Graves, A.; Fernandez, S.; Gomez, F.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, 369–376.

Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv preprint arXiv:2005.08100.

Gupta, C.; Li, H.; and Wang, Y. 2018. Automatic Pronunciation Evaluation of Singing. In Interspeech, 1507–1511.

Gupta, C.; Yılmaz, E.; and Li, H. 2020. Automatic Lyrics Alignment and Transcription in Polyphonic Music: Does Background Music Help? In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 496–500. IEEE.

Hosoya, T.; Suzuki, M.; Ito, A.; Makino, S.; Smith, L. A.; Bainbridge, D.; and Witten, I. H. 2005. Lyrics Recognition from a Singing Voice Based on Finite State Automaton for Music Information Retrieval. In ISMIR, 532–535.

Kearns, D. M. 2020. Does English Have Useful Syllable Division Patterns? Reading Research Quarterly, 55: S145–S160.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kruspe, A. M.; and Fraunhofer, I. 2014. Keyword Spotting in A-capella Singing. In ISMIR, volume 14, 271–276.

Kruspe, A. M.; and Fraunhofer, I. 2015. Training Phoneme Models for Singing with "Songified" Speech Data. In ISMIR, 336–342.

Kruspe, A. M.; and Fraunhofer, I. 2016. Bootstrapping a System for Phoneme Recognition and Keyword Spotting in Unaccompanied Singing. In ISMIR, 358–364.

Li, B.; Chang, S.-y.; Sainath, T. N.; Pang, R.; He, Y.; Strohman, T.; and Wu, Y. 2020. Towards fast and accurate streaming end-to-end ASR. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6069–6073. IEEE.

Loscos, A.; Cano, P.; and Bonada, J. 1999. Low-delay singing voice alignment to text. In ICMC, volume 11, 27–61.

McAuliffe, M.; Socolof, M.; Mihuc, S.; Wagner, M.; and Sonderegger, M. 2017. Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi. In Interspeech, volume 2017, 498–502.

Mesaros, A. 2013. Singing voice identification and lyrics transcription for music information retrieval (invited paper). In 2013 7th Conference on Speech Technology and Human-Computer Dialogue (SpeD), 1–10. IEEE.

Mesaros, A.; and Virtanen, T. 2008. Automatic alignment of music audio and lyrics. In Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08).

Mesaros, A.; and Virtanen, T. 2009. Adaptation of a speech recognizer for singing voice. In 2009 17th European Signal Processing Conference, 1779–1783. IEEE.

Meseguer-Brocal, G.; Cohen-Hadria, A.; and Peeters, G. 2019. DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm. arXiv preprint arXiv:1906.10606.

Morise, M.; Yokomori, F.; and Ozawa, K. 2016. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7): 1877–1884.

Panayotov, V.; Chen, G.; Povey, D.; and Khudanpur, S. 2015. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210. IEEE.

Park, D. S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E. D.; and Le, Q. V. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Tsai, C.-P.; Tuan, Y.-L.; and Lee, L.-s. 2018. Transcribing lyrics from commercial song audio: the first step towards singing content processing. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5749–5753. IEEE.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, C.; Wu, Y.; Du, Y.; Li, J.; Liu, S.; Lu, L.; Ren, S.; Ye, G.; Zhao, S.; and Zhou, M. 2020. Semantic Mask for Transformer Based End-to-End Speech Recognition. Proc. Interspeech 2020, 971–975.

Watanabe, S.; Hori, T.; Karita, S.; Hayashi, T.; Nishitoba, J.; Unno, Y.; Soplin, N. E. Y.; Heymann, J.; Wiesner, M.; Chen, N.; et al. 2018. ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.

Xu, J.; Tan, X.; Ren, Y.; Qin, T.; Li, J.; Zhao, S.; and Liu, T.-Y. 2020. LRSpeech: Extremely low-resource speech synthesis and recognition. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2802–2812.

Zhang, C.; Ren, Y.; Tan, X.; Liu, J.; Zhang, K.; Qin, T.; Zhao, S.; and Liu, T.-Y. 2020a. Denoising Text to Speech with Frame-Level Noise Modeling. arXiv preprint arXiv:2012.09547.

Zhang, C.; Tan, X.; Ren, Y.; Qin, T.; Zhang, K.; and Liu, T.-Y. 2020b. UWSpeech: Speech to Speech Translation for Unwritten Languages. arXiv preprint arXiv:2006.07926.

Zhang, Q.; Lu, H.; Sak, H.; Tripathi, A.; McDermott, E.; Koo, S.; and Kumar, S. 2020c. Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7829–7833. IEEE.