112/04/21 1
Query-by-Singing/Humming: An Overview
J.-S. Roger Jang ( 張智星 )
Multimedia Information Retrieval Lab
CS Dept., Tsing Hua Univ., Taiwan
http://mirlab.org/jang
-2-
Outline
Introduction
Methods for QBSH
  Pitch Tracking
  Database Comparison
Demos and Commercial Applications
Conclusions
-3-
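Classification of Music Information Retrieval (MIR)
Metadata-based
  Example: song title, singer, tags, lyricist, composer
  Query input: text or speech
Content-based
  Example: melody, chord, note onsets, moods…
  Query input:
    Symbolic: notes, chords, text
    Acoustic: singing/humming, whistling, tapping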
-4-
Acoustic Inputs for MIR
Singing/humming
  Query by humming (usually "ta" or "da")
  Query by singing
Whistling
  Query by whistling
Tapping
  Query by tapping (at the onsets of notes)
Speech
  Query by the user's speech input (for metadata)
Original-recording example
  Query by recordings made with mobile phones
Beatboxing
-5-
Introduction to QBSH
QBSH: Query by Singing/Humming
  Input: Singing or humming from a microphone
  Output: A ranked list retrieved from the song database
Progression
  First paper: Around 1994
  Extensive studies since 2001
  State of the art: QBSH tasks at ISMIR/MIREX
-6-
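QBSH Workflow
Preprocessing:
  Collect single-track reference melodies (usually MIDI files)
  Convert them into an intermediate format suitable for comparison
Online processing:
  Convert the user's audio input into a pitch vector
  Convert the pitch vector into notes (optional)
  Compare against the reference melodies
  List the ranking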
-7-
Flowchart of QBSH
[Flowchart]
On-line processing: Microphone input -> Filtering -> Pitch tracking -> Pitch vector smoothing -> Similarity comparison -> Query results (ranked song list)
Off-line processing: MIDI files -> Melody track extraction -> Frame-based representation -> Similarity comparison
-8-
Pitch Tracking for QBSH
Two categories of pitch tracking algorithms
Time domain
  ACF (Autocorrelation function)
  AMDF (Average magnitude difference function)
  SIFT (Simplified inverse filter tracking)
Frequency domain
  Harmonic product spectrum method
  Cepstrum method
-9-
Frame Blocking for Pitch Tracking
Frame size = 256 points
Overlap = 84 points
Frame rate = 11025/(256-84) ≈ 64 pitch values/sec
[Figure: waveform (samples 0 to 2500, amplitude -0.4 to 0.3) with a zoomed-in view (samples 0 to 300) marking one frame and its overlap with the next frame]
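The frame-blocking arithmetic above can be sketched in a few lines of Python. This is only an illustration under the slide's numbers (256-point frames, 84-point overlap, 11025 Hz sample rate); the function name `frame_blocking` and the choice to drop a trailing partial frame are assumptions, not part of the original material.

```python
def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a list of samples into overlapping frames (partial tail dropped)."""
    hop = frame_size - overlap  # 172 samples between consecutive frame starts
    frames = []
    start = 0
    while start + frame_size <= len(signal):
        frames.append(signal[start:start + frame_size])
        start += hop
    return frames


sample_rate = 11025
signal = [0.0] * sample_rate            # one second of dummy samples
frames = frame_blocking(signal)
frame_rate = sample_rate / (256 - 84)   # roughly 64 pitch values per second
```

Each frame later yields one pitch value, so the hop size (frame size minus overlap) sets the time resolution of the pitch vector.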
-10-
ACF: Auto-correlation Function
[Figure: a frame s(i) and its shifted copy s(i+τ) for τ = 30, with the pitch period marked]
acf(30) = inner product of the overlapping part
$$\mathrm{acf}(\tau) = \sum_{i=0}^{n-1-\tau} s(i)\, s(i+\tau)$$
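A minimal pure-Python sketch of ACF-based pitch estimation for one frame follows. The inner product of the overlapping part is taken from the slide; the lag search range of 8 to 160 samples is borrowed from the UPDUDP slides and is an assumption here, as are the function names.

```python
import math


def acf(frame, max_lag):
    """acf[tau] = sum over the overlapping part of s(i) * s(i + tau)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + tau] for i in range(n - tau))
            for tau in range(max_lag + 1)]


def pitch_by_acf(frame, sample_rate, min_lag=8, max_lag=160):
    """Return the pitch in Hz from the lag with the largest ACF value."""
    values = acf(frame, max_lag)
    # Lags below min_lag are skipped so the trivial peak at tau = 0 is ignored.
    best_lag = max(range(min_lag, max_lag + 1), key=lambda tau: values[tau])
    return sample_rate / best_lag


# Synthetic frame: a sine wave with a period of 50 samples.
frame = [math.sin(2 * math.pi * i / 50) for i in range(350)]
pitch = pitch_by_acf(frame, sample_rate=11025)
```

Because the overlap shrinks as the lag grows, this un-normalized ACF favors shorter lags slightly; real recordings usually need additional smoothing of the resulting pitch vector.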
-11-
Pitch Tracking via ACF
Specs
  Sample rate = 11025 Hz
  Frame size = 32 ms
  Overlap = 0
  Frame rate = 31.25 frames/sec
Playback
  soo.wav
  sooPitch.wav
-12-
AMDF: Average Magnitude Difference Function
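[Figure: a frame s(i) and its shifted copy s(i+τ) for τ = 30, with the pitch period marked]
amdf(30) = sum of absolute differences over the overlapping part
$$\mathrm{amdf}(\tau) = \sum_{i=0}^{n-1-\tau} |s(i) - s(i+\tau)|$$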
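The AMDF counterpart can be sketched the same way, except that the pitch period corresponds to a minimum rather than a maximum. The lag range of 8 to 160 samples is again an assumption taken from the UPDUDP slides, and the function names are illustrative.

```python
def amdf(frame, max_lag):
    """amdf[tau] = sum over the overlapping part of |s(i) - s(i + tau)|."""
    n = len(frame)
    return [sum(abs(frame[i] - frame[i + tau]) for i in range(n - tau))
            for tau in range(max_lag + 1)]


def pitch_by_amdf(frame, sample_rate, min_lag=8, max_lag=160):
    """Return the pitch in Hz from the lag with the smallest AMDF value."""
    values = amdf(frame, max_lag)
    # min() keeps the first (smallest) lag among exact ties, so multiples of
    # the true period do not win when the signal is perfectly periodic.
    best_lag = min(range(min_lag, max_lag + 1), key=lambda tau: values[tau])
    return sample_rate / best_lag


# Synthetic frame: an integer sawtooth with a period of 50 samples.
frame = [i % 50 for i in range(350)]
pitch = pitch_by_amdf(frame, sample_rate=11025)
```

Picking the minimum independently per frame is what produces broken pitch curves on noisy input, which is the problem that UPDUDP addresses.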
-13-
UPDUDP (1/4)
UPDUDP: Unbroken Pitch Determination Using DP
Goal: To take pitch smoothness into consideration
$$\mathrm{cost}(\mathbf{p}, \theta, m) = \sum_{i=1}^{n} \mathrm{amdf}_i(p_i) + \theta \sum_{i=1}^{n-1} |p_i - p_{i+1}|^m$$
  $\mathbf{p} = [p_1, \ldots, p_i, \ldots, p_n]$: a given path in the AMDF matrix
  $n$: Number of frames
  $\theta$: Transition penalty
  $m$: Exponent of the transition difference
-14-
UPDUDP (2/4)
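Optimum-value function D(i, j): the minimum cost starting from frame 1 to position (i, j)
Recurrence formula (with exponent m = 2):
$$D(i, j) = \mathrm{amdf}_i(j) + \min_{k \in [8, 160]} \left\{ D(i-1, k) + \theta\, |k - j|^2 \right\}, \quad i \in [2, n],\ j \in [8, 160]$$
Initial conditions:
$$D(1, j) = \mathrm{amdf}_1(j), \quad j \in [8, 160]$$
Optimum cost:
$$\min_{j \in [8, 160]} D(n, j)$$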
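The dynamic-programming recurrence for D(i, j) can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the function name `updudp`, the zero-based indexing, and the backtracking table are assumptions. Column j of the input matrix stands for lag `lag_min + j`.

```python
def updudp(amdf_matrix, theta, lag_min=8, m=2):
    """Find the lag path minimizing AMDF cost plus theta * |jump| ** m.

    amdf_matrix[i][j] is the AMDF of frame i at lag (lag_min + j).
    Returns (optimum cost, list of lags, one per frame).
    """
    n = len(amdf_matrix)
    num_lags = len(amdf_matrix[0])
    D = [list(amdf_matrix[0])]           # initial condition: D(1, j) = amdf_1(j)
    back = [[0] * num_lags]              # best predecessor for backtracking
    for i in range(1, n):
        row, back_row = [], []
        for j in range(num_lags):
            k_best = min(range(num_lags),
                         key=lambda k: D[i - 1][k] + theta * abs(k - j) ** m)
            row.append(amdf_matrix[i][j] + D[i - 1][k_best]
                       + theta * abs(k_best - j) ** m)
            back_row.append(k_best)
        D.append(row)
        back.append(back_row)
    # Optimum cost is the minimum over the last frame; then trace the path back.
    j = min(range(num_lags), key=lambda col: D[n - 1][col])
    best_cost = D[n - 1][j]
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return best_cost, [lag_min + col for col in path]


toy = [[0, 5, 5], [5, 5, 0], [0, 5, 5]]       # 3 frames, lags 8, 9, 10
cost_free, lags_free = updudp(toy, theta=0)    # no smoothness penalty
cost_smooth, lags_smooth = updudp(toy, theta=10)
```

With theta = 0 the path simply follows each frame's minimum; a positive theta trades some AMDF cost for a smoother lag contour.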
-15-
UPDUDP (3/4)
A typical example of UPDUDP using AMDF
-16-
UPDUDP (4/4)
Insensitivity to θ
[Figure, top panel: waveform (amplitude ×10^4) over 0 to 2 seconds, labeled with the syllables xi, lu, chan, sheng, chang]
[Figure, bottom panel: pitch (semitones, 20 to 80) versus time (seconds) for the same syllables, with curves for θ = 0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 18000, 20000]
-17-
Frequency to Semitone Conversion
Semitone: A musical scale based on A440
$$\mathrm{semitone} = 12 \log_2\!\left(\frac{\mathrm{freq}}{440}\right) + 69$$
Reasonable pitch range: E2 - C6 (82 Hz - 1047 Hz)
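The conversion is a one-liner in code. The sketch below uses only the formula on this slide; the inverse function is added for convenience and both function names are illustrative.

```python
import math


def freq_to_semitone(freq):
    """Convert a frequency in Hz to a semitone number (A440 maps to 69)."""
    return 12 * math.log2(freq / 440.0) + 69


def semitone_to_freq(semitone):
    """Inverse conversion, from semitone number back to Hz."""
    return 440.0 * 2 ** ((semitone - 69) / 12)


low = freq_to_semitone(82.0)      # lower end of the range on the slide (E2)
high = freq_to_semitone(1047.0)   # upper end of the range on the slide (C6)
```

Under this formula the stated range of 82 Hz to 1047 Hz corresponds to roughly semitones 40 to 84, so comparing pitch vectors in semitones makes a key transposition a simple additive shift.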
-18-
Vectors after Pitch Tracking
[Figure: two panels, pitch vector with rests and pitch vector without rests]
-19-
Typical Result of Pitch Tracking
Pitch tracking via autocorrelation for a recording of 茉莉花 (Jasmine Flower)
-20-
Comparison of Pitch Vectors
Yellow line: Target pitch vector
-21-
Demo of Pitch Tracking
Real-time display of ACF for pitch tracking
  toolbox/sap/goPtByAcf.mdl
Real-time pitch tracking for mic input
  toolbox/sap/goPtByAcf2.mdl
Pitch scaling
  pitchShiftDemo/project1.exe
  pitchShift-multirate/multirate.m
-22-
Comparison Methods of QBSH
Categories of approaches to QBSH
  Histogram/statistics-based
  Note vs. note
    Edit distance
  Frame vs. note
    HMM
  Frame vs. frame
    Linear scaling, DTW, recursive alignment
-23-
Range Comparison
Concept
  Reject a song if the range does not match: range(q) vs. range(c)
Characteristics
  Extremely fast
  Not effective
  Good for initial filtering
-24-
Linear Scaling (LS)
Concept Scale the query linearly to match the candidates
Example:
-25-
Linear Scaling (II)
Strength
  One-shot for dealing with key transposition
  Efficient and effective
  Indexing methods available
Weakness
  Cannot deal with non-uniform tempo variations
[Figure: typical mapping path]
-26-
Linear Scaling (III)
Distance function for LS
  Normalized L1-norm
  Normalized L2-norm
Rest handling
  Extend the previous non-zero note
[Figure: alignment example]
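Linear scaling can be sketched as: fill rests, stretch or compress the query by several factors, shift it to the same mean as the reference segment, and keep the smallest normalized L1 distance. Everything below is a hedged illustration; the set of scaling factors, the linear-interpolation resampler, the use of division by length as the normalization, and all function names are assumptions rather than details from the slides.

```python
def fill_rests(pitch):
    """Replace each rest (0) with the previous non-zero pitch."""
    filled, last = [], 0
    for p in pitch:
        if p != 0:
            last = p
        filled.append(last)
    return filled


def resample(vec, new_len):
    """Linearly interpolate vec to new_len points (new_len >= 2)."""
    out = []
    for k in range(new_len):
        pos = k * (len(vec) - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append(vec[lo] * (1 - frac) + vec[hi] * frac)
    return out


def linear_scaling_distance(query, reference,
                            factors=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Smallest mean-shifted, length-normalized L1 distance over all factors."""
    best = float("inf")
    for factor in factors:
        length = int(round(len(query) * factor))
        if length < 2 or length > len(reference):
            continue
        scaled = resample(query, length)
        segment = reference[:length]          # compare from the beginning
        # One mean shift handles key transposition for this scaling factor.
        shift = sum(segment) / length - sum(scaled) / length
        dist = sum(abs(s + shift - r) for s, r in zip(scaled, segment)) / length
        best = min(best, dist)
    return best


reference = [60] * 8 + [62] * 8 + [64] * 8 + [65] * 8
query = [p + 3 for p in reference[:16]]       # same melody, 3 semitones higher
other = [70, 50] * 16                          # an unrelated "song"
d_same = linear_scaling_distance(query, reference)
d_other = linear_scaling_distance(query, other)
```

A single mean shift per scaling factor is what makes key transposition a one-shot operation here, in contrast to the trial-and-error search used with DTW.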
-27-
Dynamic Time Warping (DTW)
Goal: Allow comparison with high tolerance to tempo variation
Characteristics:
  Robust for irregular tempo variations
  Trial-and-error for dealing with key transposition
  Expensive in computation
  Does not satisfy the triangle inequality
  Some indexing algorithms do exist
#1 method for task 2 in QBSH/MIREX 2006
-28-
Dynamic Time Warping: Type 1
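t: input pitch vector (8 sec, 128 points)
r: reference pitch vector
Local paths: 27-45-63 degrees
DTW recurrence:
$$D(i, j) = |t(i) - r(j)| + \min \left\{ D(i-1, j-2),\ D(i-1, j-1),\ D(i-2, j-1) \right\}$$
[Figure: DTW table with t(i-1), t(i) on the i axis and r(j-1), r(j) on the j axis, showing the local paths into D(i, j)]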
-29-
Dynamic Time Warping: Type 2
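t: input pitch vector (8 sec, 128 points)
r: reference pitch vector
Local paths: 0-45-90 degrees
DTW recurrence:
$$D(i, j) = |t(i) - r(j)| + \min \left\{ D(i-1, j),\ D(i-1, j-1),\ D(i, j-1) \right\}$$
[Figure: DTW table with t(i-1), t(i) on the i axis and r(j-1), r(j) on the j axis, showing the local paths into D(i, j)]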
-30-
Local Path Constraints
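Type 1: 27-45-63 local paths
$$D(i, j) = |t(i) - r(j)| + \min \left\{ D(i-1, j-2),\ D(i-1, j-1),\ D(i-2, j-1) \right\}$$
Type 2: 0-45-90 local paths
$$D(i, j) = |t(i) - r(j)| + \min \left\{ D(i-1, j),\ D(i-1, j-1),\ D(i, j-1) \right\}$$
[Figure: the predecessor cells of D(i, j) for each type: D(i-1, j-2), D(i-1, j-1), D(i-2, j-1) for type 1; D(i-1, j), D(i-1, j-1), D(i, j-1) for type 2]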
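Both local path constraints fit into one table-filling routine that differs only in its list of allowed steps. The sketch below anchors both ends of the path (first frame to first frame, last to last), which is a simplifying assumption; the function name and the zero-based indexing are also illustrative.

```python
INF = float("inf")


def dtw(t, r, local_path="type2"):
    """DTW distance between pitch vectors t and r with both ends anchored."""
    if local_path == "type1":
        # 27-45-63 degrees: predecessors (i-1, j-2), (i-1, j-1), (i-2, j-1)
        steps = [(1, 2), (1, 1), (2, 1)]
    else:
        # 0-45-90 degrees: predecessors (i-1, j), (i-1, j-1), (i, j-1)
        steps = [(1, 0), (1, 1), (0, 1)]
    n, m = len(t), len(r)
    D = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = abs(t[i] - r[j])
            if i == 0 and j == 0:
                D[i][j] = cost
                continue
            previous = min((D[i - di][j - dj] for di, dj in steps
                            if i - di >= 0 and j - dj >= 0), default=INF)
            D[i][j] = cost + previous     # stays INF if the cell is unreachable
    return D[n - 1][m - 1]


slow = [60, 60, 62, 62, 64]               # query sung at a slower tempo
ref = [60, 62, 64]
d_type1 = dtw(slow, ref, "type1")
d_type2 = dtw(slow, ref, "type2")
```

Type 1 skips frames to absorb tempo differences and limits the overall speed ratio to between 1/2 and 2, so some table cells are unreachable; type 2 repeats frames instead and can reach every cell.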
-31-
DTW Paths of “Match Beginning”
We assume the speed of a user's acoustic input is between 1/2 and 2 times that of the intended song.
The right end is free to move.
Typical DTW table size = 128 x 180
[Figure: DTW table with axes i and j]
-32-
DTW Paths of “Match Anywhere”
Both ends are free to move.
Typical DTW table size = 128 x 2880
[Figure: DTW table with axes i and j]
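The difference between "match beginning" and "match anywhere" comes down to which table cells may start and end a path. The sketch below is one possible reading: the query t is always consumed entirely, the path may end at any position of the reference r, and with `free_start=True` it may also begin at any position of r. Using type-2 local paths, and the function name itself, are assumptions made for brevity.

```python
INF = float("inf")


def dtw_partial(t, r, free_start=False):
    """Type-2 DTW matching all of t against a part of r.

    free_start=False: path starts at r[0] ("match beginning").
    free_start=True:  path may start at any r[j] ("match anywhere").
    The path may end at any position of r in both cases.
    """
    n, m = len(t), len(r)
    D = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = abs(t[i] - r[j])
            if i == 0 and (j == 0 or free_start):
                D[i][j] = cost            # a path may begin in this cell
                continue
            candidates = []
            if i > 0:
                candidates.append(D[i - 1][j])
            if i > 0 and j > 0:
                candidates.append(D[i - 1][j - 1])
            if j > 0:
                candidates.append(D[i][j - 1])
            D[i][j] = cost + min(candidates)
    return min(D[n - 1])                  # free right end


song = [60, 60, 62, 62, 64, 64, 65, 65, 67, 67]
middle = [64, 65, 67]                     # query taken from the middle of the song
d_anywhere = dtw_partial(middle, song, free_start=True)
d_beginning = dtw_partial(middle, song)
```

Freeing the start multiplies the number of candidate paths, which is consistent with the much larger table quoted for "match anywhere".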
-33-
DTW Path of “Match Beginning”
-34-
DTW Path of “Match Anywhere”
-35-
DTW Path of “Match Anywhere”
-37-
Key Transposition
Goal: Allow user input in different keys
Method 1: Mean shift and heuristic modification
  5 DTW computations for each song compared
[Figure: key-shift axis from -4 to 4 around the mean, with candidate shifts t, t-2, t+2 (giving t'), t'-1, t'+1]
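One way to read the five-computation heuristic is: start from the mean shift t, also try t-2 and t+2, take the best of the three as t', then try t'-1 and t'+1. The sketch below implements that reading; it is an interpretation of the figure labels, not a confirmed description of the authors' method. The `distance` argument stands for any comparison function (such as DTW), and `best_key_shift` is a hypothetical name.

```python
def best_key_shift(query, reference, distance):
    """Search key shifts using five calls to distance(shifted_query, reference)."""
    def mean(values):
        return sum(values) / len(values)

    def shifted_distance(shift):
        return distance([q + shift for q in query], reference)

    t = mean(reference) - mean(query)                     # mean shift
    results = {s: shifted_distance(s) for s in (t - 2, t, t + 2)}
    t_prime = min(results, key=results.get)               # best of the first three
    for s in (t_prime - 1, t_prime + 1):                  # refine around t'
        results[s] = shifted_distance(s)
    best = min(results, key=results.get)
    return best, results[best]


def mean_abs_difference(a, b):
    """Stand-in distance for the example; a real system would use DTW."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


shift, dist = best_key_shift([60, 62, 64, 65], [61, 63, 65, 70],
                             mean_abs_difference)
```

In this example the last reference note pulls the mean shift to 2, and the refinement step recovers the better shift of 1.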
-38-
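Type-3 DTW: Frame to Note Alignment
DP-based method for filling the table:
Recurrence formula:
$$D(i, j) = |t(i) - r(j)| + \min \left\{ D(i-1, j),\ D(i-1, j-1) \right\}$$
Local constraint: D(i, j) is reached from D(i-1, j) or D(i-1, j-1)
[Figure: a frame-level pitch vector aligned against notes with pitches such as 62, 64, 65, 67]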
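The frame-to-note recurrence can be sketched as a table where each new frame either stays on the current note or advances to the next one. The free end (the query may stop at any note) and the function name are assumptions; note durations are ignored.

```python
INF = float("inf")


def dtw_frame_to_note(frames, notes):
    """Align a frame-level pitch vector to a note pitch sequence.

    frames: query pitch per frame; notes: reference pitch per note.
    Each frame stays on the same note, D(i-1, j), or moves to the
    next note, D(i-1, j-1).
    """
    n, m = len(frames), len(notes)
    D = [[INF] * m for _ in range(n)]
    D[0][0] = abs(frames[0] - notes[0])
    for i in range(1, n):
        for j in range(min(i + 1, m)):    # frame i can reach at most note i
            stay = D[i - 1][j]
            advance = D[i - 1][j - 1] if j > 0 else INF
            D[i][j] = abs(frames[i] - notes[j]) + min(stay, advance)
    return min(D[n - 1])                  # the query may end on any note


notes = [60, 62, 64, 65]
d_exact = dtw_frame_to_note([60, 60, 60, 62, 62, 64], notes)
d_off = dtw_frame_to_note([61, 61, 62, 62], notes)
```

The table has one column per note rather than one per frame, which is where the efficiency gain over frame-to-frame DTW comes from.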
-39-
Type-3 DTW
Characteristics
  Frame-based query input vs. note-based music database
  Note duration unused
  More efficient, less effective
  Heuristics for key transposition
[Figure: mapping path]
-40-
RA (Recursive Alignment)
Characteristics
  Combines characteristics of LS & DTW
  #1 method for task 1 in QBSH/MIREX 2006
[Figure: a typical mapping path]
-41-
Modified Edit Distance
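Note segmentation
Modified edit distance
$$d_{i,j} = \min \begin{cases}
d_{i-1,j} + w(a_i, \emptyset) & \text{(deletion)} \\
d_{i,j-1} + w(\emptyset, b_j) & \text{(insertion)} \\
d_{i-1,j-1} + w(a_i, b_j) & \text{(replacement)} \\
\{ d_{i-k,j-1} + w(a_{i-k+1}, \ldots, a_i, b_j),\ 2 \le k \le i \} & \text{(consolidation)} \\
\{ d_{i-1,j-k} + w(a_i, b_{j-k+1}, \ldots, b_j),\ 2 \le k \le j \} & \text{(fragmentation)}
\end{cases}$$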
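The five cases of the modified edit distance map directly onto a DP table. In the sketch below, notes are (pitch, duration) pairs and every weight function is a hypothetical placeholder, since the slides do not define w: deletion and insertion cost the note's duration, replacement costs the pitch difference plus the duration difference, and consolidation or fragmentation compares a group of notes with a single note.

```python
def w_replace(a, b):
    """Hypothetical weight: pitch difference plus duration difference."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def w_group(group, single):
    """Hypothetical weight for consolidation and fragmentation."""
    pitch_cost = sum(abs(g[0] - single[0]) for g in group)
    duration_cost = abs(sum(g[1] for g in group) - single[1])
    return pitch_cost + duration_cost


def modified_edit_distance(a, b):
    """Edit distance between note lists a and b of (pitch, duration) pairs."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + a[i - 1][1]                       # deletions only
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + b[j - 1][1]                       # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                d[i - 1][j] + a[i - 1][1],                        # deletion
                d[i][j - 1] + b[j - 1][1],                        # insertion
                d[i - 1][j - 1] + w_replace(a[i - 1], b[j - 1]),  # replacement
            ]
            for k in range(2, i + 1):                             # consolidation
                options.append(d[i - k][j - 1]
                               + w_group(a[i - k:i], b[j - 1]))
            for k in range(2, j + 1):                             # fragmentation
                options.append(d[i - 1][j - k]
                               + w_group(b[j - k:j], a[i - 1]))
            d[i][j] = min(options)
    return d[n][m]


sung = [(60, 1), (60, 1), (62, 2)]    # first note split into two by the singer
score = [(60, 2), (62, 2)]
d_split = modified_edit_distance(score, sung)
```

Consolidation and fragmentation let a note that segmentation split or merged by mistake match at low cost, which plain edit distance would count as an insertion or deletion.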
-42-
Challenges in QBSH Systems
Song database preparation
  MIDIs, singing clips, or audio music
Reliable pitch tracking for acoustic input
  Input from mobile devices or noisy karaoke bars
Efficient/effective retrieval
  Karaoke machine: ~10,000 songs
  Internet music search engine: ~500,000,000 songs
-43-
-44-
Goal and Approach
Goal: To retrieve songs effectively within a given response time, say 5 seconds or so
Our strategy
  Multi-stage progressive filtering
  Indexing for different comparison methods
  Repeating pattern identification
-45-
Demo: MIRACLE
MIRACLE: Music Information Retrieval Acoustically via CLuster Engines
Demo page of MIR Lab: http://mirlab.org/new/mir_products.asp
MIRACLE demo: http://cuda.mirlab.org
-46-
Internet Music Search Engine
Client-server distributed computing
Cloud computing via clustered PCs & GPU
[Figure: clients (PC, PDA, cellular phone) send a request (pitch vector) to the master server, which dispatches it to the slave servers (clustered servers) and returns the response (search result)]
-47-
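Challenge 1: Collecting the Music Database
Music files collected from the Internet:
  MIDI files
    For accuracy, the track containing the main melody must be identified manually. Automatic identification achieves a recognition rate of about 85%.
    MIDI file formats are complex and inconsistent
    MIDI main melodies are not clean (intros, overlapping notes, variations, etc.)
  MP3 files
    Pop music: extracting the pitch of the vocals is extremely difficult. According to the ISMIR 2011 competition results, the best pitch recognition rate was 84%.
    Symphonic music: there may be no main melody at all
Manual labeling:
  To support text search, information such as singer, lyrics, and genre must be added.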
-48-
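Challenge 2: Speeding Up Comparison
Factors affecting comparison speed (with representative values)
  Length of the hummed input: 8 seconds (128 pitch points)
  Database size: about 13,000 songs
  Comparison method: LS+DTW
  CPU: Pentium 2G (relatively insensitive to memory size)
  Comparison position
    From the beginning: about 2 seconds
    From the middle
      At the start of the refrain
      At the start of each note: about 45 seconds
      Anywhere: about 60 seconds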
-49-
Response Time of Miracle
8 sec recording of "小毛驢", comparison from beginning:
  LS: 0.4 sec
  DTW: 3.5 sec
  LS+DTW: 0.6 sec
8 sec recording of the refrain of "夢醒時分", comparison from anywhere:
  LS: 40 sec
  DTW: IIS time out
  LS+DTW: 45 sec
  NBDTW: IIS time out
-50-
Could It Be More Efficient?
Algorithms
  Indexing of LS/DTW
  Progressive filtering
New Platforms
  GPU (66 times faster for QBSH!)
  Grid/clustered computing
  Multi-core platforms
-51-
Commercial Applications
www.midomo.com
www.soundhound.com
www.shazam.com
-52-
Conclusions
QBSH
  Fun and interesting way to retrieve music
  Can be extended to singing scoring
  Commercial applications are maturing
Challenges
  How to deal with massive music databases?
  How to extract melody from audio music?