EE E6820: Speech & Audio Processing & Recognition
Lecture 1: Introduction & DSP
1. Sound and information
2. Course structure
3. DSP review: Timescale modification

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering
Spring 2006
E6820 SAPR - Dan Ellis L01 - 2006-01-19 - COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK
Sound and information
• Sound is air pressure variation
• Transducers convert air pressure ↔ voltage
[Figure: mechanical vibration → pressure waves in air → motion of sensor → time-varying voltage v(t), plotted against t]
What use is sound?
• Footsteps examples:
[Figure: two footstep waveforms, amplitude (-0.5 to 0.5) against time / s (0 to 5)]
• Hearing confers an evolutionary advantage
- useful information, complements vision
- ...at a distance, in the dark, around corners
- listeners are highly adapted to ‘natural sounds’ (including speech)
The scope of audio processing
[Figure: "AUDIO PROCESSING" diagram with labels natural, man-made, simple, abstract]
The acoustic communication chain
• Sound is an information bearer
• Received sound reflects source(s) plus effect of environment (channel)
[Figure: message → signal → channel → receiver → decoder, with synthesis, audio processing, and recognition marked along the chain]
Levels of abstraction
• Much processing concerns shifting between levels of abstraction
• Different representations serve different tasks
- separating aspects, making things explicit ...
[Figure: levels from concrete to abstract: sound p(t) → representation (e.g. t-f energy) → ‘information’; Analysis points toward abstract, Synthesis toward concrete]
Course structure
• Goals:
- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies
• Course structure:
- weekly assignments (25%)
- midterm event (25%)
- final project (50%)
• Text: Speech and Audio Signal Processing, Ben Gold & Nelson Morgan, Wiley, 2000, ISBN: 0-471-35154-7
Web-based
• Course website: http://www.ee.columbia.edu/~dpwe/e6820/ for lecture notes, problem sets, examples, ...
• + student web pages for homework etc.
Course outline
• Fundamentals
- L1: DSP
- L2: Acoustics
- L3: Pattern recognition
- L4: Auditory perception
• Audio processing
- L5: Signal models
- L6: Music analysis/synthesis
- L7: Audio compression
- L8: Spatial sound & rendering
• Applications
- L9: Speech recognition
- L10: Music retrieval
- L11: Signal separation
- L12: Multimedia indexing
Weekly Assignments
• Research papers
- journal & conference publications
- summarize & discuss in class
- written summaries on web page
• Practical experiments
- MATLAB-based (+ Signal Processing Toolbox)
- direct experience of sound processing
- skills for project
• Book sections
Final Project
• Most significant part of course (50% of grade)
• Oral proposals mid-semester; presentations in final class + website
• Scope
- practical (Matlab recommended)
- identify a problem; try some solutions
- evaluation
• Topic
- few restrictions within world of audio
- investigate other resources
- develop in discussion with me
• Copying
Examples of past projects
• Automatic Prosody Classification
• Model-based note transcription
DSP review: Digital Signals
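$x_d[n] = Q(x_c(nT))$
- sampling interval $T$, sampling frequency $\Omega_T = \frac{2\pi}{T}$
- quantizer $Q(y) = \varepsilon \cdot \lfloor y / \varepsilon \rfloor$
• Discrete-time sampling limits bandwidth
• Discrete-level quantization limits dynamic range
[Figure: continuous waveform against time with its samples, spaced T apart in time and ε apart in level]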
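The sampling and quantization step can be sketched in a few lines of Python. This is a minimal illustration, assuming the quantizer rounds down to a multiple of the step size ε; the 100 Hz test tone, the 8 kHz rate, and the function name `sample_and_quantize` are hypothetical choices, not values from the lecture.

```python
import math

def sample_and_quantize(xc, T, eps, num_samples):
    """Return x_d[n] = Q(x_c(nT)) with Q(y) = eps * floor(y / eps)."""
    return [eps * math.floor(xc(n * T) / eps) for n in range(num_samples)]

# Hypothetical test signal: a 100 Hz sine sampled at 8 kHz, step size 1/8.
T = 1.0 / 8000.0
eps = 0.125
xd = sample_and_quantize(lambda t: math.sin(2 * math.pi * 100.0 * t), T, eps, 80)

# Sampling frequency in rad/s, Omega_T = 2*pi / T.
omega_T = 2 * math.pi / T
```

Every output value is a multiple of `eps`, so a smaller step size buys more dynamic range at the cost of more levels to store.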
The speech signal: time domain
• Speech is a sequence of different sound types:
[Figure: waveform of the utterance "has a watch thin as a dime" over 1.4 to 2.6 s (time/s), with four zoomed panels]
- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: smooth transition (“watch”)
- Stop burst: transient (“dime”)
Timescale modification (TSM)
• Can we modify a sound to make it ‘slower’? i.e. speech pronounced more slowly
- e.g. to help comprehension, analysis
- or more quickly for ‘speed listening’?
• Why not just slow it down?
- $x_s(t) = x_o\left(\frac{t}{r}\right)$, $r$ = slowdown factor ($> 1$ → slower)
- equiv. to playback at a different sampling rate
[Figure: waveform over 2.35 to 2.6 s (time/s): Original, and 2x slower (r = 2)]
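To see why plain slowdown is unsatisfying, the sketch below resamples a tone by linear interpolation and counts zero crossings. It is an illustration only: the 200 Hz tone, the 8 kHz rate, and the helper names are assumptions, and linear interpolation stands in for a proper resampler.

```python
import math

def slow_down(x, r):
    """Return x_s[n] = x_o(n / r) using linear interpolation between samples."""
    out = []
    n_out = int((len(x) - 1) * r) + 1
    for n in range(n_out):
        pos = n / r
        i = int(pos)
        frac = pos - i
        nxt = x[min(i + 1, len(x) - 1)]
        out.append((1.0 - frac) * x[i] + frac * nxt)
    return out

def count_zero_crossings(x):
    """Count sign changes, a crude proxy for frequency."""
    return sum(1 for a, b in zip(x, x[1:]) if (a < 0) != (b < 0))

fs = 8000  # assumed sample rate in Hz
tone = [math.sin(2 * math.pi * 200.0 * n / fs + 0.1) for n in range(fs)]
slow = slow_down(tone, 2.0)
```

The slowed signal is about twice as long but has the same number of zero crossings, so the crossings per second, and with them the pitch, drop by the factor r. Timescale modification aims to avoid exactly this side effect.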
Time-domain TSM
• Problem: want to preserve local time structure but alter global time structure
• Repeat segments
- but: artefacts from abrupt edges
• Cross-fade & overlap
$y^m[mL + n] = y^{m-1}[mL + n] + w[n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n\right]$
[Figure: original waveform (2.35 to 2.6 s) cut into segments 1 to 6, and the stretched output (4.7 to 4.95 s) in which the segments repeat (1 1 2 2 3 3 ...), overlapped and cross-faded; time / s]
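The cross-fade and overlap recursion can be sketched as a plain overlap-add loop. The window length of 2L (50% overlap) and the Hann window are assumptions made here so that overlapping windows sum to one; the slides do not state them, and `ola_stretch` is a hypothetical name.

```python
import math

def ola_stretch(x, r, L):
    """Time-domain TSM by overlap-add: y[mL+n] += w[n] * x[floor(m/r)*L + n]."""
    N = 2 * L  # window length: assumed 50% overlap (not stated in the slides)
    # Periodic Hann window; shifted copies at hop L sum to 1.
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
    num_frames = int(r * (len(x) - N) / L) + 1
    y = [0.0] * ((num_frames - 1) * L + N)
    for m in range(num_frames):
        src = int(m / r) * L  # start of the source segment, floor(m/r)*L
        for n in range(N):
            y[m * L + n] += w[n] * x[src + n]
    return y

# With r = 2, floor(m/2) selects each source segment twice in a row.
y = ola_stretch([1.0] * 1000, r=2.0, L=50)
```

For a constant input the interior of the output stays at 1, which confirms that the cross-fade itself adds no amplitude ripple. Real signals still suffer when the repeated segments do not line up in phase, which is what motivates the synchronized variant.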
Synchronous Overlap-Add (SOLA)
• Idea: Allow some leeway in placing window to optimize alignment of waveforms
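• Hence,
$y^m[mL + n] = y^{m-1}[mL + n] + w[n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K_m\right]$
where $K_m$ is chosen by cross-correlation:
$K_m = \underset{0 \le K \le K_U}{\operatorname{argmax}} \frac{\sum_{n=0}^{N_{ov}} y^{m-1}[mL + n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K\right]}{\sqrt{\sum \left(y^{m-1}[mL + n]\right)^2 \sum \left(x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K\right]\right)^2}}$
[Figure: overlapping segments 1 and 2; $K_m$ maximizes alignment of 1 and 2]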
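The offset search can be sketched as a brute-force scan over candidate shifts, scoring each with a normalized cross-correlation. This is a sketch under assumptions: `best_offset`, `y_tail` (the overlap region of the output built so far), and the sine test signal are hypothetical names and data, not part of the lecture.

```python
import math

def best_offset(y_tail, x, start, K_max):
    """Return K in [0, K_max] maximizing the normalized cross-correlation
    between y_tail and x[start + K : start + K + len(y_tail)]."""
    best_K, best_score = 0, -float("inf")
    e_y = sum(v * v for v in y_tail)
    for K in range(K_max + 1):
        seg = x[start + K : start + K + len(y_tail)]
        num = sum(a * b for a, b in zip(y_tail, seg))
        den = math.sqrt(e_y * sum(v * v for v in seg))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_K, best_score = K, score
    return best_K

# Hypothetical check: the overlap region is the source delayed by 7 samples.
period = 40
x = [math.sin(2 * math.pi * n / period) for n in range(200)]
y_tail = x[7:67]
K_m = best_offset(y_tail, x, start=0, K_max=30)
```

Here the search recovers the 7-sample shift, so the next window would be added in phase with the existing output. The normalization keeps loud segments from winning on energy alone.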
The Fourier domain
Fourier Series (periodic continuous x)
$x(t) = \sum_k c_k \cdot e^{jk\Omega_0 t}$
$c_k = \frac{1}{T} \int_{-T/2}^{T/2} x(t) \cdot e^{-jk\Omega_0 t} \, dt$
$\Omega_0 = \frac{2\pi}{T}$
[Figure: periodic waveform x(t) against t, and its coefficient magnitudes |c_k| for k = 1 to 7, decaying as 1/k]
Fourier Transform (aperiodic continuous x)
$x(t) = \frac{1}{2\pi} \int X(j\Omega) \cdot e^{j\Omega t} \, d\Omega$
$X(j\Omega) = \int x(t) \cdot e^{-j\Omega t} \, dt$
[Figure: x(t) against time / sec (0 to 0.008), and |X(jΩ)| as level / dB against freq / Hz (0 to 8000)]
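As a numerical check of the Fourier series analysis integral, the sketch below approximates the coefficients of a square wave with the midpoint rule. It assumes the normalization $c_k = \frac{1}{T}\int x(t) e^{-jk\Omega_0 t}\,dt$ with $\Omega_0 = 2\pi/T$; the ±1 square wave and the function names are illustrative choices.

```python
import cmath
import math

def fourier_coeff(x, T, k, steps=20000):
    """Approximate c_k = (1/T) * integral over [-T/2, T/2] of
    x(t) * exp(-j k Omega0 t) dt with the midpoint rule, Omega0 = 2*pi/T."""
    omega0 = 2 * math.pi / T
    dt = T / steps
    total = 0j
    for i in range(steps):
        t = -T / 2 + (i + 0.5) * dt
        total += x(t) * cmath.exp(-1j * k * omega0 * t) * dt
    return total / T

def square(t):
    """One period of a +/-1 square wave on [-T/2, T/2)."""
    return 1.0 if t >= 0 else -1.0

mags = [abs(fourier_coeff(square, 1.0, k)) for k in range(1, 8)]
```

The odd-numbered magnitudes come out close to 2/(πk) and the even ones close to zero, which is the 1/k decay expected for a waveform with step discontinuities.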
Discrete-time Fourier
DT Fourier Transform (aperiodic sampled x)
$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega}) \cdot e^{j\omega n} \, d\omega$
$X(e^{j\omega}) = \sum_n x[n] \cdot e^{-j\omega n}$
[Figure: x[n] against n, and |X(e^{jω})| against ω, periodic with period 2π]
Discrete Fourier Transform (N-point x)
$x[n] = \frac{1}{N} \sum_k X[k] \cdot e^{j\frac{2\pi kn}{N}}$
$X[k] = \sum_n x[n] \cdot e^{-j\frac{2\pi kn}{N}}$
[Figure: x[n] against n, and |X[k]| as samples of |X(e^{jω})| against k]
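The DFT pair can be checked directly with a naive O(N²) implementation. This is a sketch for illustration, assuming the forward transform carries no scale factor and the inverse carries 1/N; the 8-point test sequence is arbitrary.

```python
import cmath

def dft(x):
    """X[k] = sum_n x[n] * exp(-j 2 pi k n / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """x[n] = (1/N) * sum_k X[k] * exp(+j 2 pi k n / N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

x = [1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0]
X = dft(x)
x_back = idft(X)
```

The round trip returns the original samples to within rounding error, and `X[0]` equals the sum of the samples. In practice an FFT routine would replace these loops.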
Sampling and aliasing
• Discrete-time signals equal the continuous time signal at discrete sampling instants:
$x_d[n] = x_c(nT)$
• Sampling cannot represent rapid fluctuations
$\sin\left(\left(\Omega_M + \frac{2\pi}{T}\right) Tn\right) = \sin(\Omega_M Tn) \quad \forall n \in \mathbb{Z}$
• Nyquist limit ($\Omega_T/2$) from periodic spectrum:
[Figure: periodic spectrum G_p(jΩ) against Ω: the “baseband” signal G_a(jΩ) spans −Ω_M to Ω_M, and its “alias” copies around ±Ω_T reach in to Ω_T − Ω_M and −Ω_T + Ω_M]
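The aliasing identity is easy to confirm numerically: a tone shifted up by the sampling frequency produces the same samples. The 8 kHz rate and 1 kHz tone below are assumed values chosen only for illustration.

```python
import math

T = 1.0 / 8000.0                          # hypothetical sampling interval (8 kHz)
omega_M = 2 * math.pi * 1000.0            # 1 kHz tone, in rad/s
omega_alias = omega_M + 2 * math.pi / T   # same tone shifted up by Omega_T

low = [math.sin(omega_M * T * n) for n in range(32)]
high = [math.sin(omega_alias * T * n) for n in range(32)]

# Largest sample-by-sample difference between the two sequences.
max_diff = max(abs(a - b) for a, b in zip(low, high))
```

`max_diff` is at the level of floating-point rounding, so after sampling the 9 kHz tone cannot be told apart from the 1 kHz one. This is why the input must be band-limited below Ω_T/2 before sampling.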
Speech sounds in the Fourier domain
- dB = 20·log10(amplitude) = 10·log10(power)
• Voiced spectrum has pitch + formants
[Figure: four speech sounds shown in the time domain (waveform against time / s) and the frequency domain (energy / dB against freq / Hz)]
- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: transition (“watch”)
- Stop: transient (“dime”)
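The decibel relation used on the frequency-domain axes, dB = 20·log10(amplitude) = 10·log10(power), can be checked with a short calculation. The function names and the amplitude value are illustrative assumptions.

```python
import math

def db_from_amplitude(a):
    """Level in dB from a linear amplitude."""
    return 20.0 * math.log10(a)

def db_from_power(p):
    """Level in dB from a power value."""
    return 10.0 * math.log10(p)

amp = 0.1
level_a = db_from_amplitude(amp)     # about -20 dB
level_p = db_from_power(amp ** 2)    # same level, since power = amplitude^2
```

Both routes give the same level because squaring the amplitude doubles its logarithm, which the factor of 10 instead of 20 compensates.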
Short-time Fourier Transform
• Want to localize energy in both time and freq
→ break sound into short-time pieces, calculate DFT of each one
• Mathematically:
$X[k, m] = \sum_{n=0}^{N-1} x[n] \cdot w[n - mL] \cdot \exp\left(-j \frac{2\pi k (n - mL)}{N}\right)$
[Figure: waveform (2.35 to 2.6 s, time / s) cut by short-time windows m = 0, 1, 2, 3 starting at 0, L, 2L, 3L; the DFT of each window gives one column (index k) of a freq / Hz (0 to 4000) plot]
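A direct, unoptimized STFT can be sketched as a windowed DFT repeated every L samples. The Hann window, the 8 kHz sample rate, and the 1 kHz test tone are assumptions for illustration, and the code indexes each frame locally (n runs from 0 to N−1 within the window starting at mL).

```python
import cmath
import math

def stft(x, N, L):
    """Return X[m][k]: N-point DFTs of Hann-windowed frames taken every L samples."""
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
    frames = []
    for start in range(0, len(x) - N + 1, L):   # start = m * L
        seg = [x[start + n] * w[n] for n in range(N)]
        frames.append([sum(seg[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                           for n in range(N)) for k in range(N)])
    return frames

fs = 8000  # assumed sample rate in Hz
x = [math.sin(2 * math.pi * 1000.0 * n / fs) for n in range(512)]
X = stft(x, N=64, L=32)

# Strongest bin in the first frame, searching 0 .. N/2.
peak_bin = max(range(33), key=lambda k: abs(X[0][k]))
```

For the 1 kHz tone the peak lands in bin 1000·64/8000 = 8 in every frame. A practical version would use an FFT for each frame instead of the inner sum.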
The Spectrogram
• Plot STFT $|X[k, m]|$ as a grayscale image:
[Figure: waveform and spectrogram; freq / Hz (0 to 4000) against time / s, with intensity / dB from -50 to 10 shown as gray level]
Time-frequency tradeoff
• Longer window w[n] gains frequency resolution at cost of time resolution
[Figure: two spectrograms of the same utterance, freq / Hz (0 to 4000) against time / s (1.4 to 2.6), level / dB from -50 to 10: Window = 256 pt (Narrowband) and Window = 48 pt (Wideband)]
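The tradeoff can be put in rough numbers for the two window lengths in the figure. The 8 kHz sample rate is an assumption inferred from the 0 to 4000 Hz frequency axis, and fs/N is only an order-of-magnitude figure for frequency resolution (the window shape changes the constant).

```python
fs = 8000.0  # assumed sample rate in Hz (the plots show 0 to 4000 Hz)

def window_resolution(num_points, fs):
    """Return (window duration in seconds, approximate frequency spacing in Hz)."""
    return num_points / fs, fs / num_points

narrow = window_resolution(256, fs)   # long window: narrowband analysis
wide = window_resolution(48, fs)      # short window: wideband analysis
```

Under this assumption the 256-point window spans 32 ms with roughly 31 Hz spacing, fine enough to separate pitch harmonics, while the 48-point window spans 6 ms with roughly 167 Hz spacing, short enough to follow individual pitch pulses.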
Speech sounds on the Spectrogram
• Most popular speech visualization
• Wideband (short window) better than narrowband (long window) to see formants
[Figure: spectrogram of "has a watch thin as a dime", freq / Hz (0 to 4000) against time/s (1.4 to 2.6), with labelled regions]
- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: transition (“watch”)
- Stop: transient (“dime”)
TSM with the Spectrogram
• Just stretch out the spectrogram?
- how to resynthesize? spectrogram is only $|Y[k, m]|$
[Figure: original and time-stretched spectrograms, Frequency (0 to 4000) against Time (0 to 1.4)]
The Phase Vocoder
• Timescale modification in the STFT domain
• Magnitude from ‘stretched’ spectrogram:
$|Y[k, m]| = \left|X\left[k, \frac{m}{r}\right]\right|$
- e.g. by linear interpolation
• But preserve phase increment between slices:
$\dot{\theta}_Y[k, m] = \dot{\theta}_X\left[k, \frac{m}{r}\right]$
- e.g. by discrete differentiator
• Does right thing for single sinusoid
- keeps overlapped parts of sinusoid aligned
[Figure: overlapped segments of a sinusoid against time]
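The two rules (interpolated magnitude, preserved phase increment) can be sketched as a per-bin loop over STFT frames. This is a simplified illustration, not the lecture's implementation: `pv_stretch` is a hypothetical name, the phase increment is taken as a wrapped first difference between adjacent input frames, and resynthesis by inverse STFT is left out.

```python
import cmath
import math

def pv_stretch(X, r):
    """Stretch a list of STFT frames X[m][k] by factor r (>1 is slower)."""
    num_bins = len(X[0])
    phase = [cmath.phase(c) for c in X[0]]   # running output phase per bin
    Y = []
    num_out = int((len(X) - 1) * r)
    for m in range(num_out):
        pos = m / r                          # fractional position in the input
        i = int(pos)
        frac = pos - i
        frame = []
        for k in range(num_bins):
            # Magnitude: linear interpolation between neighbouring input frames.
            mag = (1.0 - frac) * abs(X[i][k]) + frac * abs(X[i + 1][k])
            frame.append(cmath.rect(mag, phase[k]))
            # Phase advance between adjacent input frames (discrete
            # differentiator), wrapped to the range [-pi, pi].
            dphi = cmath.phase(X[i + 1][k]) - cmath.phase(X[i][k])
            dphi -= 2 * math.pi * round(dphi / (2 * math.pi))
            phase[k] += dphi
        Y.append(frame)
    return Y

# Hypothetical single-bin input: a sinusoid advancing 0.3 rad per frame.
X = [[cmath.exp(1j * 0.3 * m)] for m in range(10)]
Y = pv_stretch(X, 2.0)
```

The output has about twice as many frames, yet each still advances by 0.3 rad per hop, so successive overlapped windows of the sinusoid stay aligned when they are added back together.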
General issues in TSM
• Time window
- stretching a narrowband spectrogram
• Malleability of different sounds
- vowels stretch well, stops lose nature
• Not a well-formed problem?
- want to alter time without frequency ... but time and frequency are not separate!
- ‘satisfying’ result is a subjective judgement
→ solution depends on auditory perception...
Summary
• Information in sound
- lots of it, multiple levels of abstraction
• Course overview
- survey of audio processing topics
- practicals, readings, project
• DSP review
- digital signals, time domain
- Fourier domain, STFT
• Timescale modification
- properties of the speech signal
- time-domain
- phase vocoder