E6820 SAPR - Dan Ellis L01 - 2003-01-27 - 1
EE E6820: Speech & Audio Processing & Recognition
Lecture 1: Introduction & DSP
1. Sound and information
2. Course structure
3. DSP review: Timescale modification
Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering
Spring 2003
Sound and information
• Sound is air pressure variation
• Transducers convert air pressure ↔ voltage
[Figure: mechanical vibration → pressure waves in air → motion of sensor → time-varying voltage v(t)]
What use is sound?
• Footsteps examples:
[Figure: two footstep waveforms, amplitude vs. time / s, 0 to 5 s]
• Hearing confers an evolutionary advantage
- useful information, complements vision
- ...at a distance, in the dark, around corners
- listeners are highly adapted to ‘natural sounds’ (including speech)
The scope of audio processing
[Figure: AUDIO PROCESSING, with axes labeled natural / man-made and simple / abstract]
The acoustic communication chain
• Sound is an information bearer
• Received sound reflects source(s) plus effect of environment (channel)
[Figure: message → signal → channel → receiver → decoder; stages labeled synthesis, audio processing, recognition]
Levels of abstraction
• Much processing concerns shifting between levels of abstraction
• Different representations serve different tasks
- separating aspects, making things explicit ...
[Figure: levels from concrete to abstract: sound p(t) → representation (e.g. t-f energy) → ‘information’; Analysis moves toward abstract, Synthesis toward concrete]
Course structure
• Goals:
- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies (esp. ASR)
• Course structure:
- weekly assignments (25%)
- midterm exam (25%)
- final project (50%)
• Text:
Speech and Audio Signal Processing
Ben Gold & Nelson Morgan, Wiley, 2000, ISBN: 0-471-35154-7
Web-based
• Course website:
http://www.ee.columbia.edu/~dpwe/e6820/
for lecture notes, problem sets, examples, ...
• + student web pages for homework etc.
Course outline
Fundamentals
- L1: DSP
- L2: Acoustics
- L3: Pattern recognition
- L4: Auditory perception
Audio processing
- L5: Signal models
- L6: Music analysis/synthesis
- L7: Audio compression
- L8: Spatial sound & rendering
Speech recognition
- L9: Speech features
- L10: Sequence recognition
- L11: Recognizer training
- L12: Systems & applications
Weekly Assignments
• Research papers
- journal & conference publications
- summarize & discuss in class
- written summaries on web page
• Practical experiments
- MATLAB-based (+ Signal Processing Toolbox)
- direct experience of sound processing
- skills for project
• Book sections
+ questions from book
Final Project
• Most significant part of course (50% of grade)
• Oral proposals mid-semester; presentations in final class + website
• Scope
- practical (MATLAB recommended)
- identify a problem; try some solutions
- evaluation
• Topic
- few restrictions within world of audio
- investigate other resources
- develop in discussion with me
Examples of past projects
• Detecting airplane noise
- e.g. for environment monitoring
• Separating speakers in recorded meetings
- based on dummy-head binaural cues
DSP review: Digital Signals
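- sampling interval $T$, sampling frequency $\Omega_T = \frac{2\pi}{T}$
- quantizer $Q(y) = \epsilon \cdot \lfloor y / \epsilon \rfloor$
$x_d[n] = Q(x_c(nT))$
- Discrete-time sampling limits bandwidth
- Discrete-level quantization limits dynamic range
[Figure: continuous signal vs. time, with sample spacing T and quantization step ε]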
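The sampling and quantization step x_d[n] = Q(x_c(nT)) can be sketched in a few lines. This is only an illustration: it assumes a floor-based uniform quantizer Q(y) = ε·floor(y/ε), and the 100 Hz test tone, the 8 kHz rate and the step size are hypothetical values chosen for the example.

```python
import math

def quantize(y, eps):
    """Uniform quantizer Q(y) = eps * floor(y / eps) (assumed floor-based)."""
    return eps * math.floor(y / eps)

def sample_and_quantize(xc, T, eps, num_samples):
    """Return x_d[n] = Q(x_c(nT)) for n = 0 .. num_samples - 1."""
    return [quantize(xc(n * T), eps) for n in range(num_samples)]

def xc(t):
    """Hypothetical continuous-time signal: a 100 Hz sine."""
    return math.sin(2 * math.pi * 100.0 * t)

T = 1.0 / 8000.0            # sampling interval (8 kHz, an assumed rate)
eps = 0.125                 # quantizer step size (assumed)
omega_T = 2 * math.pi / T   # sampling frequency in rad/s
xd = sample_and_quantize(xc, T, eps, 80)
```

Every value in `xd` is a multiple of `eps` and lies at most one step below the true sample, which is the sense in which quantization limits dynamic range.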
The speech signal: time domain
• Speech is a sequence of different sound types:
[Figure: speech waveform, 1.4 to 2.6 s, with word labels (“watch thin as a dime”) and four zoomed panels:]
- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: smooth transition (“watch”)
- Stop burst: transient (“dime”)
Timescale modification (TSM)
• Can we modify a sound to make it ‘slower’? i.e. speech pronounced more slowly
- e.g. to help comprehension, analysis
- or more quickly for ‘speed listening’?
• Why not just slow it down?
- $x_s(t) = x_o(t / r)$, $r$ = slowdown factor
- equiv. to playback at a different sampling rate
[Figure: waveforms, Original vs. 2x slower, 2.35 to 2.6 s]
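A small sketch of the naive slowdown x_s(t) = x_o(t/r) on sampled data shows why it is not enough. It assumes linear interpolation between samples; the 200 Hz tone and the helper names are hypothetical. The stretched signal contains the same number of cycles spread over r times as many samples, so every frequency is scaled by 1/r.

```python
import math

def slow_down(x, r):
    """Return y[n] ~ x(n / r), using linear interpolation between samples."""
    out_len = int((len(x) - 1) * r) + 1
    y = []
    for n in range(out_len):
        pos = n / r                      # fractional read position in x
        i = min(int(pos), len(x) - 2)    # left neighbour (clamped at the end)
        frac = pos - i
        y.append((1 - frac) * x[i] + frac * x[i + 1])
    return y

def count_upward_zero_crossings(x):
    """Number of negative-to-non-negative transitions, i.e. cycles seen."""
    return sum(1 for a, b in zip(x, x[1:]) if a < 0 <= b)

fs = 8000  # assumed sampling rate, Hz
x = [math.sin(2 * math.pi * 200 * n / fs + 0.1) for n in range(fs // 10)]
y = slow_down(x, 2.0)

cycles_in = count_upward_zero_crossings(x)
cycles_out = count_upward_zero_crossings(y)
```

Here `cycles_out` equals `cycles_in` while `y` is about twice as long as `x`, so at the original playback rate the tone is heard an octave lower.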
Time-domain TSM
• Problem: want to preserve local time structure but alter global time structure
• Repeat segments
- but: artefacts from abrupt edges
• Cross-fade & overlap
$y^m[mL + n] = y^{m-1}[mL + n] + w[n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n\right]$
[Figure: original waveform (2.35 to 2.6 s) cut into segments 1 to 6; stretched output (4.7 to 4.95 s) built from repeated segments 1 1 2 2 3 3 4 4 5 5 6]
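The cross-fade and overlap recursion y^m[mL+n] = y^(m-1)[mL+n] + w[n]·x[floor(m/r)·L + n] can be sketched as below. The window is an assumption: a periodic Hann window of length 2L, chosen because copies spaced L apart sum to one. The function name is hypothetical.

```python
import math

def ola_stretch(x, r, L):
    """Overlap-add TSM: output frame m (hop L) is a windowed copy of the
    input segment starting at floor(m / r) * L."""
    N = 2 * L  # window length, so successive output frames overlap by 50%
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
    num_frames = int(r * (len(x) - N) / L) + 1
    y = [0.0] * ((num_frames - 1) * L + N)
    for m in range(num_frames):
        src = int(m / r) * L              # read position in the input
        for n in range(N):
            y[m * L + n] += w[n] * x[src + n]
    return y

# A constant input should come out constant wherever two windows overlap.
x = [1.0] * 400
y = ola_stretch(x, 2.0, 50)
```

For a periodic input the overlapping copies are generally not in phase, which is the artefact that motivates letting the read position shift slightly.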
Synchronous Overlap-Add (SOLA)
• Idea: Allow some leeway in placing window to optimize alignment of waveforms
• Hence,
$y^m[mL + n] = y^{m-1}[mL + n] + w[n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K_m\right]$
where $K_m$ chosen by cross-correlation:
$K_m = \underset{0 \le K \le K_U}{\operatorname{argmax}} \; \frac{\sum_{n=0}^{N_{ov}} y^{m-1}[mL + n] \cdot x\left[\lfloor m/r \rfloor L + n + K\right]}{\sqrt{\sum \left(y^{m-1}[mL + n]\right)^2 \; \sum \left(x\left[\lfloor m/r \rfloor L + n + K\right]\right)^2}}$
[Figure: overlapping waveform segments 1 and 2; $K_m$ maximizes alignment of 1 and 2]
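The search for K_m can be sketched as a brute-force loop over candidate offsets, scoring each by normalized cross-correlation against the overlap region already written to the output. The function name and arguments are hypothetical, and the square-root normalization is the standard form, assumed here.

```python
import math

def best_offset(y_tail, x, start, K_max):
    """Return the K in [0, K_max] that maximizes the normalized
    cross-correlation between y_tail and x[start + K : start + K + len(y_tail)]."""
    best_K, best_score = 0, -float("inf")
    energy_y = sum(v * v for v in y_tail)
    for K in range(K_max + 1):
        seg = x[start + K : start + K + len(y_tail)]
        if len(seg) < len(y_tail):
            break  # ran off the end of the input
        num = sum(a * b for a, b in zip(y_tail, seg))
        den = math.sqrt(energy_y * sum(b * b for b in seg))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_K, best_score = K, score
    return best_K

# A sine with a 40-sample period: the existing output ends with x[10:60],
# and the nominal read position is 3, so an offset of 7 realigns the phase.
x = [math.sin(2 * math.pi * n / 40) for n in range(200)]
K = best_offset(x[10:60], x, start=3, K_max=30)
```

In a full SOLA loop this offset would be added to the read position before the windowed segment is overlap-added.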
The Fourier domain
Fourier Series (periodic continuous x)
$x(t) = \sum_k c_k \, e^{j k \Omega_0 t}$
$c_k = \frac{1}{T} \int_{-T/2}^{T/2} x(t) \, e^{-j k \Omega_0 t} \, dt$, with $\Omega_0 = 2\pi / T$
[Figure: periodic x(t) vs. t; |c_k| vs. k = 1 ... 7]
Fourier Transform (aperiodic continuous x)
$x(t) = \frac{1}{2\pi} \int X(j\Omega) \, e^{j\Omega t} \, d\Omega$
$X(j\Omega) = \int x(t) \, e^{-j\Omega t} \, dt$
[Figure: x(t) vs. time / sec (0 to 0.008); |X(jΩ)| level / dB vs. freq / Hz (0 to 8000)]
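As a numerical check of the Fourier series analysis formula, taken here as c_k = (1/T) ∫ x(t) e^(-jkΩ0t) dt over one period with Ω0 = 2π/T, the sketch below approximates the integral with a midpoint rule. The square wave is a hypothetical test signal, not necessarily the one in the slide's figure.

```python
import cmath
import math

def fourier_coeff(x, T, k, steps=4096):
    """Approximate c_k = (1/T) * integral_{-T/2}^{T/2} x(t) e^{-j k W0 t} dt
    with the midpoint rule, where W0 = 2*pi/T."""
    W0 = 2 * math.pi / T
    dt = T / steps
    total = 0j
    for i in range(steps):
        t = -T / 2 + (i + 0.5) * dt
        total += x(t) * cmath.exp(-1j * k * W0 * t) * dt
    return total / T

def square(t):
    """Hypothetical odd square wave of period T = 2, defined on one period."""
    return 1.0 if t >= 0 else -1.0

T = 2.0
mags = [abs(fourier_coeff(square, T, k)) for k in range(1, 8)]
```

For this signal the magnitudes follow the familiar pattern 2/(πk) for odd k and zero for even k.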
Discrete-time Fourier
DT Fourier Transform (aperiodic sampled x)
$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega}) \, e^{j\omega n} \, d\omega$
$X(e^{j\omega}) = \sum_n x[n] \, e^{-j\omega n}$
[Figure: x[n] vs. n; |X(e^{jω})| vs. ω, periodic with period 2π]
Discrete Fourier Transform (N-point x)
$x[n] = \frac{1}{N} \sum_k X[k] \, e^{j 2\pi k n / N}$
$X[k] = \sum_n x[n] \, e^{-j 2\pi k n / N}$
[Figure: x[n] vs. n; |X[k]| as samples of |X(e^{jω})|]
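The DFT pair can be written directly from its definition, X[k] = Σ x[n] e^(-j2πkn/N) with the 1/N factor placed on the inverse. This is a deliberately slow O(N²) sketch for illustration, not an FFT; the 8-point cosine is a hypothetical test input.

```python
import cmath
import math

def dft(x):
    """X[k] = sum_n x[n] * exp(-j 2 pi k n / N), straight from the definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """x[n] = (1/N) * sum_k X[k] * exp(+j 2 pi k n / N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

# A cosine with exactly 2 cycles in 8 samples lands in bins 2 and 6 (= -2).
x = [math.cos(2 * math.pi * 2 * n / 8) for n in range(8)]
X = dft(x)
x_back = idft(X)
```

Bins 2 and 6 each hold N/2 = 4 and all others are zero to rounding error; `x_back` recovers `x`.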
Sampling and aliasing
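• Discrete-time signals equal the continuous-time signal at discrete sampling instants:
$x_d[n] = x_c(nT)$
• Sampling cannot represent rapid fluctuations:
$\sin\left(\left(\Omega_M + \frac{2\pi}{T}\right) T n\right) = \sin(\Omega_M T n) \quad \forall n \in \mathbb{Z}$
[Figure: two sinusoids that coincide at the sampling instants, n = 0 to 10]
• Nyquist limit ($\Omega_T / 2$) from periodic spectrum:
[Figure: spectrum vs. Ω with “baseband” signal $G_a(j\Omega)$ between $-\Omega_M$ and $\Omega_M$, and its “alias” copies in $G_p(j\Omega)$ around $\pm\Omega_T$, with edges at $\Omega_T - \Omega_M$ and $-\Omega_T + \Omega_M$]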
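The aliasing identity sin((Ω_M + 2π/T)·Tn) = sin(Ω_M·Tn) can be checked numerically. The 8 kHz sampling rate and the 1 kHz tone below are hypothetical values for the illustration.

```python
import math

fs = 8000.0          # assumed sampling rate, Hz
T = 1.0 / fs         # sampling interval
f_M = 1000.0         # baseband frequency, Hz
f_alias = f_M + fs   # same frequency shifted up by the sampling frequency

base = [math.sin(2 * math.pi * f_M * n * T) for n in range(32)]
alias = [math.sin(2 * math.pi * f_alias * n * T) for n in range(32)]

# Largest sample-by-sample difference between the two sampled sinusoids.
max_diff = max(abs(a - b) for a, b in zip(base, alias))
```

`max_diff` is zero up to floating-point rounding: once sampled, the 9 kHz tone cannot be told apart from the 1 kHz tone.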
Speech sounds in the Fourier domain
- dB = 20·log10(amplitude) = 10·log10(power)
• Voiced spectrum has pitch + formants
[Figure: four speech segments in the time domain (amplitude vs. time / s) and the frequency domain (energy / dB vs. freq / Hz, 0 to 4000 Hz):]
- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: transition (“watch”)
- Stop: transient (“dime”)
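The two decibel forms agree because power is proportional to amplitude squared. A minimal sketch, with hypothetical function names and a reference level of 1 assumed:

```python
import math

def amplitude_to_db(amplitude):
    """dB = 20 * log10(amplitude), relative to an amplitude of 1."""
    return 20.0 * math.log10(amplitude)

def power_to_db(power):
    """dB = 10 * log10(power), relative to a power of 1."""
    return 10.0 * math.log10(power)

a = 0.1
db_from_amplitude = amplitude_to_db(a)   # amplitude 0.1 is about -20 dB
db_from_power = power_to_db(a * a)       # power 0.01 is the same level
```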
Short-time Fourier Transform
• Want to localize energy in both time and freq
→ break sound into short-time pieces, calculate DFT of each one
• Mathematically:
$X[k, m] = \sum_{n=mL}^{mL+N-1} x[n] \cdot w[n - mL] \cdot \exp\left(-j \frac{2\pi k (n - mL)}{N}\right)$
where $w[n]$ is an $N$-point window and $L$ is the hop between successive frames
[Figure: waveform (2.35 to 2.6 s) cut by short-time windows at L, 2L, 3L; the DFT of each piece forms one column of a time-frequency image (freq / Hz, 0 to 4000)]
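A direct, unoptimized sketch of the STFT: slide an N-point window along the signal in hops of L and take the DFT of each windowed piece. The periodic Hann window and the function name are assumptions; a practical version would use an FFT.

```python
import cmath
import math

def stft(x, N, L):
    """Return frames[m][k] = X[k, m]: the N-point DFT of the windowed piece
    of x starting at sample m * L."""
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]  # Hann
    frames = []
    for start in range(0, len(x) - N + 1, L):
        frame = [
            sum(x[start + n] * w[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
            for k in range(N)
        ]
        frames.append(frame)
    return frames

# Test tone with exactly 4 cycles per 32-sample window: energy sits in bin 4.
x = [math.cos(2 * math.pi * 4 * n / 32) for n in range(128)]
frames = stft(x, N=32, L=16)
peak_bins = [max(range(17), key=lambda k: abs(f[k])) for f in frames]
```

Each frame reports the same peak bin, as expected for a stationary sinusoid; for speech the peak pattern changes from frame to frame.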
The Spectrogram
• Plot STFT $|X[k, m]|$ as a grayscale image:
[Figure: waveform and its spectrogram, freq / Hz (0 to 4000) vs. time / s, with grayscale intensity / dB from -50 to 10]
Time-frequency tradeoff
• Longer window w[n] gains frequency resolution at cost of time resolution
[Figure: two spectrograms of the same speech, freq / Hz (0 to 4000) vs. time / s (1.4 to 2.6):]
- Window = 256 pt: “Narrowband”
- Window = 32 pt: “Wideband”
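The tradeoff can be put in numbers: an N-point window spans N/fs seconds and its DFT bins are fs/N Hz apart. The sketch assumes fs = 8000 Hz, inferred from the 0 to 4000 Hz frequency axis; the function name is hypothetical.

```python
def window_resolution(num_points, fs):
    """Return (window duration in ms, DFT bin spacing in Hz) for an
    N-point window at sampling rate fs."""
    duration_ms = 1000.0 * num_points / fs
    bin_spacing_hz = fs / num_points
    return duration_ms, bin_spacing_hz

fs = 8000  # assumed sampling rate, Hz
narrowband = window_resolution(256, fs)  # long window: fine in frequency
wideband = window_resolution(32, fs)     # short window: fine in time
```

Under this assumption the 256-point window covers 32 ms with 31.25 Hz bin spacing, while the 32-point window covers 4 ms with 250 Hz bin spacing.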
Speech sounds on the Spectrogram
• Most popular speech visualization
• Wideband (short window) better than narrowband (long window) to see formants
[Figure: spectrogram, freq / Hz (0 to 4000) vs. time / s (1.4 to 2.6), with word labels (“watch thin as a dime”) and regions marked:]
- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: transition (“watch”)
- Stop: transient (“dime”)
TSM with the Spectrogram
• Just stretch out the spectrogram?
- how to resynthesize? spectrogram is only $|Y[k, m]|$
[Figure: spectrogram and its time-stretched version, Frequency (0 to 4000 Hz) vs. Time (0 to 1.4 s)]
The Phase Vocoder
• Timescale modification in the STFT domain
• Magnitude from ‘stretched’ spectrogram:
$|Y[k, m]| = \left|X\left[k, \frac{m}{r}\right]\right|$
- e.g. by linear interpolation
• But preserve phase increment between slices:
$\dot{\theta}_Y[k, m] = \dot{\theta}_X\left[k, \frac{m}{r}\right]$
- e.g. by discrete differentiator
• Does right thing for single sinusoid
- keeps overlapped parts of sinusoid aligned
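The two phase vocoder rules can be sketched on a list of STFT frames: magnitudes are read at the fractional position m/r by linear interpolation, while each bin's phase is advanced by the increment observed between the neighbouring input slices. This is an illustrative sketch with hypothetical names, taking a plain first difference as the discrete differentiator; resynthesis by inverse DFT and overlap-add is left out.

```python
import cmath

def phase_vocoder_frames(X, r):
    """X[m][k] are STFT frames (at least two). Return about r times as many
    frames, with interpolated magnitudes and accumulated phase increments."""
    num_bins = len(X[0])
    num_out = int((len(X) - 1) * r) + 1
    phase = [cmath.phase(c) for c in X[0]]   # start from the first slice
    Y = []
    for m in range(num_out):
        pos = m / r                          # fractional input frame index
        i = min(int(pos), len(X) - 2)
        frac = pos - i
        frame = []
        for k in range(num_bins):
            mag = (1 - frac) * abs(X[i][k]) + frac * abs(X[i + 1][k])
            frame.append(cmath.rect(mag, phase[k]))
            # First difference of the input phase between slices i and i + 1
            phase[k] += cmath.phase(X[i + 1][k]) - cmath.phase(X[i][k])
        Y.append(frame)
    return Y

# One bin rotating by 0.7 rad per slice, as a single sinusoid would.
X = [[cmath.exp(1j * 0.7 * m)] for m in range(5)]
Y = phase_vocoder_frames(X, 2.0)
steps = [cmath.phase(Y[m][0] / Y[m - 1][0]) for m in range(1, len(Y))]
```

The output has roughly twice as many slices, yet each still advances by 0.7 rad, so the overlapped parts of the sinusoid stay aligned when the frames are resynthesized.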
General issues in TSM
• Time window
- stretching a narrowband spectrogram
• Malleability of different sounds
- vowels stretch well, stops lose nature
• Not a well-formed problem?
- want to alter time without frequency ... but time and frequency are not separate!
- ‘satisfying’ result is a subjective judgement
→ solution depends on auditory perception...
Summary
• Information in sound
- lots of it, multiple levels of abstraction
• Course overview
- survey of audio processing topics
- practicals, readings, project
• DSP review
- digital signals, time domain
- Fourier domain, STFT
• Timescale modification
- properties of the speech signal
- time-domain
- phase vocoder