Study of the Improved Normalization Techniques of Energy-Related Features for Robust Speech Recognition

Chi-an Pan
Dept. of Electrical Engineering, National Chi Nan University, Taiwan
s95323544@ncnu.edu.tw

Wen-hsiang Tu
Dept. of Electrical Engineering, National Chi Nan University, Taiwan
aero3016@ms45.hinet.net

Jeih-weih Hung
Dept. of Electrical Engineering, National Chi Nan University, Taiwan
jwhung@ncnu.edu.tw
Abstract

The rapid development of speech processing techniques has led to their successful use in a growing range of applications, such as automatic dialing, voice-based information retrieval, and identity authentication. However, unexpected variations in speech signals degrade the performance of a speech processing system and thus limit its range of application. Among these variations, the environmental mismatch caused by noise embedded in the speech signal is the major concern of this paper. We provide a more rigorous mathematical analysis of the effects of additive noise on two energy-related speech features, i.e., the logarithmic energy (logE) and the zeroth cepstral coefficient (c0). Based on these effects, we propose a new feature compensation scheme, named silence feature normalization (SFN), to improve the noise robustness of these two features for speech recognition. It is shown that, despite its simplicity of implementation, SFN brings about a very significant improvement in noisy speech recognition and performs better than many well-known feature normalization approaches. Furthermore, SFN can be easily integrated with other noise robustness techniques to achieve even better recognition accuracy.
Keywords: speech recognition, logarithmic energy feature, zeroth cepstral coefficient, robust speech features
ԫΕፃᓵ २ࠐڣઝݾ୶ຒΔਢ೯ଃᙃᢝսਢԫઌᅝڶਗᖏࢤऱᓰᠲΖຏ
ԫ೯ଃᙃᢝߓอڇլڇ؆ᠧಛեឫऱઔߒᛩቼՀΔຟאױᛧᄕऱᙃᢝய
౨ΔૉਢᚨشኔᎾऱᛩቼխΔߓอᙃᢝய౨ঞຏᄎՕΔຍਢኔᛩ
ቼխڍऱฆࢤ(variation)ࢬᐙΖଃᙃᢝऱฆࢤጟᣊڍΔڕࠏಝᒭᛩቼፖᇢᛩቼڇژऱᛩቼլ(environmental mismatch)Εฆ(speaker variation)אଃऱฆ(pronunciation variation)ΖኙᛩቼլߢΔઌᣂऱᑇױᄗฃՀ٨ႈᣊীΚࢤګףᠧಛ(additive noise)Εኹᗨࢤᠧಛ(convolutional noise)אᐈऱ(bandwidth limitation)ΖቹԫଃಛᇆᠧಛեឫհقრቹΖʳ
ᖲ ဩሐ
ଃಛᇆ
ᠧಛࢤګף ኹᗨࢤᠧಛ
ᠧಛଃಛᇆ
ʳቹԫΕଃᠧಛեឫհقრቹʳ
ʳ
ᠲΔ൶ಘऱΔڂᠧಛࢤګףխऱऱᛩቼլ༽ࢬՂ૪אᓵਢء
ଚᆖૠጩଃݺᨏழΔޡࢼᐛᑇڇΖᠧಛኙଃᙃᢝऱᐙࢤګףཚലא
ऱ౨ၦଖᐛհԫΙᖕመװऱਐ[2][1]נΔଃಛᇆऱ౨ၦᐛ(energy feature)ᖕՂ૪౨ၦאࢬΖ౨ၦᐛऱૠጩᓤᠧ৫ৰᐛΔऱᙃᢝᇷಛՕመץࢬ
ᐛऱᚌႨΔءڇᓵխΔݺଚኙൎݾࢤאףΕಘᓵፖ୶Ζʳ
२ࠐڣΔڶפګڍऱൎࢤኙᑇ౨ၦᐛ(logarithmic energy logE)ऱݾઌᤉ༼֏ऄإΔኙᑇ౨ၦ೯ኪᒤڕࠏΔנ (log-energy dynamic range normalization LEDRN)[3]ؾᑑਢࠌಝᒭፖᇢऱଃᇷறኙᑇ౨ၦଖհ೯ኪᒤԫી֏Ιኙᑇ౨ၦ৫ૹऄ(log-energy rescaling normalization LERN)[4]ঞਢലኙᑇ౨ၦᐛଊՂԫଡտ 0ፖ 1ऱᦞૹଖΔᇢቹૹ৬נଃऱኙᑇ౨ၦᐛΙءኔ٣ছࢬ༼ॺܒ֏ऄ(silence energy normalization SLEN)[5]Δਢലإऱᙩଃଃኙᑇ౨ၦנଃଃ(non-speech frame)ऱኙᑇ౨ၦᐛԫᄕଖऱᑇΖՂ૪ऱԿጟऄΔઃႜٻലॺଃຝऱኙᑇ౨ၦᑇଖᓳΔࠀലଃຝऱኙᑇ౨ၦଖঅլΙ
ᠧ୲ޓऱຝຏᄎ౨ၦለऱຝਢԫଃᐛխΔ౨ၦለڂऱ
ಛऱᐙΖءᓵࠉᖕছԳࢬऱޏאףΔಾኙଃಛᇆ౨ၦઌᣂऱᐛڕ
Δጠψᙩݾԫᄅऱൎנ༽ࠀΔאףለᣤऱᑇᖂᓵאᠧಛᐙΔ۶
ଃᐛإ֏ऄω (silence feature normalization SFN)Δڼऄڶאױயࢤګףچᠧಛኙଃ౨ၦઌᣂᐛऱեឫΔ༼ߓอऱᙃᢝய౨Ζ
ᠧಛᐙऱயലኙ౨ၦઌᣂᐛଚ٣ݺรԲխΔڇՀΚڕᓵᆏᄗء
ᚨΔԫޡऱፖ൶ಘΔ൷ထտฯءᓵࢬᄅ༼נऱհᙩଃᐛإ֏ऄ(SFN)Ι รԿץԱٺጟಾኙ౨ၦઌᣂᐛհݾऱଃᙃᢝኔᑇᖕઌᣂಘᓵΔ խ
ೈԱտฯଃᙃᢝኔᛩቼ؆Δਢေ۷ᙩଃᐛإ֏ऄऱய౨Δࠀፖהऄ
ለΔ៶ڼᢞݺଚנ༽ࢬᄅऄ౨ڶய༼౨ၦઌᣂᐛڇᠧಛᛩቼՀऱൎࢤΖڇ
รխΔݺଚቫᇢലࢬ༼ऱᄅऄٽऱൎࢤᐛݾΔኙڼᣊऱٽᙃᢝ
ኔࢬऱᙃᢝאף൶ಘፖ Δאᢞݺଚנ༽ࢬऱᙩଃᐛإ֏ऄਢܡፖ
୶ඨΖࠐآᓵᓵፖءΖรնঞࢤګףऱړߜڶݾ ԲΕᙩଃᐛإ֏ऄ
ଈ٣ΔݺଚڇรԫᆏխΔಾኙଃ౨ၦઌᣂᐛΚኙᑇ౨ၦ(logarithmic energy logE)ፖรሿፂଙᢜএᑇ(c0)ᛩቼᠧಛեឫऱฆለԵऱᨠኘፖ൶ಘΔ൷ထڇรԲᆏխΔݺଚᖕຍࠄΔ༼נᙩଃᐛإ֏ऄऱᄅൎݾࢤΖʳ
ΰԫαኙᑇ౨ၦᐛรሿፂଙᢜᐛএᑇࢤګףᠧಛեឫհऱ൶ಘ
౨ၦઌᣂᐛ(logEᠧಛኙࢤګף ፖ c0)ທګऱயᚨطױቹԲנጤଧΖቹԲ(a)Ε(b)ፖ(c)قԫଃಛᇆʻAurora-20ᇷறխऱMAH_1390AᚾʼऱݮቹΕኙᑇ౨ၦ(logE)ڴᒵቹፖรሿፂଙᢜᐛএᑇ(c0)ڴᒵቹΙ(b)ፖ(c)խદۥኔᒵΕጸۥဠᒵፖ៴ۥរᒵঞଃΕಛᠧ 15dBऱଃಛᠧ 5dBऱଃࢬኙᚨऱڴᒵΖطຍԿቹխΔאױৰچנΔڶڇଃڇژऱΔlogEፖ c0ᐛଖለՕΔለլ୲ᠧಛऱᐙ؈టΔᙟழՂՀᛯऱउለΙհΔڇ
ଃڇژऱΔᐛଖছ৵֏ለᒷΔᠧಛऱեឫ৵Δଖᄎৰچޏ
ڍΖ൷ՀࠐΔݺଚאለᣤऱᑇᖂᓵΔኙאՂጟ؈టאףፖ൶ಘΖ
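First, we examine the effect of additive noise on the logE feature. In a speech signal corrupted by additive noise (noisy speech), the signal of the $n$-th frame, $x_n[m]$, can be written as

$$x_n[m] = s_n[m] + d_n[m], \tag{2-1}$$

where $s_n[m]$ and $d_n[m]$ denote the clean speech and the noise in the $n$-th frame, respectively. Neglecting the cross term between speech and noise, the logE feature of this frame can be expressed as

$$E^{x}[n] = \log\sum_m x_n^2[m] \approx \log\Big(\sum_m s_n^2[m] + \sum_m d_n^2[m]\Big) = \log\big(\exp(E^{s}[n]) + \exp(E^{d}[n])\big), \tag{2-2}$$

where $E^{x}[n]$, $E^{s}[n]$ and $E^{d}[n]$ are the logE feature values of $x_n[m]$, $s_n[m]$ and $d_n[m]$, respectively. The difference in logE between the noisy speech and the clean speech caused by the noise, $\Delta E[n]$, is therefore

$$\Delta E[n] = E^{x}[n] - E^{s}[n] = \log\big(1 + \exp(E^{d}[n] - E^{s}[n])\big). \tag{2-3}$$

Equation (2-3) shows that, for a fixed noise energy $E^{d}[n]$, the difference $\Delta E[n]$ is inversely related to the clean-speech value $E^{s}[n]$: the larger $E^{s}[n]$ is, the smaller $\Delta E[n]$ becomes, and vice versa. It follows that, within a noisy utterance, the logE feature of frames that contain speech (larger $E^{s}[n]$) is less affected by the noise (smaller distortion $\Delta E[n]$) than that of noise-only frames (smaller $E^{s}[n]$).

Next, we examine the effect of additive noise on the modulation spectrum of the logE feature sequence. We first expand (2-2) in a Taylor series about the point $(E^{s}[n], E^{d}[n]) = (0, 0)$ up to second order, as shown in (2-4):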
Figure 2. Waveform and energy-related feature sequences of an utterance under different SNRs: (a) waveform, (b) logE feature contour, (c) c0 feature contour.
$$E^{x}[n] = \log\big(\exp(E^{s}[n]) + \exp(E^{d}[n])\big) \approx \log 2 + \frac{1}{2}\big(E^{s}[n] + E^{d}[n]\big) + \frac{1}{8}\big(E^{s}[n] - E^{d}[n]\big)^2. \tag{2-4}$$
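The second-order expansion in (2-4) can be checked numerically against the exact expression in (2-2). The following is a minimal sketch in plain Python; the function names are hypothetical and chosen only for illustration, and the sample values are arbitrary.

```python
import math


def log_energy_noisy(e_s, e_d):
    """Exact noisy logE from eq. (2-2): log(exp(E^s) + exp(E^d))."""
    m = max(e_s, e_d)  # log-sum-exp trick for numerical stability
    return m + math.log(math.exp(e_s - m) + math.exp(e_d - m))


def log_energy_taylor2(e_s, e_d):
    """Second-order Taylor expansion of (2-2) about (0, 0), as in eq. (2-4)."""
    return math.log(2.0) + 0.5 * (e_s + e_d) + 0.125 * (e_s - e_d) ** 2


def distortion(e_s, e_d):
    """logE distortion from eq. (2-3): log(1 + exp(E^d - E^s))."""
    return math.log1p(math.exp(e_d - e_s))


if __name__ == "__main__":
    # Compare exact value and approximation near the expansion point.
    for e_s, e_d in [(0.0, 0.0), (0.3, -0.2), (1.0, 0.5)]:
        print(e_s, e_d, log_energy_noisy(e_s, e_d), log_energy_taylor2(e_s, e_d))
    # With the noise level fixed, the distortion shrinks as E^s grows.
    print([round(distortion(e_s, 0.0), 4) for e_s in (0.0, 2.0, 4.0)])
```

The approximation is only expected to be accurate near the expansion point; far from (0, 0) the quadratic term overestimates the exact value.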
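Taking the Fourier transform of (2-4), the modulation spectrum of the noisy-speech logE sequence $E^{x}[n]$ can be expressed as

$$X(j\omega) \approx 2\pi(\log 2)\,\delta(\omega) + \frac{1}{2}\big(S(j\omega) + D(j\omega)\big) + \frac{1}{16\pi}\big(S(j\omega)*S(j\omega) + D(j\omega)*D(j\omega) - 2\,S(j\omega)*D(j\omega)\big), \tag{2-5}$$

where $X(j\omega)$, $S(j\omega)$ and $D(j\omega)$ are the modulation spectra of the noisy-speech logE sequence $E^{x}[n]$, the clean-speech logE sequence $E^{s}[n]$ and the noise logE sequence $E^{d}[n]$, respectively, and $*$ denotes convolution. Suppose $E^{s}[n]$ and $E^{d}[n]$ are both low-pass signals with bandwidths $B_s$ and $B_d$. Then the terms $D(j\omega)*D(j\omega)$ and $S(j\omega)*D(j\omega)$ in (2-5) have bandwidths $2B_d$ and $B_s + B_d$, respectively. This implies that the logE sequence of the noisy speech, $E^{x}[n]$, has a larger bandwidth than the logE sequence of the noise, $E^{d}[n]$. In other words, in terms of the logE sequence, noisy speech contains more high modulation-frequency components than noise, which explains why the portions of a noisy utterance that contain speech fluctuate more than the noise-only portions.

Next, we examine the effect of additive noise on the c0 feature. Let $c_0^{x}[n]$ denote the c0 feature value of the $n$-th frame of the noisy speech, and let $c_0^{s}[n]$ and $c_0^{d}[n]$ denote the c0 feature values of the clean speech and the noise in that frame. We can then derive the following three equations: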
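$$c_0^{x}[n] = \sum_k \log M^{x}[k,n] \approx \sum_k \log\big(M^{s}[k,n] + M^{d}[k,n]\big), \tag{2-6}$$

$$c_0^{s}[n] = \sum_k \log M^{s}[k,n], \tag{2-7}$$

$$c_0^{d}[n] = \sum_k \log M^{d}[k,n], \tag{2-8}$$

where $M^{x}[k,n]$, $M^{s}[k,n]$ and $M^{d}[k,n]$ are the outputs of the $k$-th mel filter obtained when the noisy speech $x_n[m]$, the clean speech $s_n[m]$ and the noise $d_n[m]$ of (2-1) are converted into mel-frequency cepstral features. From these equations, the difference in c0 between the noisy speech and the clean speech caused by the additive noise, $\Delta c_0[n]$, is

$$\Delta c_0[n] = c_0^{x}[n] - c_0^{s}[n] \approx \sum_k \log\Big(1 + \frac{M^{d}[k,n]}{M^{s}[k,n]}\Big) = \sum_k \log\Big(1 + \frac{1}{\mathrm{SNR}[k,n]}\Big), \tag{2-9}$$

where $\mathrm{SNR}[k,n]$ is defined as the signal-to-noise ratio of the $k$-th mel band in the $n$-th frame:

$$\mathrm{SNR}[k,n] = \frac{M^{s}[k,n]}{M^{d}[k,n]}. \tag{2-10}$$

Equation (2-9) shows that when $\mathrm{SNR}[k,n]$ is large in most mel bands, the difference $\Delta c_0[n]$ is correspondingly small. This roughly explains why the c0 feature of frames containing speech (larger SNR) is less affected by noise than that of noise-only frames (smaller SNR).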
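Next, we examine the effect of additive noise on the modulation spectrum of the c0 feature sequence. To simplify the derivation, we rewrite (2-6), (2-7) and (2-8) as

$$c_0^{x}[n] = \sum_k \tilde{M}^{x}[k,n] \approx \sum_k \log\big(\exp(\tilde{M}^{s}[k,n]) + \exp(\tilde{M}^{d}[k,n])\big), \tag{2-11}$$

$$c_0^{s}[n] = \sum_k \tilde{M}^{s}[k,n], \tag{2-12}$$

$$c_0^{d}[n] = \sum_k \tilde{M}^{d}[k,n], \tag{2-13}$$

where $\tilde{M}^{x}[k,n] = \log M^{x}[k,n]$, $\tilde{M}^{s}[k,n] = \log M^{s}[k,n]$ and $\tilde{M}^{d}[k,n] = \log M^{d}[k,n]$. Comparing (2-11) with (2-2) shows that the relation among noisy speech, clean speech and noise is very similar for logE and c0. Hence, by the derivation of (2-4) and (2-5) for the modulation spectrum of the logE sequence, the log output sequence of each mel filter, $\tilde{M}^{x}[k,n]$, still has a larger bandwidth than $\tilde{M}^{d}[k,n]$, which means that $c_0^{x}[n]$ has a larger bandwidth than $c_0^{d}[n]$. As with logE, the c0 sequence of noisy speech therefore contains more high modulation-frequency components than the c0 sequence of noise, i.e., it fluctuates more strongly.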
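Figures 3(a) and 3(b) show the power spectral density (PSD) curves of the logE and c0 features of an utterance. The speech and the noise are the Aurora-2.0 utterance FAC_5Z31ZZ4A and babble noise, with an SNR of 15 dB. The figures clearly show that, relative to the noise, the logE and c0 sequences of the noisy speech both have a larger bandwidth, which confirms the derivation above.

Figure 3. Power spectral density of the energy-related features: (a) logE feature, (b) c0 feature.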
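Summarizing the derivations and examples above, we have shown that, in a noisy utterance, the logE and c0 features of frames containing speech are less distorted than those of noise-only frames and have a larger bandwidth, i.e., stronger fluctuation. Based on these observations, we propose a new robust speech feature technique, silence feature normalization (SFN), which has two modes described in the following subsections.

(2) Silence feature normalization I (SFN-I)

In this subsection we introduce the first mode, silence feature normalization I (SFN-I). It improves on silence log-energy normalization (SLEN) [5]: for the energy-related features logE and c0, the feature values of the non-speech portions of a signal are normalized while those of the speech portions are left unchanged, so as to approximately reconstruct the energy-related features of the clean speech.

First, let $x[n]$ be the logE or c0 feature sequence of a noisy utterance. According to the conclusion of the previous subsection, the logE and c0 sequences of the speech portions of noisy speech contain more high modulation-frequency components than those of noise. We therefore design a high-pass infinite impulse response (IIR) filter to process this sequence, with transfer function

$$H(z) = \frac{1}{1 + \alpha z^{-1}}, \quad 0 < \alpha < 1. \tag{2-14}$$

The input-output relation of this filter is

$$y[n] = -\alpha\, y[n-1] + x[n], \tag{2-15}$$

where $y[n]$ is the filter output and the initial value is set to $y[0] = 0$. The magnitude response of the filter in (2-14) is shown in Figure 4. The filter suppresses the near-DC components of the feature sequence and emphasizes the higher-frequency components, which highlight the difference between speech and noise. The filtered sequence $y[n]$ is therefore better suited than the original sequence $x[n]$ for discriminating speech from non-speech.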
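As an illustration of the filtering step, the sketch below implements a first-order high-pass IIR recursion of the form y[n] = -alpha * y[n-1] + x[n] and evaluates its magnitude response. It is a minimal sketch under stated assumptions: the function names are hypothetical, alpha = 0.5 follows the value quoted in the Figure 4 caption, and a zero initial state before the first sample is assumed.

```python
import cmath
import math


def highpass_iir(x, alpha=0.5):
    """Apply y[n] = -alpha * y[n-1] + x[n] with zero initial state (assumed)."""
    y = []
    prev = 0.0
    for sample in x:
        prev = -alpha * prev + sample
        y.append(prev)
    return y


def magnitude_response(omega, alpha=0.5):
    """|H(e^{j omega})| for H(z) = 1 / (1 + alpha * z^-1)."""
    return 1.0 / abs(1.0 + alpha * cmath.exp(-1j * omega))


if __name__ == "__main__":
    # Gain is smallest at DC and largest at the Nyquist frequency (omega = pi).
    for omega in (0.0, math.pi / 2, math.pi):
        print(round(omega, 3), round(magnitude_response(omega), 4))
    # A constant (DC) input is attenuated toward 1 / (1 + alpha).
    print(highpass_iir([1.0] * 6))
```

With alpha = 0.5 the gain rises from 1/1.5 at DC to 2 at omega = pi, so slowly varying (noise-like) feature contours are attenuated relative to rapidly fluctuating ones.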
Figure 4. Magnitude response of the high-pass filter in (2-14) ($\alpha = 0.5$).
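Using the sequence $y[n]$ obtained from (2-15), we classify the frames of a signal into speech and non-speech frames and normalize the non-speech frames. This is silence feature normalization I (SFN-I), defined as

$$\tilde{x}[n] = \begin{cases} x[n], & \text{if } y[n] > \theta, \\ \log(\varepsilon) + \delta, & \text{if } y[n] \le \theta, \end{cases} \tag{2-16}$$

where $\theta$, $\varepsilon$ and $\delta$ are, respectively, a threshold, a very small positive number, and a random number with zero mean and very small variance, and $\tilde{x}[n]$ is the new feature obtained after SFN-I. The threshold $\theta$ is computed as

$$\theta = \frac{1}{N}\sum_{n=1}^{N} y[n], \tag{2-17}$$

where $N$ is the number of frames in the utterance. The threshold is thus the mean of all $y[n]$ in the utterance; it is very simple to compute and needs no additional design.

According to (2-16), if $y[n]$ is greater than the threshold $\theta$, the corresponding frame is judged to be a speech frame and its feature is kept unchanged; otherwise the frame is classified as non-speech and its feature is normalized to a very small random number. Compared with the earlier silence log-energy normalization (SLEN) [5], SFN-I avoids normalizing the non-speech features to a constant value, which could cause zero-variance errors in the acoustic models trained afterwards.

The effect of SFN-I can be observed in Figure 5. In Figure 5, (a) and (b) are the original logE and c0 feature contours, and (c) and (d) are the logE and c0 feature contours after SFN-I. The red solid line corresponds to the clean speech (Aurora-2.0 utterance FAK_3Z82A), and the green dashed line and the blue dotted line correspond to noisy speech at SNRs of 15 dB and 5 dB. The figures show that, after SFN-I, the energy-related feature values are closer to those of the original clean speech, which achieves the goal of reducing the distortion.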
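The SFN-I procedure described above (high-pass filtering, a mean-valued threshold, and replacement of non-speech frames by a small random value) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function name, the values of `eps` and `jitter_std`, the Gaussian form of the random term, and the zero initial filter state are all assumptions made for the example.

```python
import math
import random


def sfn1(x, alpha=0.5, eps=1e-5, jitter_std=1e-3, rng=None):
    """Sketch of SFN-I on one utterance's logE or c0 sequence x.

    eps and jitter_std are hypothetical values; the text only states that
    they are "very small".
    """
    rng = rng or random.Random(0)

    # Step 1: high-pass filter, y[n] = -alpha * y[n-1] + x[n].
    y = []
    prev = 0.0
    for value in x:
        prev = -alpha * prev + value
        y.append(prev)

    # Step 2: threshold = mean of the filtered sequence over the utterance.
    theta = sum(y) / len(y)

    # Step 3: keep speech frames, normalize non-speech frames.
    normalized = []
    for value, filtered in zip(x, y):
        if filtered > theta:
            normalized.append(value)  # speech frame: unchanged
        else:
            # non-speech frame: small value plus zero-mean jitter
            normalized.append(math.log(eps) + rng.gauss(0.0, jitter_std))
    return normalized, theta


if __name__ == "__main__":
    # Toy contour: low-energy frames, a high-energy burst, low-energy frames.
    contour = [0.0] * 5 + [10.0] * 5 + [0.0] * 5
    new_contour, theta = sfn1(contour)
    print(round(theta, 4))
    print([round(v, 3) for v in new_contour])
```

The random jitter keeps the normalized non-speech values from being exactly constant, which is the property the text relies on to avoid zero-variance acoustic models.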
ΰԿαᙩଃᐛإ֏ऄ II (silence feature normalization II SFN-II)
֏إ֏ऄΔጠհψᙩଃᐛإऱᙩଃᐛڤଚലտฯรԲጟᑓݺᆏխΔءڇ
ऄ IIω (silence feature normalization II SFN-II)ΔSFN-IIऄፖছԫᆏհ SFN-IऄՕऱฆڇΔSFN-II ਢല౨ၦઌᣂᐛ x n ଊՂԫᦞૹଖ(weight)Δᄅᐛଖx n ΖSFN-IIऱዝጩऄڕՀقࢬڤΚ
SFN-II x n w n x n (18-2)ڤ խΔ
1
2
1 1 exp if
if 1 1 exp
y n y n
w ny n
y n
Δ (19-2)ڤ
ቹնΕᙩଃᐛإ֏ऄ Iছ((a)ፖ(b))ፖ৵((c)ፖ(d))౨ၦઌᣂᐛڴ٨ݧᒵቹΔխ(a)ፖ(c) logEᐛڴ٨ݧᒵΔ(b)ፖ(d) c0ᐛڴ٨ݧᒵ
խy n Δقࢬ(15-2)ڤছԫᆏհڕ x n ຏመԫຏៀᕴհᙁנଖΔ ាଖΕ
1ፖ
2 y n y n (Օាଖ հڶࢬऱy n א( y n y n (
ាଖࢨ հڶࢬऱy n ኙᚨհᑑᄷΕࢬ( ԫᑇΖSFN-IIհាଖ ᇿ SFN-IઌٵΔૠጩڕڤՀقࢬΚ
1
1
N
n
N y n (20-2)ڤ
խNڤ ڼଃխଃᑇΖ ऱᦞૹଖw(19-2)ڤ n Δխقࢬቹքڕ 0Ε
11Ε
2א3 01Ζ
ᑇwࠤΔᦞૹଖאױቹքط n ԫଡլኙጠհᎠᏺऱ S ᒵ(sigmoidڴݮcurve)Δଖտ ࡉ0 1հΖڼᦞૹଖזࢬऱრᆠፖ SFN-IऄઌۿΔݺଚݦඨᄅऱ౨ၦઌᣂᐛx n ౨ڇࡨᐛଖৰՕழΔᕣၦፂլΙࡨଖለழΔঞࠌ
ޓΖSFN-IIऄࡉ SFN-IऄլٵհڇΔSFN-IIऄڶຌڤऱଃጤរೠ(soft-decision VAD)Δ SFN-I ऄঞڤऱଃጤរೠ(hard-decision VAD)Ιڼڂ SFN-II ऄઌለ SFN-I ऄߢΔ VAD ࠐ౨ઌኙױᙑᎄऱᐙܒለΔய౨ՈᄎለړΔຍංუലᄎڇհ৵ऱᆏᢞΖ
ቹքΕᦞૹଖࠤᑇ [ ]w n რቹقᒵڴ
ቹԮ SFN-IIऄছፖ৵౨ၦઌᣂᐛհڴᒵቹΖፖհছऱቹԿᣊۿΔ(a)ፖ(b)
ࡨऱ logE ᐛא٨ݧ c0 ᐛڴ٨ݧᒵΙ(c)ፖ(d)ᆖመᙩଃᐛإ֏ऄ II ৵ࢬհ logE א٨ݧ c0 ኔᒵਢኙᚨଃۥᒵΔխદڴ٨ݧ(Aurora-20 ᇷறխऱFAK_3Z82Aᚾ)Εጸۥဠᒵፖ៴ۥរᒵঞኙᚨಛᠧ15dBፖ 5dBऱᠧಛଃΖৰچΔᆖط SFN-IIመ৵հᠧಛଃऱ౨ၦઌᣂᐛΔઃᣊۿ SFN-Iऄऱய Δޓאױ२ࡨଃհᐛΔڶயᠧಛທګऱ؈టΖ
ቹԮΕᙩଃᐛإ֏ऄ IIছ((a)ፖ(b))ፖ৵((c)ፖ(d))౨ၦઌᣂᐛڴ٨ݧᒵቹΔխ(a)ፖ(c) logEᐛڴ٨ݧᒵΔ(b)ፖ(d) c0ᐛڴ٨ݧᒵ
ԿΕ౨ၦઌᣂᐛݾհኔፖಘᓵ ΰԫαΕଃᇷற១տ
ᄎ(EuropeanऱଃᇷறᑛሽॾᑑᄷشࠌࢬᓵխऱଃᙃᢝኔءTelecommunication Standard Institute ETSI)ऱ Aurora-20ற[7]Ζਢԫ៶طᠧಛΔࢤګףՂԶጟףՖΔߊڣګΔભഏڗڗᙕ፹ऱຑᥛᑇڤԳՠऱא
چՀᥳΕԳᜢΕΕ୶ᥦ塢Ε塊ᨚΕဩሐΕᖲΕ־Δאլٵ৫ऱ
ಛᠧΔ 20dBΕ15dBΕ10dBΕ5dBΕ0dB5-אdBΔॵףՂ(clean)றΖ ΰԲαΕᐛᑇऱፖᙃᢝߓอऱಝᒭ
ᓵᖕء Aurora-20ኔறᑑᄷ[7]Δଃᐛᑇਢشࠌමዿଙᢜএᑇ(mel-frequency cepstral coefficients MFCC)౨ၦઌᣂᐛΔॵףՂԫၸၦፖԲၸၦΖԱ౨ၦઌᣂᐛऱᐙΔءᓵխආشլٵऱᐛᑇΙรԫ
ਢ 12 ፂමዿଙᢜᐛଖ(c1Дc12)ףՂ 1ፂऱኙᑇ౨ၦ(logE)Δԫঞਢشࠌ 12ፂමዿଙᢜᐛଖ(c1Дc12)ףՂรሿፂଙᢜᐛএᑇ(c0)Ιޢઃᄎף٦ՂԫၸፖԲၸၦΔਚઃشԱ 39ፂऱᐛᑇΖᇡาऱᐛᑇΔڕԫقࢬΖ
شܓଚݺ HTKࠐ[8]ڤಝᒭᜢᖂᑓীΔขسԱ 11(oh zero one~nine)ଡᑇڗᑓীא
ᙩଃᑓীΔޢଡᑇڗᑓীץ 16ଡणኪ(states)Δޢଡणኪਢط 20ଡཎയ৫ࠤᑇٽ(Gaussian mixtures)ګࢬΖ
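Table 1. Speech feature parameters used in this paper

Sampling rate          8 kHz
Frame size             25 ms (200 samples)
Frame shift            10 ms (80 samples)
Pre-emphasis filter    1 - 0.97 z^{-1}
Window                 Hamming window
FFT size               256 points
Filters                mel-scaled triangular filters, 23 in total
Feature vector
  Set 1                {c_i}, {Δc_i}, {Δ²c_i}, 1 ≤ i ≤ 12; logE, ΔlogE, Δ²logE; 39 dimensions in total
  Set 2                {c_i}, {Δc_i}, {Δ²c_i}, 1 ≤ i ≤ 12; c0, Δc0, Δ²c0; 39 dimensions in total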
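To make the framing parameters of Table 1 concrete, the sketch below splits a signal sampled at 8 kHz into 200-sample frames with an 80-sample shift and computes a per-frame logarithmic energy. It is a simplified illustration under stated assumptions: the function names and the energy floor are hypothetical, and whether logE is taken before or after pre-emphasis and windowing in the actual front end is not specified here, so the energy is computed on the raw frame.

```python
import math

FRAME_SIZE = 200    # 25 ms at 8 kHz
FRAME_SHIFT = 80    # 10 ms at 8 kHz
PREEMPH_COEFF = 0.97


def pre_emphasize(signal, coeff=PREEMPH_COEFF):
    """Apply the pre-emphasis filter 1 - coeff * z^-1 (first sample kept as is)."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[i] - coeff * signal[i - 1] for i in range(1, len(signal))
    ]


def frame_log_energies(signal, floor=1e-10):
    """Return log(sum of squared samples) for each full frame of the signal."""
    energies = []
    for start in range(0, len(signal) - FRAME_SIZE + 1, FRAME_SHIFT):
        frame = signal[start:start + FRAME_SIZE]
        energy = sum(sample * sample for sample in frame)
        energies.append(math.log(max(energy, floor)))  # floor avoids log(0)
    return energies


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1000)]
    print(len(frame_log_energies(tone)))
    print([round(v, 3) for v in frame_log_energies(tone)[:3]])
```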
ΰԿαଃᙃᢝኔ ለࠀऱଃᙃᢝΔݾࢤጟಾኙ౨ၦઌᣂᐛհൎٺଚലചݺຍԫᆏխΔڇ
ய౨ΖೈԱݺଚࢬᄅ༼נऱᙩଃᐛإ֏ऄΰSFN-IፖSFN-IIα؆ΔݺଚٵழኔԱፖฆᑇإ֏ऄ(mean and variance normalization MVN)[9]Ε ፖฆᑇإ
֏ॵףARMAៀᕴऄ(MVN plus ARMA filtering MVA)[10]Εอૠቹ֏ऄ(histogram equalization HEQ)[11]Εኙᑇ౨ၦ೯ኪᒤإ֏ऄ (log-energy dynamic range normalization LEDRN)[3]Εኙᑇ౨ၦ৫ૹऄ (log-energy rescaling normalization LERN)[4]ፖᙩଃኙᑇ౨ၦإ֏ऄ(silence log-energy normalization SLEN)[5]ΔଖრऱਢΔࡨհMVNΕMVAፖHEQԿऄឈਢૠڶࢬጟᣊऱᐛՂΔݺଚԱေ۷Δ؆֏ՂΔإlogEፖc0ᐛऱشലଚሎຍᇙڇ౨ၦઌᣂᐛऱய౨ΔڇLEDRNऄڶᒵࢤፖॺᒵࢤጟΔڇຍᇙݺଚאLEDRN-IፖLEDRN-IIقΔLERNڶጟठءΔݺଚאLERN-IፖLERN-IIقΖ 1Εಾኙኙᑇ౨ၦᐛ(logE)հൎڤଃݾጵٽ
ܛଃᐛছ૪հรԫऱᐛᑇΔشࢬᆏհኔڼ 12 ፂමዿଙᢜᐛଖ(c1Дc12)ףՂ 1 ፂऱኙᑇ౨ၦ(logE)ΔॵףԫၸፖԲၸၦΔ٥ 39 ፂΖຍᇙऄΔઃਢࢤऱԼጟᐛൎشࢬ logEᐛΔլەᐞ 12ፂऱමዿଙᢜএᑇΔԲנ٨ԱഗኔຍԼጟऄࢬհᙃᢝΰ20dBΕ15dBΕ10dBΕ5dBፖ 0dBնጟಛᠧՀऱᙃᢝαΔխ ARፖ RRઌለഗհኙᙑᎄ(absolute error rate reduction)ࡉઌኙᙑᎄ(relative error rate reduction)ΖൕԲऱᑇᖕΔݺଚױᨠኘՀ٨រΚ Ϥ1 ڶࢬشࡨጟᣊᐛհ MVNΕMVA ፖ HEQ ऄش logE ᐛழΔ
ڶഗኔΔயՈԼΔઌለޏऱࠎ༽ 1018Ε1170ፖ1497ऱᙃᢝ༼Ζઌኙ MVN طΔߢ MVA Աԫଡشࠌڍ ARMA ຏៀᕴאൎᓳଃऱګΔ HEQ ᠰ؆ኙଃᐛऱၸ೯(higher-order moments)إ֏ΔאࢬயઃMVNᝫࠐړΖ
Ϥ2 հಾኙנ༽ࢬא logE ᐛᇖᚍऱٺጟऄΚLEDRN-IΕLEDRN-IIΕLERN-IΕ LERN-II ፖ SLENΔຟ౨ࠐԼထऱᙃᢝ༼ΔխᒵࢤLEDRN(LEDRN-I)ᚌॺᒵࢤ LEDRN(LEDRN-II)ΔᙃᢝઌԱՕપ 4Δጟठءऱ LERN(LERN-Iፖ LERN-II)ΔயঞԼ൷२Δᚌ LEDRNΖءኔመנ༽ࢬװऱ SLENऄΔઌኙഗኔऱᙃᢝߢΔڶ 1519ऱ༼Δᚌհছࢬ༼հ LEDRNፖ LERNऄΖ
Ϥ3 ֏ऄΔSFN-Iإऱጟᙩଃᐛנ༽ࢬᓵء ፖ SFN-IIΔઌኙഗኔΔᙃᢝ༼Աߢ 1538ፖ 1611Δઌኙᙑᎄຟڇ ՂΔઌא50ለհছࢬ༼ऱٺጟऄΔSFN-Iፖ SFN-IIຟޓڶᚌฆऱΔڼᢞԱݺଚࢬ༼ऱଡᄅऄΔຟ౨ڶயچ༼ logE ᐛࢤګףڇᠧಛᛩቼՀऱൎࢤΔᚌؾছڍထټऱ logEᐛإ֏ݾΖڼ؆ΔݺଚՈΔSFN-IIࢬհᙃᢝ SFN-IړޓΔط૪Δࢬհছڕڂ౨ױڼ SFN-IIڇଃೠ(voice activity detection)ऱᖲፖSFN-IࠀլઌٵΔଃೠհᙑᎄڇ SFN-IIխઌኙᐙለΔࠌઌኙለࠋΖ
Table 2. Comparison of recognition accuracy (%) of the robustness techniques applied to the logE feature
Method            Set A    Set B    Average  AR       RR
(1) Baseline      71.98    67.79    69.89    -        -
(2) MVN           79.04    81.08    80.06    10.18    33.79
(3) MVA           80.53    82.64    81.59    11.70    38.85
(4) HEQ           83.91    85.79    84.85    14.97    49.69
(5) LEDRN-I       82.01    79.70    80.86    10.97    36.43
(6) LEDRN-II      77.21    75.53    76.37     6.49    21.53
(7) LERN-I        83.64    83.35    83.50    13.61    45.19
(8) LERN-II       82.71    81.94    82.33    12.44    41.31
(9) SLEN          84.87    85.27    85.07    15.19    50.42
(10) SFN-I        85.02    85.50    85.26    15.38    51.05
(11) SFN-II       85.67    86.32    86.00    16.11    53.49
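The AR and RR columns are consistent with the usual definitions of absolute and relative error rate reduction with respect to the baseline. The short sketch below, with a hypothetical function name, shows the arithmetic; small differences from the tabulated values come from rounding of the averaged accuracies.

```python
def error_rate_reduction(baseline_acc, method_acc):
    """Return (AR, RR) in percent for accuracies given in percent.

    AR: absolute error rate reduction = accuracy gain over the baseline.
    RR: relative error rate reduction = AR divided by the baseline error rate.
    """
    ar = method_acc - baseline_acc
    rr = 100.0 * ar / (100.0 - baseline_acc)
    return ar, rr


if __name__ == "__main__":
    # Baseline and SFN-II average accuracies for the logE feature set.
    ar, rr = error_rate_reduction(69.89, 86.00)
    print(round(ar, 2), round(rr, 2))
```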
2Εಾኙรሿፂଙᢜᐛএᑇ(c0)հൎڤଃݾጵٽ ܛଃᐛছ૪հรԲऱᐛᑇΔشࢬᆏհኔڼ 12 ፂමዿଙᢜᐛଖ(c1Дc12)ףՂรሿፂଙᢜᐛএᑇ(c0)ΔॵףԫၸፖԲၸၦΔ٥ 39 ፂΖᣊಾኙࡨଚലݺছԫᆏΔۿ logEᐛऱԼጟᐛൎࢤऄΔش c0ᐛՂΔ 12ፂऱමዿଙᢜএᑇঞፂլΖឈؾছऱਢ c0ᐛΔԱ១ದߠΔຍᇙݺଚլലٺءጟݾऱټጠଥޏΔڕࠏ LEDRNऄΔݺଚࠀլലټޏc0-DRNऄΔսऎټΔהऄټጠڼࠉᣊංΖ Կנ٨ԱഗኔຍԼጟऄࢬհᙃᢝΰ20dBΕ15dBΕ10dBΕ5dBፖ 0dBնጟಛᠧՀऱᙃᢝαΔխऱ AR ፖ RR ઌለഗኔհኙᙑᎄࡉઌኙᙑᎄΖൕԿऱᑇᖕΔݺଚױᨠኘՀ٨រΚ
Ϥ1 ᣊۿհছऱԲհΔٺጟऄش c0ᐛழΔຟ౨ࠐ༼ᙃᢝऱயΔխΔLEDRN-Iፖ LEDRN-IIऱהऄΔ ਢ LEDRN-IIΔڶ 357 հኙᙑᎄ(AR)Δױ౨ڇڂΔLEDRNءਢಾኙ logEᐛࢬૠΔૉݺଚऴ൷ലش c0ᐛՂΔشࠌࢬऱᑇࠀॺਢࠋ֏ΔᖄીயլኦΖ Ϥ2ԿጟڶࢬشءጟᣊᐛհऄΚMVNΕMVAፖ HEQऄΔش c0ᐛ
ழΔսא HEQړΔMVAऄڻհΔMVNऄለΔڼऱڕآࠀհছڇԲࠐΖڼ؆ΔLERN-IΕLERN-II ፖ SLEN ຟڶԼထऱޏயΔፖԲऱᑇᖕլٵհΔڇԿጟऄऱய౨Լ൷२Δ LERN-Iฃᚌ SLENΖ Ϥ3ءᓵנ༽ࢬऱጟᙩଃᐛإ֏ऄΔSFN-Iፖ SFN-IIΔઌኙഗኔߢΔᙃᢝ༼Ա 1379ፖ 1413Δઌኙᙑᎄપ 46ΔᣊۿԲऱΔSFN-IIսᚌ SFN-IΔຍጟऄհսᚌڶࢬהऱऄΖڼᢞԱݺଚࢬ༼ऱଡᄅऄΔ౨ڶயچ༼ c0ᐛࢤګףڇᠧಛᛩቼՀऱൎࢤΖ
Table 3. Comparison of recognition accuracy (%) of the robustness techniques applied to the c0 feature

Method            Set A    Set B    Average  AR       RR
(1) Baseline      71.95    68.22    70.09    -        -
(2) MVN           80.80    82.95    81.88    11.79    39.41
(3) MVA           81.76    84.04    82.90    12.82    42.84
(4) HEQ           82.89    84.59    83.74    13.66    45.65
(5) LEDRN-I       79.04    77.36    78.20     8.11    27.13
(6) LEDRN-II      74.08    73.22    73.65     3.57    11.92
(7) LERN-I        83.81    83.65    83.73    13.65    45.61
(8) LERN-II       83.03    82.53    82.78    12.70    42.44
(9) SLEN          82.94    84.28    83.61    13.53    45.21
(10) SFN-I        83.04    84.70    83.87    13.79    46.08
(11) SFN-II       83.29    85.14    84.22    14.13    47.23
ឈ SFNऄڶயچᠧಛኙ c0ທګऱ؈టΔ༼ᙃᢝΔᅝݺଚለԲፖԿழΔᓵਢ SFN-Iࢨ SFN-IIΔش logEᐛױऱᙃᢝᄎڇش c0 ᐛࢬհᙃᢝΙڼطΔݺଚංឰط logE ᐛࢬհ SFN-I ऄፖ SFN-II ऄխऱଃጤរೠ(VAD)Δױ౨ᄎط c0 ଚݺංუΔڼΖᖕړऱࠐࢬലࠐಾኙ c0 ᐛऱጟ SFN ऄଥޏΖ SFN-I խΔݺଚشܓ٣ logE ኙଃଃॺଃऱᣊΔ٦ലܒڼش c0ՂΔኙॺଃଃऱ c0(16-2)ڤڕհ֏Ιإ SFN-IIՈਢشܓઌٵऱڤΔشܓ٣ logEኙଃଃॺଃऱᣊΔ٦ലང c0 ՂΔࠀኙଃፖॺଃଃऱ c0 ᐛشࢬ(19-2)ڤޣ٨ݧऱᑑᄷ
1ፖ
2Δ৵(18-2)ڤհإ֏ΖݺଚലאՂऱଥإऄጠಾኙ c0
ᐛհଥڤإ SFN-Iऄ(modified SFN-I)ፖଥڤإ SFN-IIऄ(modified SFN-II)Ζ ಾኙ c0ᐛհଥڤإ SFN-Iऄፖଥڤإ SFN-IIऄΔࢬհᙃᢝڕࢬ
ڤإቃཚऱΔଥࢬଚݺڕΔق SFNऄઌኙࡨ SFNऄΔ౨ޓڶԫޡऱޏயΔኙ SFN-I ৵ᠰ؆༼ԱΔছઌለߢ 129ऱᙃᢝΔኙ SFN-II Δߢছઌለ৵ᠰ؆༼Աऱ 133ᙃᢝΖڼຝᢞԱݺଚऱංუΔܓܛش logEᐛࠐചଃጤរೠ(VAD)Δயᄎ c0ᐛࠐऱړΖ
Table 4. Comparison of recognition accuracy (%) of the original SFN and the modified SFN applied to the c0 feature

Method            Set A    Set B    Average  AR       RR
Baseline          71.95    68.22    70.09    -        -
SFN-I             83.04    84.70    83.87    13.79    46.08
modified SFN-I    84.54    85.79    85.17    15.08    50.41
SFN-II            83.29    85.14    84.22    14.13    47.23
modified SFN-II   85.03    86.06    85.55    15.46    51.68
Εᙩଃᐛإ֏ऄፖᐛൎऄٽհኔፖಘᓵ ছԫհԫ٨ߓऱኔΔਢ൶ಘٺጟ౨ၦઌᣂᐛݾய౨Δડנ
ڶኔխΔࠄ֏(SFN)ऄऱᚌฆΔຍإհᙩଃᐛנ༽ᄅࢬଚݺ logE ፖ c0 ጟ౨ၦઌᣂᐛΔ 塒ऱමዿଙᢜᐛএᑇ(c1~c12)ঞፂլΖڇຍԫխΔشଚቫᇢലݺ logEፖ c0ᐛऱ SFNऄፖش c1~c12հමዿଙᢜᐛএᑇऱൎݾࢤאףٽΔ៶אᨠኘհਢࢤګףڶܡΔ౨ԫޏޡଃᙃᢝΖ
հ༽ࢬଚᙇᖗհছݺຍᇙΔڇ MVN[9]ΕMVA[10]א HEQ[11]ԿጟൎݾࢤΔش c1~c12հමዿଙᢜᐛএᑇՂΔലݺଚࢬ༼հ SFN-Iࢨ SFN-IIऄشࢨ౨ၦઌᣂᐛ(logE c0)ՂΔݺଚലՂ૪ڶࢬऱኔნᖞګնፖքΖ
ಾኙรԫᐛ(logE c1~c12)հնऱᑇᖕխΔ٨(2)~(4)ਢشܓԫൎݾ(MVN MVAࢨ HEQ)٤ຝᐛᑇհΔ٨(5)~(10)ঞᙩଃᐛإ֏ऄ(SFN)ٽऄհΖᅝݺଚല٨(2)Ε٨(5)ፖ٨(8)ऱઌለΕ٨(3)Ε٨(6)ፖ٨(9)ऱઌለΔ٨(4)Ε٨(7)ፖ٨(10)ऱઌለΔຟאױנല SFN-I ࢨSFN-II شࠌ logEᐛΔהشࠀऄڇشࠌ c1Дc12 ᐛՂΔࢬऱᙃᢝᗑشࠌԫጟऄ٤ຝᐛऱᙃᢝנڍΔ(9)٨ڕࠏհόSFN-II (logE) + MVA (c1~c12)ύऄΔᙃᢝሒ 8997ΔԱ٨(4)հόHEQ (logE c1~c12)ύऄࢬհ 8744ऱᙃᢝΖٵழΔݺଚՈנ SFN-IIऱய౨ཏሙᚌ SFN-IΔڼᇿছԫऱᓵਢԫીऱΖᅝݺଚലնፖԲऱᑇᖕઌለழΔՈאױנΔشࠌ
SFN logEᐛشࠌٽMVNΕMVAࢨ HEQऄᠰ؆ c1Дc12ᐛΔאױᗑشࠌ SFN logEᐛࠋޓऱᙃᢝயΔڼᢞԱ SFNऄፖMVNΕMVAࢨ HEQऄऱᒔࢤګףڶΖ
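Table 5. Comparison of recognition accuracy (%) when SFN is applied to the logE feature and other robustness techniques are applied to the c1~c12 features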
Method                              Set A    Set B    Average  AR       RR
(1) Baseline                        71.98    67.79    69.89    -        -
(2) MVN (logE, c1~c12)              83.55    83.75    83.65    13.77    45.71
(3) MVA (logE, c1~c12)              86.69    86.89    86.79    16.91    56.13
(4) HEQ (logE, c1~c12)              87.15    87.72    87.44    17.55    58.28
(5) SFN-I (logE) + MVN (c1~c12)     87.33    87.81    87.57    17.69    58.72
(6) SFN-I (logE) + MVA (c1~c12)     88.40    88.84    88.62    18.74    62.21
(7) SFN-I (logE) + HEQ (c1~c12)     87.93    88.04    87.99    18.10    60.10
(8) SFN-II (logE) + MVN (c1~c12)    88.45    88.88    88.67    18.78    62.36
(9) SFN-II (logE) + MVA (c1~c12)    89.82    90.12    89.97    20.09    66.69
(10) SFN-II (logE) + HEQ (c1~c12)   89.29    89.33    89.31    19.43    64.50

ಾኙรԲᐛ(c0 c1~c12)հքऱᑇᖕխΔ٨(2)~(4)ਢشܓԫൎݾ
(MVN MVAࢨ HEQ)٤ຝᐛᑇհΔ٨(5)~(16)ঞᙩଃᐛإ֏ऄ(SFN)ٽऄհΖᣊۿնխ٨(1)~(10)ܧࢬऱΔൕքխհ٨(1)~(10)ፖԿऱᑇᖕઌለΔشࠌ SFN c0ᐛشࠌٽMVNΕMVAࢨ HEQऄᠰ؆ c1Дc12ᐛΔאױᗑشࠌ SFN c0ᐛࠋޓऱய౨ΔݺଚΔല SFN-Iࢨ SFN-IIشࠌ c0ᐛΔהشࠀऄڇشࠌ c1Дc12ᐛழΔࢬऱᙃᢝࠀॺਢᚌᗑشࠌԫጟऄ٤ຝᐛऱᙃᢝ ΰຍࠄለऱᑇᖕڇ
խאᇆאףಖαΔ(6)٨ڕࠏհόSFN-I (c0) + MVA (c1~c12)ύऄΔᙃᢝ
8777Δઌለ(3)٨հόMVA (c0 c1~c12)ύऄࢬհ ౨ױऱڼΖࠐ8846شܓܛছԫբᆖ༼ΔڇΔڂ c0ᐛച SFNऄխऱଃጤរೠ(VAD)ᄎለլ壄ᒔΔ SFNऱய౨ΖڼڂΔᣊۿছԫΔڇຍᇙݺଚشࠌಾኙ c0ᐛհଥإऱڤ SFNऄΔࠐፖMVNΕMVAࢨ HEQऄٽΔຍࠄ٨քऱ٨(11)~(16)խΖ
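Table 6. Comparison of recognition accuracy (%) when SFN is applied to the c0 feature and other robustness techniques are applied to the c1~c12 features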
Method                                      Set A    Set B    Average  AR       RR
(1) Baseline                                71.95    68.22    70.09    -        -
(2) MVN (c0, c1~c12)                        85.03    85.54    85.29    15.20    50.81
(3) MVA (c0, c1~c12)                        88.11    88.81    88.46    18.38    61.42
(4) HEQ (c0, c1~c12)                        86.99    88.13    87.56    17.48    58.42
(5) SFN-I (c0) + MVN (c1~c12)               85.62    86.62    86.12    16.04    53.60
(6) SFN-I (c0) + MVA (c1~c12)               87.38    88.16    87.77    17.69    59.12
(7) SFN-I (c0) + HEQ (c1~c12)               85.95    86.53    86.24    16.16    54.00
(8) SFN-II (c0) + MVN (c1~c12)              86.92    87.69    87.31    17.22    57.56
(9) SFN-II (c0) + MVA (c1~c12)              89.04    89.61    89.33    19.24    64.32
(10) SFN-II (c0) + HEQ (c1~c12)             87.43    87.88    87.66    17.57    58.73
(11) modified SFN-I (c0) + MVN (c1~c12)     87.49    87.89    87.69    17.61    58.85
(12) modified SFN-I (c0) + MVA (c1~c12)     89.30    89.54    89.42    19.34    64.63
(13) modified SFN-I (c0) + HEQ (c1~c12)     88.10    88.39    88.25    18.16    60.71
(14) modified SFN-II (c0) + MVN (c1~c12)    88.25    88.33    88.29    18.21    60.86
(15) modified SFN-II (c0) + MVA (c1~c12)    89.87    89.98    89.93    19.84    66.32
(16) modified SFN-II (c0) + HEQ (c1~c12)    89.25    89.46    89.36    19.27    64.42

Comparing rows (11)~(16) of Table 6 with rows (1)~(10), we can see that the modified SFN methods for the c0 feature (modified SFN-I and modified SFN-II) perform considerably better than the original SFN methods and, when used together with MVN, MVA or HEQ, outperform MVN, MVA or HEQ applied to all features. Among them, row (15), "modified SFN-II (c0) + MVA (c1~c12)", gives the highest recognition accuracy, 89.93%, which is very close to the best accuracy of 89.97% in Table 5 (row (9), "SFN-II (logE) + MVA (c1~c12)"). This confirms that the modified SFN further improves the robustness of the c0 feature in additive noise environments.
ऱጟᙩଃᐛנ༽ࢬᢞךאױଚݺรԿፖรհ٤ຝऱኔᑇᖕխΔط
֏ऄ(SFN-Iፖإ SFN-II)ኙ౨ၦઌᣂᐛړߜڶऱൎ֏யΔ SFN-IIࢬऱᙃᢝਢ SFN-I Δױ౨ڕڂรԲࢬຫ૪Δڂ SFN-II ऄڶຌڤհଃጤរೠ(soft-decision voice activity detection)ऱᖲΔઌለ SFN-Iऄڤհଃጤរೠ(hard-decision voice activity detection)ऱᖲ ΔছऱଃॺଃܒᙑᎄࢬທګऱᐙઌኙለΖΔਔߢΔSFN-Iऄ SFN-IIऄऱٵ٥ᚌរڇചՂԼ១ΰܛᓤᠧ৫ᄕαயৰᚌฆΔڼڂᄕኔشऱᏝଖΖ
նΕᓵ ʳݾԫଡᄅऱଃൎנ༽ଚݺᓵխΔءڇ Ϋψᙩଃᐛإ֏ऄω(silence
feature normalization SFN)ΔڼऄചՂԼ១யᚌฆΖਢಾኙ౨ၦઌᣂ
ᐛ(logEፖ c0)ࢤګףڂᠧಛທګऱ؈టᔞᅝऱᇖᚍΖSFNऄشܓԱԫଡຏៀᕴװࡨ౨ၦઌᣂᐛ٨ݧΔࠀലຏመڼຏៀᕴࢬऱհᙁנᐛ٨ݧஞࠐ
ଃॺଃऱᣊΔࠀᚨش១ڶயऱऄࠐॺଃຝऱᐛΔലᠧಛኙଃᐛऱեឫΔאཚ༼ಝᒭፖᇢᛩቼ৫Δ ༼ᠧಛᛩቼՀऱଃᙃᢝΖ ൎڍאኔءΔSFNऄഗߢΔ౨ၦઌᣂᐛױኔᑇᖕխط אڇवಾኙ౨ၦઌᣂᐛᔞᅝऱᇖᚍΔױڼطऱᙃᢝΙړޓݾଃڤ
ॺᠧಛᛩቼՀઃԼထऱᙃᢝ༼ ΔقԱ౨ၦઌᣂᐛࢬऱଃᦸ
ᇷಛਢᐙᙃᢝऱԫଡૹਐᑑΖڼ؆Δᅝݺଚല SFN ऄፖൎڤଃݾٽΔᙃᢝᗑشࠌԫጟൎڤଃݾࢬऱᙃᢝޓΔխԾא
SFN-IIऄٽMVAऄऱᙃᢝΔױሒല२ 90ऱᙃᢝΖ ౨ၦઌᣂᐛឈ৫ଃᦸԺΔਢᠧಛኙեឫ৫ՈઌኙৰՕΔڼڂ౨
ၦઌᣂᐛऱړᡏΔലᄎৰऴ൷چᐙߓอऱᙃᢝய౨Δױڼطव౨ၦઌᣂᐛ
ऱൎ֏ࠐآڇսਢଖ൶ಘऱԫՕᓰᠲΙݺଚݦඨאױࠐآലࢬ୶ऱݾΔឩ
୶ᇢለՕڗნၦऱଃᙃᢝߓอՂΔ൶ಘຍᣊݾڇլٵᓤᠧ৫հଃᙃᢝߓ
อऱய౨Ζ؆ΔݺࠐآଚսױཛٻೈࢤګףᠧಛऱٻᤉᥛԵઔߒΔՈאױಾኙ
ೈຏሐࢤᠧಛऱऄװઌᣂऱ൶ಘΔࠀቫᇢലٽΔࠌଃᙃᎁߓอ౨ڶޓ
யٺچᣊᠧಛऱեឫΔᖑחڶԳየრհᙃᢝΖ ە [1] Bocchieri E L and Wilpon J G Discriminative Analysis for Feature Reduction in
Automatic Speech Recognition, 1992 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1992).
[2] Julien Epps and Eric H. C. Choi, An Energy Search Approach to Variable Frame Rate Front-End Processing for Robust ASR, 2005 European Conference on Speech Communication and Technology (Interspeech 2005 - Eurospeech).
[3] Weizhong Zhu and Douglas O'Shaughnessy, Log-Energy Dynamic Range Normalization for Robust Speech Recognition, 2005 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005).
[4] Hung-Bin Chen, On the Study of Energy-Based Speech Feature Normalization and Application to Voice Activity Detection, M.S. thesis, National Taiwan Normal University, Taiwan, 2007.
[5] C.-F. Tai and J.-W. Hung, Silence Energy Normalization for Robust Speech Recognition in Additive Noise Environments, 2006 International Conference on Spoken Language Processing (Interspeech 2006 - ICSLP).
[6] Tai-Hwei Hwang and Sen-Chia Chang, Energy Contour Enhancement for Noisy Speech Recognition, 2004 International Symposium on Chinese Spoken Language Processing (ISCSLP 2004).
[7] H. G. Hirsch and D. Pearce, The AURORA Experimental Framework for the Performance Evaluations of Speech Recognition Systems under Noisy Conditions, Proceedings of ISCA ITRW ASR2000, Paris, France, 2000.
[8] http://htk.eng.cam.ac.uk
[9] S. Tibrewala and H. Hermansky, Multiband and Adaptation Approaches to Robust Speech Recognition, 1997 European Conference on Speech Communication and Technology (Eurospeech 1997).
[10] C.-P. Chen and J.-A. Bilmes, MVA Processing of Speech Features, IEEE Trans. on Audio, Speech, and Language Processing, 2006.
[11] A. Torre, J. Segura, C. Benitez, A. M. Peinado, and A. J. Rubio, Non-Linear Transformations of the Feature Space for Robust Speech Recognition, 2002 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002).
Robust Features for Effective Speech and Music Discrimination
Zhong-hua Fu (1), Jhing-Fa Wang (2)
School of Computer Science,
Northwestern Polytechnical University, Xi'an, China (1)
Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan (1, 2)
mailfzhmailnckutw1 wangjfcsienckuedutw2
Abstract
Speech and music discrimination is one of the most important issues for multimedia information retrieval and efficient coding. While many features have been proposed, few of them are robust under noisy conditions, especially in telecommunication applications. In this paper, two novel features based on the real cepstrum are presented to represent essential differences between music and speech: Average Pitch Density (APD) and Relative Tonal Power Density (RTPD). Separate histograms are used to demonstrate the robustness of the novel features. Results of discrimination experiments show that these features are more robust than the commonly used features. The evaluation database consists of a reference collection and a set of telephone speech and music recorded in the real world.
Keywords: Speech/Music Discrimination, Multimedia Information Retrieval, Real Cepstrum
1 Introduction
In applications of multimedia information retrieval and effective coding for telecommunication, the audio stream always needs to be diarized or labeled as speech, music, noise, or silence, so that different segments can be processed in different ways. However, speech signals often contain many kinds of noise, and the styles of music, such as personalized ring-back tones, may differ in thousands of ways. These factors make the discrimination problem more difficult.
A variety of systems for audio segmentation or classification have been proposed in the past, and many features have been used, such as Root Mean Square (RMS) [1], Zero Crossing Rate (ZCR) [1, 4, 5], low frequency modulation [2, 4, 5], entropy and dynamism features [2, 3, 6], and Mel Frequency Cepstral Coefficients (MFCCs). Some features need a high quality audio signal or refined spectral detail, and some cause a long delay, so they are not fit for telecommunication applications. While classification frameworks including nearest neighbor, neural network, Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), and Support Vector Machine (SVM) have been adopted as the back end, the features are still the crucial factor in the final performance. As shown in the following part of this paper, the discrimination abilities of some common features are poor with noisy speech. The main reason may be that they do not represent the essential difference between speech and music.
In this paper, two novel features called Average Pitch Density (APD) and Relative Tonal Power Density (RTPD) are proposed. They are based on real cepstrum analysis and show better robustness than the others. The evaluation database consists of two different data sets: one comes from Scheirer and Slaney [5], and the other is collected from a real telecommunication situation. The total lengths for music and speech are about 37 minutes and 287 minutes, respectively.
The rest of this paper is organized as follows. Section 2 introduces the novel features based on real cepstrum analysis. Section 3 describes the evaluation database and the comparative histograms of different features. The discrimination experiments and their results are given in Section 4. Section 5 concludes this paper.
2 Features Based on Real Cepstrum
There are a great many types of music, and their signal components can be divided into two classes: tonal-like and noise-like. The tonal-like class consists of tones played by all kinds of musical instruments, and these tones are concatenated to construct the melody of the music. The noise-like class is mainly played by percussion instruments such as the drum, cymbal, gong, and maracas. The former class corresponds to the musical system, which is constructed from a set of predefined pitches according to phonology. The latter class cannot play notes with a certain pitch and is often used to construct rhythm.
The biggest difference between speech and music lies in the pitch. Because of the restrictions of the musical system, the pitch of music can usually only jump between discrete frequencies, except for vibratos or glissandi. The pitch of speech, in contrast, can change continuously and will not stay on a fixed frequency for a long time. Besides the difference in pitch character, the noise part of music, which is often played by percussion instruments, also has different features from speech. That part of music does not have pitch, but it usually has stronger power. This phenomenon seldom exists in speech signals, because the stronger part of speech is generally the voiced signal, which does have pitch.
In order to describe the differences in pitch between speech and music, we use the real cepstrum instead of the spectrogram. Cepstrum analysis is a more powerful tool for analyzing the detail of the spectrum, and it can separate pitch information from the spectral envelope. The real cepstrum is defined as follows (Eq. (2) gives the Matlab expression):
$$RC_x(n) = \mathrm{real}\left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\left| X(e^{j\omega}) \right| e^{j\omega n} \, d\omega \right) \tag{1}$$

RC_x = real(ifft(log(abs(fft(x)))))    (2)

where $x$ is a frame of the audio signal weighted by a Hamming window, $X(e^{j\omega})$ is its discrete Fourier transform, $\mathrm{real}(\cdot)$ denotes extracting the real part of the complex result, and $RC_x$ are the coefficients of the real cepstrum. The coefficients near the zero origin reflect the large-scale information of the power spectrum, such as the spectral envelope, and those far from the zero origin show the spectral detail. Figure 1 uses the latter to demonstrate the differences in pitch between speech and music. It is clear that the music pitches jump discretely while the speech pitches do not. Figure 2 uses the spectrogram to show the noise-like feature of a rock music segment, where most ictus have no pitch.
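As an illustration of Eq. (2), the following is a minimal sketch in Python with NumPy rather than Matlab. The function name `real_cepstrum`, the small floor applied before the logarithm, and the random test frame are assumptions made for this example and are not part of the original paper.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one frame, mirroring real(ifft(log(abs(fft(x)))))."""
    x = np.asarray(frame, dtype=float)
    # Weight the frame with a Hamming window, as described in the text.
    windowed = x * np.hamming(len(x))
    magnitude = np.abs(np.fft.fft(windowed))
    # Assumption: floor the magnitude so that log(0) cannot occur.
    log_magnitude = np.log(np.maximum(magnitude, 1e-12))
    return np.real(np.fft.ifft(log_magnitude))

# Hypothetical test frame (random noise), used only to exercise the function.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
rc = real_cepstrum(frame)
rc_louder = real_cepstrum(4.0 * frame)
```

Scaling the frame by a constant only shifts the coefficient at the zero origin by the logarithm of that constant; the coefficients far from the origin, which carry the pitch detail, are unchanged.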
[Figure 1: (a) real cepstrum of music; (b) real cepstrum of speech]
Figure 1. Pitch difference between music (a) and speech (b) by means of the real cepstrum. Only coefficients far from the zero origin are used.
[Figure 2: (a) waveform; (b) spectrogram, frequency axis from 0.625 kHz to 2.5 kHz]
Figure 2. Waveform and spectrogram of a segment of rock music. It is clear that most ictus have no pitch.
To parameterize the above conclusion, we propose two novel features: Average Pitch Density (APD) and Relative Tonal Power Density (RTPD).
A. APD feature
Because of the musical instruments and polyphony, the average pitch density of music is usually higher than that of speech. The APD feature is independent of signal power and reflects the details of the spectrum. It is defined as

$$APD(K) = \frac{1}{L} \sum_{i=K \cdot N+1}^{K \cdot N+N} \; \sum_{j=l_1}^{l_2} \left| RC_{x_i}(j) \right|, \quad \text{where } L = l_2 - l_1 + 1 \tag{3}$$

where $K$ denotes the $K$-th analysis segment, $N$ is its length in frames, and $L$ is the number of $RC_x$ coefficients far from the zero origin, whose range is $l_1$ to $l_2$. This feature is relatively simple, but it proves to be robust for discrimination between speech and music. The histogram in Figure 3(e) demonstrates this conclusion.
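The APD computation can be sketched as follows, under one reading of Eq. (3): the magnitudes of the real cepstrum coefficients in the range l1 to l2 are summed over all frames of the segment and divided by L. The use of magnitudes, the helper `real_cepstrum`, the values of `l1` and `l2`, and the random test segment are assumptions for illustration only.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one Hamming-windowed frame."""
    x = np.asarray(frame, dtype=float)
    magnitude = np.abs(np.fft.fft(x * np.hamming(len(x))))
    # Assumption: floor the magnitude so that log(0) cannot occur.
    return np.real(np.fft.ifft(np.log(np.maximum(magnitude, 1e-12))))

def average_pitch_density(frames, l1, l2):
    """APD of one analysis segment; `frames` has one frame per row."""
    rc = np.array([real_cepstrum(f) for f in frames])
    L = l2 - l1 + 1
    # Sum coefficient magnitudes over frames i and quefrency indices l1..l2.
    return float(np.abs(rc[:, l1:l2 + 1]).sum() / L)

# Hypothetical segment: 100 frames of 256 samples of random noise.
rng = np.random.default_rng(1)
segment = rng.standard_normal((100, 256))
apd = average_pitch_density(segment, l1=20, l2=120)
apd_scaled = average_pitch_density(0.1 * segment, l1=20, l2=120)
```

Because the range l1 to l2 excludes the zero origin, the two values agree, which matches the statement that the feature does not depend on signal power.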
B. RTPD feature
While the detailed information about the spectrum can be used to discriminate tonal music or song from speech, the variation of energy combined with pitch information may be used to separate percussive music from noisy speech. In clean or noisy speech signals, the segments that show clear pitch are usually voiced speech, which is likely to have larger energy. So if all segments with pitch are labeled as tonal parts and the others are labeled as non-tonal parts, we can say that if the energy of the tonal parts is smaller than that of the non-tonal parts, then the segment is probably not speech; otherwise, the segment can be speech or music.
In order to label tonal and non-tonal parts, we again use the real cepstrum, since if a clear pitch exists, a distinct stripe will appear in the real cepstrum even under noisy conditions. We use the peak value of $RC_x$ far from the zero origin to decide between tonal and non-tonal. The threshold we choose is 0.2. Frames whose peak value is larger than 0.2 are labeled as tonal; the others are labeled as non-tonal. Thus the RTPD can be defined as

$$RTPD(K) = \frac{\mathrm{mean}_{i \in T_K}(RMS_i)}{\mathrm{mean}_{j \in S_K}(RMS_j)} \tag{4}$$

where $T_K$ consists of all tonal frames of the $K$-th analysis segment, $S_K$ is the entire set of frames of the segment, and $RMS_i$ is the root mean square of the $i$-th frame.
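The labeling rule and the RTPD can be sketched as below, reading the RTPD as the mean RMS of the tonal frames divided by the mean RMS of all frames in the segment. The helper names, the cepstral range `l1` to `l2`, and the choice to return 0 when a segment has no tonal frame are assumptions made for this illustration.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one Hamming-windowed frame."""
    x = np.asarray(frame, dtype=float)
    magnitude = np.abs(np.fft.fft(x * np.hamming(len(x))))
    return np.real(np.fft.ifft(np.log(np.maximum(magnitude, 1e-12))))

def tonal_mask(frames, l1, l2, threshold=0.2):
    """Label a frame as tonal when its cepstral peak in l1..l2 exceeds 0.2."""
    peaks = np.array([real_cepstrum(f)[l1:l2 + 1].max() for f in frames])
    return peaks > threshold

def rtpd(rms, tonal):
    """Mean RMS of the tonal frames relative to the mean RMS of all frames."""
    rms = np.asarray(rms, dtype=float)
    tonal = np.asarray(tonal, dtype=bool)
    # Assumption: a segment with no tonal frame (or no energy) gets RTPD 0.
    if not tonal.any() or rms.mean() == 0.0:
        return 0.0
    return float(rms[tonal].mean() / rms.mean())

def segment_rtpd(frames, l1, l2, threshold=0.2):
    """RTPD of one analysis segment; `frames` has one frame per row."""
    frames = np.asarray(frames, dtype=float)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return rtpd(rms, tonal_mask(frames, l1, l2, threshold))

# Worked example: frames 1 and 3 are tonal, so RTPD = mean(1, 2) / mean(1, 3, 2).
example = rtpd([1.0, 3.0, 2.0], [True, False, True])
```

A value below 1 means that the tonal frames carry less energy than the segment average, which is the situation the text associates with percussive music rather than speech.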
3 Discrimination Ability
Due to the lack of a standard database for evaluation, comparisons between different features are not easy. Our evaluation database consists of two parts: one comes from the collection of Scheirer and Slaney [5], and the other comes from real recordings from a telecommunication application. The former includes speech sets and music sets. Each set contains 80 15-second-long audio samples. The samples were collected by digitally sampling an FM tuner (16-bit monophonic samples at a 22.05 kHz sampling rate), using a variety of stations, content styles, and noise levels. They made a strong attempt to collect as much of the breadth of available input signals as possible (see [5] for details). The latter set was recorded by us in a telecommunication application and has 25 music files and 174 noisy speech files, 17 and 117 minutes in length, respectively. In particular, the speech signals of the latter set contain many kinds of live noise, which are non-stationary with different SNRs.
Based on the two data sets above, we build an evaluation corpus by concatenating those files randomly into two collections: CLN-Mix and ZX-Mix. CLN-Mix contains 20 mixed files, each of which concatenates 2 speech samples and 2 music samples, all extracted from Scheirer's database. ZX-Mix is built in the same way, except that all samples are chosen from our recordings. With these databases, we compared 4 commonly used features with our proposed ones. They are (1) RMS, (2) zero crossing rate, (3) variation of spectral flux, and (4) percentage of "low-energy" frames. Figure 3 shows the discrimination abilities of each feature with Scheirer's database and with ours. It is clear that those 4 features perform poorly in noisy situations, while APD and RTPD are more robust.
[Figure 3: histograms of speech and music for each feature, on Scheirer's database and on our database: (a) RMS; (b) zero crossing rate; (c) variation of spectral flux; (d) percentage of low-energy frames; (e) average pitch density; (f) RTPD]
Figure 3. Histograms of different features for speech/music discrimination. (a)-(f) are RMS, ZCR, variation of spectral flux, percentage of "low-energy" frames, APD, and RTPD.
4 Discrimination Experiments
In many speech and music discrimination systems, the GMM is commonly used for classification. A GMM models each class of data as the union of several Gaussian clusters in the feature space. This clustering can be iteratively derived with the well-known EM algorithm. Usually the individual clusters are not represented with full covariance matrices, but only with the diagonal approximation. The GMM gives a likelihood estimate for each model, which measures how well a new data point is modeled by the trained Gaussian clusters.
We use 64-component GMMs to model the speech and music signals separately. The feature vector consists of (1) APD, (2) RTPD, (3) log of variance of RMS, (4) log of variance of spectral centroid, (5) log of variance of spectral flux, (6) 4 Hz modulation energy, and (7) dynamic range. The training data consist of the training part of Scheirer's database and 8 minutes of recorded noisy speech. CLN-Mix and ZX-Mix are used for evaluation.
The frame length is 10 ms, and the analysis window for extracting the proposed features is 1 second (100 frames), advanced by 10 new input frames each time. For comparison, the MFCC + delta + acceleration (MFCC_D_A) feature for each frame is also examined. A GMM with 64 mixtures is used for speech and for music, respectively. For classification, each proposed feature vector is used to calculate the log-likelihood score, and correspondingly 10 frames of MFCC_D_A features are used. The experimental results are listed in Table 1. Furthermore, we also use 10 adjacent proposed feature vectors for one decision, and correspondingly 100 frames of MFCC_D_A features. The results are shown in Table 2.
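The log-likelihood decision described above can be sketched as follows for diagonal-covariance GMMs. This is a minimal illustration that assumes the model parameters are already trained; the two toy models, the two-dimensional features, and the function names are hypothetical and do not reproduce the 64-component models or the seven-dimensional feature vector of the experiments.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Per-vector log-likelihood under a diagonal-covariance GMM."""
    x = np.atleast_2d(np.asarray(x, dtype=float))        # (T, D)
    weights = np.asarray(weights, dtype=float)           # (M,)
    means = np.asarray(means, dtype=float)               # (M, D)
    variances = np.asarray(variances, dtype=float)       # (M, D)
    diff = x[:, None, :] - means[None, :, :]             # (T, M, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)
    log_comp = (np.log(weights) + log_norm)[None, :] \
        - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    # Log-sum-exp over the mixture components for numerical stability.
    peak = log_comp.max(axis=1, keepdims=True)
    total = peak + np.log(np.exp(log_comp - peak).sum(axis=1, keepdims=True))
    return total.ravel()

def classify(vectors, speech_gmm, music_gmm):
    """Sum the scores of all vectors in one decision and pick the larger."""
    ll_speech = gmm_log_likelihood(vectors, *speech_gmm).sum()
    ll_music = gmm_log_likelihood(vectors, *music_gmm).sum()
    return "speech" if ll_speech >= ll_music else "music"

# Hypothetical toy models: (weights, means, variances).
speech_gmm = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
music_gmm = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])
label_one = classify([[0.2, 0.1]], speech_gmm, music_gmm)
label_many = classify([[4.9, 5.2], [5.1, 4.7]], speech_gmm, music_gmm)
```

Passing one vector corresponds to a decision per feature vector, and passing several adjacent vectors corresponds to pooling them into a single decision.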
It is clear that the MFCC_D_A features discriminate well on the CLN-Mix data, but their accuracy drops distinctly on ZX-Mix, especially for music signals. On both data sets, however, our proposed features work well and show robustness under noisy conditions.
Table 1. Speech/Music Discrimination Accuracies (%) in Every 100 ms

              MFCC_D_A             Proposed
Data set      Speech    Music      Speech    Music
CLN-Mix       91.56     89.81      93.78     91.48
ZX-Mix        99.91     64.41      94.19     93.13

Table 2. Speech/Music Discrimination Accuracies (%) in Every Second

              MFCC_D_A             Proposed
Data set      Speech    Music      Speech    Music
CLN-Mix       93.98     95.11      95        92.86
ZX-Mix        100       67.39      100       94.45
5 Conclusion
Two novel features have been presented in this paper for robust discrimination between speech and music: Average Pitch Density (APD) and Relative Tonal Power Density (RTPD). As shown in separate histograms, many other commonly used features do not work under noisy conditions, while the novel features show more robustness. When they are combined with 5 other robust features, the discrimination accuracies are higher than 90%. The results suggest that the novel features may represent some essential differences between speech and music.
There are many interesting directions in which to continue this work. Since the real cepstrum can show many differences between speech and music, there may be other novel features that represent the holding and changing characteristics of pitch. Moreover, more research is needed on better classification and feature combinations.
References
[1] C. Panagiotakis, G. Tziritas, A Speech/Music Discriminator Based on RMS and Zero-Crossings, IEEE Transactions on Multimedia, Vol. 7(1), February 2005.
[2] O. M. Mubarak, E. A. Ambikairajah, J. Epps, Novel Features for Effective Speech and Music Discrimination, Proc. IEEE International Conference on Engineering of Intelligent Systems, pp. 1-5, April 2006.
[3] J. E. Muñoz-Expósito, S. García-Galán, N. Ruiz-Reyes, P. Vera-Candeas, Adaptive Network-based Fuzzy Inference System vs. Other Classification Algorithms for Warped LPC-based Speech/Music Discrimination, Engineering Applications of Artificial Intelligence, Vol. 20(6), pp. 783-793, September 2007.
[4] M. J. Carey, E. S. Parris, H. Lloyd-Thomas, A Comparison of Features for Speech, Music Discrimination, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 149-152, March 1999.
[5] E. Scheirer, M. Slaney, Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 1331-1334, April 1997.
[6] T. Zhang, J. Kuo, Audio Content Analysis for On-line Audiovisual Data Segmentation and Classification, IEEE Transactions on Speech and Audio Processing, Vol. 9(3), pp. 441-457, May 2001.
Robust Voice Activity Detection Based on Discrete Wavelet
Transform
Kun-Ching Wang
Department of Information Technology & Communication, Shih Chien University
kunchingmailkhuscedutw
Abstract
This paper addresses the problem of determining voice activity in the presence of noise,
especially dynamically varying background noise. The proposed voice activity detection
algorithm is based on a three-layer wavelet decomposition structure. Applying the
auto-correlation function to each subband exploits the fact that the intensity of periodicity
is more significant in the subband domain than in the full-band domain. In addition, the
Teager energy operator (TEO) is used to eliminate the noise components from the wavelet
coefficients of each subband. Experimental results show that the proposed wavelet-based
algorithm outperforms the others and can work in dynamically varying background noise.
Keywords: voice activity detection, auto-correlation function, wavelet transform, Teager
energy operator
1 Introduction
Voice activity detection (VAD) refers to the ability to distinguish speech from noise and is
an integral part of a variety of speech communication systems, such as speech coding, speech
recognition, hands-free telephony, and echo cancellation. Although the existing VAD
algorithms perform reliably, their feature parameters depend almost entirely on the energy
level and are sensitive to noisy environments [1-4]. So far, wavelet-based VAD has rarely been
discussed, although wavelet analysis is well suited to the properties of speech. S. H. Chen et
al. [5] showed that a VAD based on the wavelet transform achieves excellent performance.
However, their approach is not suitable for practical applications such as conditions with a
variable noise level. Besides, a great deal of computing time is needed to accomplish the
wavelet reconstruction used to decide whether speech is active or not.
Compared with Chen's VAD approach, the proposed VAD decision depends only on
three-layer wavelet decomposition. This approach spends no computing time on
wavelet reconstruction. In addition, four non-uniform subbands are generated by the
wavelet-based approach, and the well-known auto-correlation function (ACF) is adopted to
detect the periodicity of each subband. We refer to the ACF defined in the subband domain
as the subband auto-correlation function (SACF). Because the periodic property is mainly
concentrated in the low frequency bands, we give the low frequency bands high resolution to
enhance the periodic property by decomposing only the low band on each layer. In addition
to the SACF, the Teager energy operator (TEO) is used as a pre-processor for the SACF.
The TEO is a powerful nonlinear operator and has been successfully used in various speech
processing applications [6-7]. F. Jabloun et al. [8] showed that the TEO can suppress car
engine noise and can be easily implemented in the time domain in Mel-scale subbands. The
experimental results presented later show that the TEO can further enhance the detection of
subband periodicity.
To accurately measure the intensity of periodicity from the envelope of the SACF, the
Mean-Delta (MD) method [9] is applied to each subband. The MD-based feature parameter
was presented for robust VAD, but it does not perform well in non-stationary noise, as
shown later. Finally, by summing the four values of the MDSACF (Mean-Delta of Subband
Auto-Correlation Function), a new feature parameter called the speech activity envelope
(SAE) is proposed. Experimental results show that the envelope of the new SAE parameter
can point out the boundaries of speech activity under poor SNR conditions and that it is
also insensitive to a variable noise level.
This paper is organized as follows. Section 2 describes the concept of the discrete wavelet
transform (DWT) and shows the structure of the three-layer wavelet decomposition that is
used. Section 3 describes the proposed feature parameter based on the Mean-Delta method
for the subband auto-correlation function. Section 4 introduces the Teager energy operator
(TEO) and shows its efficiency in subband noise suppression. The block diagram of the
proposed wavelet-based VAD algorithm is outlined in Section 5. Section 6 evaluates the
performance of the algorithm and compares it with two other wavelet-based VAD
algorithms and the ITU-T G.729B VAD. Finally, Section 7 discusses the conclusions drawn
from the experimental results.
2 Wavelet transform
The wavelet transform (WT) is based on time-frequency signal analysis. Wavelet analysis
is a windowing technique with variable-sized regions: it allows the use of long time
intervals where more precise low-frequency information is wanted, and shorter regions
where high-frequency information is wanted. It is well known that speech signals contain
many transient components and are non-stationary. With the multi-resolution analysis
(MRA) property of the WT, better time resolution is obtained in the high frequency range
to detect the rapidly changing transient components of the signal, while better frequency
resolution is obtained in the low frequency range to track the slowly time-varying
formants more precisely [10]. Figure 1 displays the structure of the three-layer wavelet
decomposition used in this paper. We decompose the entire signal into four non-uniform
subbands: three detail scales, D1, D2, and D3, and one approximation scale, A3.
Figure 1 Structure of three-layer wavelet decomposition
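The three-layer structure can be sketched as follows, where only the low band is split again on each layer. The paper does not state which mother wavelet is used, so the Haar wavelet here is an assumption chosen to keep the example short; the function names and the padding of odd-length inputs are also hypothetical.

```python
import numpy as np

def haar_step(x):
    """One DWT layer with the Haar wavelet (assumed): (approximation, detail)."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        # Assumption: repeat the last sample to reach an even length.
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def three_layer_dwt(signal):
    """Split only the low band on each layer, giving D1, D2, D3, and A3."""
    a1, d1 = haar_step(signal)
    a2, d2 = haar_step(a1)
    a3, d3 = haar_step(a2)
    return d1, d2, d3, a3

# Hypothetical 1024-sample frame of random noise.
rng = np.random.default_rng(3)
frame = rng.standard_normal(1024)
d1, d2, d3, a3 = three_layer_dwt(frame)
subband_lengths = [len(band) for band in (d1, d2, d3, a3)]
```

For a 1024-sample frame the subbands have 512, 256, 128, and 128 coefficients, so the low frequency range is resolved more finely in frequency than the high range.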
3 Mean-delta method for subband auto-correlation function
The well-known auto-correlation function (ACF) is usually used for measuring the
self-periodic intensity of a signal sequence, as shown below:

$$R(k) = \sum_{n=0}^{p-k} s(n)\, s(n+k), \quad k = 0, 1, \ldots, p \tag{1}$$

where $p$ is the length of the ACF and $k$ denotes the sample shift.
In order to make periodicity detection with the ACF more effective for detecting speech,
the ACF is defined in the subband domain, where it is called the subband auto-correlation
function (SACF). Figure 2 illustrates the normalized SACFs for each subband when the
input speech is contaminated by white noise. A normalization factor is applied in the
computation of the SACF, mainly to make it insensitive to a variable energy level. From
this figure, it is observed that the SACF of voiced speech has more obvious peaks than
that of unvoiced speech and white noise. Similarly, for unvoiced speech the ACF has
greater periodic intensity than for white noise, especially in the approximation
subband A3.
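A minimal sketch of the ACF of Eq. (1) and of its normalized form follows. Taking R(0) as the normalization factor is an assumption here, as are the function names and the tiny worked input.

```python
import numpy as np

def acf(s, p):
    """R(k) = sum over n = 0..p-k of s(n) s(n+k), for k = 0..p."""
    s = np.asarray(s, dtype=float)
    if len(s) < p + 1:
        raise ValueError("the sequence needs at least p + 1 samples")
    return np.array([np.dot(s[:p - k + 1], s[k:p + 1]) for k in range(p + 1)])

def normalized_acf(s, p):
    """ACF divided by R(0) (assumed normalization factor)."""
    r = acf(s, p)
    # Assumption: an all-zero sequence yields an all-zero normalized ACF.
    return r / r[0] if r[0] != 0.0 else np.zeros_like(r)

# Worked example: R(0) = 1 + 4 + 9, R(1) = 1*2 + 2*3, R(2) = 1*3.
r = acf([1.0, 2.0, 3.0], p=2)
r_norm = normalized_acf([1.0, 2.0, 3.0], p=2)
```

Applying the same function to the coefficients of each subband, instead of the full-band signal, gives the SACF described above.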
Furthermore, a Mean-Delta (MD) method [9] over the envelope of each SACF is used
here to evaluate the corresponding intensity of periodicity in each subband. First, a
measure similar to the delta cepstrum is used to estimate the periodic intensity of the
SACF, namely the Delta Subband Auto-Correlation Function (DSACF), shown below:

$$\dot{R}_M(k) = \frac{\displaystyle\sum_{m=-M}^{M} m\, \frac{R(k+m)}{R(0)}}{\displaystyle\sum_{m=-M}^{M} m^{2}} \tag{2}$$

where $\dot{R}_M$ is the DSACF over an $M$-sample neighborhood ($M = 3$ in this study).
It is observed that the DSACF measure behaves like the local variation of the SACF.
Second, averaging the delta of the SACF over an $M$-sample neighborhood, $\dot{R}_M(k)$, gives
the mean of the absolute values of the DSACF (MDSACF):

$$\overline{R}_M = \frac{1}{N} \sum_{k=0}^{N-1} \left| \dot{R}_M(k) \right| \tag{3}$$
From the above formulations, the Mean-Delta method can be used to measure the number
and amplitude of the peak-to-valley excursions in the envelope of the SACF. By simply
summing the four values of the MDSACF derived from the wavelet coefficients of the three
detail scales and the one approximation scale, a robust feature parameter called the
speech activity envelope (SAE) is proposed.
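The delta and mean-delta steps can be sketched as below. The handling of the sequence edges, where k + m falls outside the available lags, is not specified in the text; clamping the index to the nearest valid lag is an assumption of this example, as are the function names.

```python
import numpy as np

def delta_sacf(r, M=3):
    """Delta of an ACF sequence over an M-sample neighborhood."""
    r = np.asarray(r, dtype=float)
    n = len(r)
    if r[0] == 0.0:
        return np.zeros(n)
    rn = r / r[0]                                    # normalize by R(0)
    denom = sum(m * m for m in range(-M, M + 1))     # 28 when M = 3
    out = np.zeros(n)
    for k in range(n):
        for m in range(-M, M + 1):
            # Assumption: clamp k + m to the valid index range at the edges.
            idx = min(max(k + m, 0), n - 1)
            out[k] += m * rn[idx]
    return out / denom

def mean_delta_sacf(r, M=3):
    """Mean of the absolute values of the delta sequence."""
    return float(np.mean(np.abs(delta_sacf(r, M))))

# A linear ramp has a local slope of 1 away from the edges.
ramp_delta = delta_sacf(np.arange(1.0, 11.0))
flat_md = mean_delta_sacf(np.ones(10))
```

A flat sequence gives a mean-delta of zero, while an envelope with many large peak-to-valley excursions gives a large value, which is why the measure reflects periodic intensity.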
Figure 3 shows that the MRA property is important to the development of the SAE feature
parameter. The proposed SAE feature parameter is computed with and without
band-decomposition. In Figure 3(b), the SAE without band-decomposition provides only
obscure periodicity and confuses the word boundaries. Figures 3(c) to 3(f) show the value
of the MDSACF for each subband, from D1 to A3. The value of the MDSACF provides the
corresponding periodic intensity for each subband. Summing the four values of the
MDSACF, we form a robust SAE parameter. In Figure 3(g), the SAE with
band-decomposition points out the word boundaries accurately from its envelope.
Figure 2. SACF on voiced and unvoiced signals and white noise
Figure 3. SAE with/without band-decomposition
4 Teager energy operator
The Teager energy operator (TEO) is a powerful nonlinear operator that can track the
modulation energy and identify the instantaneous amplitude and frequency [7-10].
In discrete time, the TEO can be approximated by

$$\Psi_d[s(n)] = s^{2}(n) - s(n+1)\, s(n-1) \tag{4}$$

where $\Psi_d[s(n)]$ is called the TEO coefficient of the discrete-time signal $s(n)$.
Figure 4 indicates that the TEO coefficients not only suppress noise but also enhance the
detection of subband periodicity. The TEO coefficients help the SACF to discriminate
between speech and noise in detail.
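The discrete-time TEO of Eq. (4) can be sketched in a few lines. The treatment of the first and last samples, where a neighbor is missing, is an assumption of this example (the nearest interior value is repeated), as are the function name and the test tone.

```python
import numpy as np

def teager_energy(s):
    """TEO coefficients s(n)^2 - s(n+1) s(n-1) of a discrete-time signal."""
    s = np.asarray(s, dtype=float)
    psi = np.zeros_like(s)
    if len(s) < 3:
        return psi
    psi[1:-1] = s[1:-1] ** 2 - s[2:] * s[:-2]
    # Assumption: repeat the nearest interior value at the two edges.
    psi[0], psi[-1] = psi[1], psi[-2]
    return psi

# Hypothetical test tone A cos(omega n + phase).
n = np.arange(200)
amplitude, omega = 2.0, 0.3
tone = amplitude * np.cos(omega * n + 0.5)
psi = teager_energy(tone)
```

For a pure tone the operator returns the constant A^2 sin^2(omega), which depends on both amplitude and frequency; this is the sense in which it tracks modulation energy.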
Figure 4 Illustration of TEO processing for the discrimination between speech and noise by using periodicity detection
5 Proposed voice activity detection algorithm
In this section, the proposed VAD algorithm based on the DWT and the TEO is presented.
Fig. 8 displays the block diagram of the proposed wavelet-based VAD algorithm in detail.
For a given layer $j$, the wavelet transform decomposes the noisy speech signal into $j+1$
subbands, corresponding to the wavelet coefficient sets $w^{j}_{k,n}$. In this case,
three-layer wavelet decomposition is used to decompose the noisy speech signal into four
non-uniform subbands: three detail scales and one approximation scale. Let the layer be $j = 3$:

$$w^{3}_{k,m} = \mathrm{DWT}\{s(n), 3\}, \quad n = 1, \ldots, N, \quad k = 1, \ldots, 4 \tag{5}$$

where $w^{3}_{k,m}$ is the $m$-th coefficient of the $k$-th subband and $N$ denotes the window length.
The decomposed length of each subband is $N/2^{k}$ in turn.
For each subband signal, TEO processing [8] is then used to suppress the noise
component and to enhance the periodicity detection. In TEO processing,

$$t^{3}_{k,m} = \Psi_d[w^{3}_{k,m}], \quad k = 1, \ldots, 4 \tag{6}$$

Next, the SACF, which is the ACF defined in the subband domain, is computed. It can
sufficiently discriminate among voiced speech, unvoiced speech, and background noise
from the wavelet coefficients. The SACF derived from the Teager energy of the noisy
speech is given by

$$R^{3}_{k,m} = R[t^{3}_{k,m}], \quad k = 1, \ldots, 4 \tag{7}$$
To measure the intensity of periodicity from the envelope of the SACF accurately, the
Mean-Delta (MD) method [9] is applied to each subband.
The DSACF is given by

$$\dot{R}^{3}_{k,m} = \Delta[R^{3}_{k,m}], \quad k = 1, \ldots, 4 \tag{8}$$

where $\Delta[\cdot]$ denotes the delta operator.
Then the MDSACF is obtained by

$$\overline{R}^{3}_{k} = E\left[\left|\dot{R}^{3}_{k,m}\right|\right] \tag{9}$$

where $E[\cdot]$ denotes the mean operator.
Finally, we sum the values of the MDSACF derived from the wavelet coefficients of the
three detail scales and the one approximation scale, and denote the result as the SAE
feature parameter:

$$SAE = \sum_{k=1}^{4} \overline{R}^{3}_{k} \tag{10}$$
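The chain of Eqs. (5) to (10) for one frame can be sketched end to end as follows. This is an illustration under stated assumptions rather than the author's implementation: the Haar wavelet, the ACF taken over all lags of each subband, the zero TEO values at the edges, and the edge padding in the delta step are all choices made for this example, and the function names are hypothetical.

```python
import numpy as np

def haar_step(x):
    """One DWT layer with the Haar wavelet (assumed): (approximation, detail)."""
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def teager_energy(w):
    """TEO coefficients; the two edge samples are left at zero (assumption)."""
    t = np.zeros_like(w)
    t[1:-1] = w[1:-1] ** 2 - w[2:] * w[:-2]
    return t

def sacf(t):
    """ACF of one subband over all non-negative lags, normalized by R(0)."""
    n = len(t)
    r = np.correlate(t, t, mode="full")[n - 1:]
    return r / r[0] if r[0] > 0.0 else np.zeros(n)

def mean_delta(r, M=3):
    """Mean absolute delta over an M-sample neighborhood (edge-padded)."""
    n = len(r)
    padded = np.pad(r, M, mode="edge")
    delta = sum(m * padded[M + m:M + m + n] for m in range(-M, M + 1))
    delta = delta / sum(m * m for m in range(-M, M + 1))
    return float(np.mean(np.abs(delta)))

def speech_activity_envelope(frame):
    """SAE of one frame whose length is divisible by 8."""
    x = np.asarray(frame, dtype=float)
    a1, d1 = haar_step(x)
    a2, d2 = haar_step(a1)
    a3, d3 = haar_step(a2)
    # Sum the MDSACF of the three detail subbands and the approximation.
    return sum(mean_delta(sacf(teager_energy(w))) for w in (d1, d2, d3, a3))

# Hypothetical frames: random noise and silence.
rng = np.random.default_rng(4)
sae_noise = speech_activity_envelope(rng.standard_normal(1024))
sae_silence = speech_activity_envelope(np.zeros(1024))
```

Evaluating the function frame by frame yields the SAE envelope; a detector would then compare that envelope against a threshold, which the text does not specify.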
6 Experimental results
In our first experiment, speech activity detection is tested in three kinds of
background noise under various SNR values. In the second experiment, we vary the
level of the background noise and mix it into the test speech signal.
6.1 Test environment and noisy speech database
The proposed wavelet-based VAD algorithm operates on a frame-by-frame basis (frame size =
1024 samples/frame, overlapping size = 256 samples). Three noise types, white noise, car
noise, and factory noise, are taken from the Noisex-92 database [11]. The speech database
contains 60 speech phrases (in Mandarin and in English) spoken by 32 native speakers
(22 males and 10 females), sampled at 8000 Hz and linearly quantized at 16 bits per
sample. To vary the testing conditions, noise is added to the clean speech signal to create
noisy signals at specific SNRs of 30, 10, and -5 dB.
6.2 Evaluation in stationary noise
In this experiment, we consider only stationary noise environments. The proposed
wavelet-based VAD is tested under three types of noise sources and the three SNR values
mentioned above. Table 1 compares the proposed wavelet-based VAD with two other
wavelet-based VADs, proposed by Chen et al. [5] and J. Stegmann [12], and with the ITU
standard G.729B VAD [4]. The results from all cases involving the various noise types and
SNR levels are averaged and summarized in the bottom row of the table. The proposed
wavelet-based VAD and Chen's VAD are both superior to Stegmann's VAD and G.729B over
all SNRs under the various types of noise. In terms of the average correct and false speech
detection probabilities, the proposed wavelet-based VAD is comparable to Chen's VAD.
Both algorithms are based on DWT and TEO processing. However, Chen et al. decomposed
the input speech signal into 17 critical subbands using the perceptual wavelet packet
transform (PWPT). To obtain a robust feature parameter, called the VAS parameter, each
critical subband after their processing is synthesized individually while the other 16
subband signals are set to zero. The VAS parameter is then obtained by merging the values
of the 17 synthesized bands.
In contrast to the wavelet analysis/synthesis of S. H. Chen et al., we use only wavelet
analysis. The three-layer decomposition structure yields four non-uniform bands as
front-end processing. To develop the feature parameter, we spend no extra computing
power on synthesizing each band. Besides, Chen's VAD algorithm must be performed on
the entire speech signal. It is not appropriate for real-time use, since it does not work
with frame-based processing. Conversely, in our method the voice activity decisions can
be made by frame-by-frame processing. Table 2 lists the computing time of the listed VAD
algorithms, run as Matlab programs on a Celeron 2.0 GHz CPU, for processing 118 frames
of an entire recording. The computing time of Chen's VAD is more than four times that of
each of the other three VADs. Besides, the computing time of Chen's VAD depends closely
on the entire length of the recording.
Table 1 Comparison performance
Table 2. Illustrations of subjective listening evaluation and the computing time

VAD types                 Computing time (sec)
Proposed VAD              0.089
Chen's VAD [5]            0.436
Stegmann's VAD [12]       0.077
G.729B VAD [4]            0.091
6.3 Evaluation in non-stationary noise
In practice, additive noise in the real world is non-stationary, since its statistical
properties change over time. We add background noise with decreasing and increasing
levels to a clean English speech sentence, with the SNR set to 0 dB. Figure 6 compares the
proposed wavelet-based VAD with the wavelet-based VAD proposed by S. H. Chen et al. [5]
and the MD-based VAD proposed by A. Ouzounov [9]. The mixed noisy sentence "May I help
you" is shown in Figure 6(a). Noise with increasing and decreasing levels is added to the
front and the back of the clean speech signal. Additionally, an abrupt change of noise is
added in the middle of the clean sentence. The three envelopes of the VAS, MD, and SAE
feature parameters are shown in Figures 6(b) to 6(d), respectively. The performance of
Chen's VAD algorithm is not good in this case: the envelope of the VAS parameter depends
closely on the variable level of the noise. Similarly, the envelope of the MD parameter
fails under a variable noise level. Conversely, the envelope of the proposed SAE parameter
is insensitive to the variable noise level. The proposed wavelet-based VAD algorithm
therefore performs well in non-stationary noise.
Figure 6. Comparisons among the VAS, MD, and proposed SAE feature parameters
7. Conclusions
The proposed VAD is an efficient and simple approach that mainly comprises three-layer DWT (discrete wavelet transform) decomposition, the Teager energy operation (TEO), and the auto-correlation function (ACF). TEO and ACF are applied in each decomposed subband. In this approach, the new feature parameter, SAE, is based on the sum of the values of the MDSACFs derived from the wavelet coefficients of three detail scales and one approximation scale. It has been shown that the SAE parameter can mark the boundaries of speech activity and that its envelope is insensitive to environments with a varying noise level. By means of the MRA property of the DWT, the ACF defined in the subband domain sufficiently discriminates among voiced speech, unvoiced speech, and background noise from the wavelet coefficients. To suppress noise in the wavelet coefficients, a nonlinear TEO is applied to each subband signal to enhance the discrimination between speech and noise. Experimental results show that the SACF with TEO processing provides robust classification of speech, because TEO gives a better representation of formants, resulting in distinct periodicity.
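As a rough illustration of the TEO-then-ACF step summarized above, the sketch below applies Kaiser's discrete Teager energy operator [6], psi[n] = x[n]^2 - x[n-1] x[n+1], to one frame and then computes a normalized autocorrelation of the result. It is only a sketch under stated assumptions: the DWT decomposition is omitted, the frame is a synthetic tone standing in for one subband signal, and the function names `teager_energy` and `normalized_acf` are hypothetical rather than taken from the paper.

```python
import math

def teager_energy(x):
    """Discrete Teager energy psi[n] = x[n]^2 - x[n-1] * x[n+1] (interior samples only)."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def normalized_acf(x, max_lag):
    """Autocorrelation for lags 0..max_lag, scaled so that lag 0 equals 1."""
    r0 = sum(v * v for v in x)
    if r0 == 0.0:
        return [0.0] * (max_lag + 1)
    return [sum(x[n] * x[n + k] for n in range(len(x) - k)) / r0
            for k in range(max_lag + 1)]

# Hypothetical subband frame: a pure tone with amplitude A and digital frequency omega.
A, omega = 0.5, 0.3
frame = [A * math.cos(omega * n) for n in range(64)]

teo = teager_energy(frame)           # constant A^2 * sin^2(omega) for a pure tone
acf = normalized_acf(teo, max_lag=8)
```

For a pure tone the operator output is constant, which is why it tracks amplitude and frequency jointly; on real subband signals the autocorrelation of the TEO output would be expected to show clearer periodicity for voiced speech than for noise.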
References
[1] Cho, Y. D. and Kondoz, A., "Analysis and improvement of a statistical model-based voice activity detector," IEEE Signal Processing Lett., Vol. 8, 276-278, 2001.
[2] Beritelli, F., Casale, S., and Cavallaro, A., "A robust voice activity detector for wireless communications using soft computing," IEEE J. Select. Areas Comm., Vol. 16, 1818-1829, 1998.
[3] Nemer, E., Goubran, R., and Mahmoud, S., "Robust voice activity detection using higher-order statistics in the LPC residual domain," IEEE Trans. Speech and Audio Processing, Vol. 9, 217-231, 2001.
[4] Benyassine, A., Shlomot, E., Su, H. Y., Massaloux, D., Lamblin, C., and Petit, J. P., "ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications," IEEE Communications Magazine, Vol. 35, 64-73, 1997.
[5] Chen, S. H. and Wang, J. F., "A Wavelet-based Voice Activity Detection Algorithm in Noisy Environments," 2002 IEEE International Conference on Electronics, Circuits and Systems (ICECS2002), 995-998, 2002.
[6] Kaiser, J. F., "On a simple algorithm to calculate the 'energy' of a signal," in Proc. ICASSP'90, 381-384, 1990.
[7] Maragos, P., Quatieri, T., and Kaiser, J. F., "On amplitude and frequency demodulation using energy operators," IEEE Trans. Signal Processing, Vol. 41, 1532-1550, 1993.
[8] Jabloun, F., Cetin, A. E., and Erzin, E., "Teager energy based feature parameters for speech recognition in car noise," IEEE Signal Processing Lett., Vol. 6, 259-261, 1999.
[9] Ouzounov, A., "A Robust Feature for Speech Detection," Cybernetics and Information Technologies, Vol. 4, No. 2, 3-14, 2004.
[10] Stegmann, J., Schroder, G., and Fischer, K. A., "Robust classification of speech based on the dyadic wavelet transform with application to CELP coding," Proc. ICASSP, Vol. 1, 546-549, 1996.
[11] Varga, A. and Steeneken, H. J. M., "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., Vol. 12, 247-251, 1993.
[12] Stegmann, J. and Schroder, G., "Robust voice-activity detection based on the wavelet transform," IEEE Workshop on Speech Coding for Telecommunications Proceeding, 99-100, 1997.
Associative Cepstral Statistics Normalization Techniques for Robust Speech Recognition

Wen-hsiang Tu
Dept. of Electrical Engineering, National Chi Nan University, Taiwan
aero3016@ms45.hinet.net

Kuang-chieh Wu
Dept. of Electrical Engineering, National Chi Nan University, Taiwan
s95323529@ncnu.edu.tw

Jeih-weih Hung
Dept. of Electrical Engineering, National Chi Nan University, Taiwan
jwhung@ncnu.edu.tw
ኴ ԫ೯ଃᙃᢝߓอΔڇᠧಛᛩቼՀᙃᢝயຏᄎᐙΔᇠڶ۶ڕ
யچຍᑌऱംᠲΔԫऴࠐאຟਢڼᏆઔߒऱૹរΔءᓵܛਢಾኙڼംᠲאףઔ
֏ଃإط៶Δਢݾޏ٨ऱߓԫڶխΔߒऱઔװመڇΖݾޏጟנ༽Δߒ
ᐛऱอૠࠐࢤᠧಛऱᐙΔڕࠏΚଙᢜװऄΕଙᢜଖፖฆᑇ
ய༼ଃᐛڶאױऱய౨Δڶऄᢞઃࠄ֏ऄፖอૠቹ֏ऄΔຍإ
હནΔ୶ݾ֏إຍԿጟଙᢜᐛᑇאਢܛᓵءΖࢤᠧಛᛩቼՀऱൎڇ
ԫޏ٨ߓհൎࢤऄΖ
ছࢬ༼ऱԿጟᐛᑇإ֏ݾխࢬႊشऱᐛอૠଖΔຏਢطᖞऱ
ࢨऱץࢬऱᐛޣΔڇመءװኔऱઔߒխΔམሎאشᒘ
(codebook)ഗऱޣࠐڤຍࠄอૠଖΔઌኙհছऱऄ౨ޡڶΖڇ ଃೠ(voice activityشࠌΔխݧऱᒘ৬ዌڤߜޏԫנ༽ଚݺᓵรԫຝΔءdetection VAD) ݾࠐሶಛᇆխऱଃګፖॺଃګΔ ৵شܓଃຝऱᐛ
ࢬݧڼᓿղᦞૹ(weight)Δ(codeword)ڗଡᒘޢհᒘխऱم৬ࢬழኙٵ৬ዌᒘΔࠐ৬ዌऱᒘΔᆖኔᢞኔΔאױ༼ࡨᒘڤ(codebook-based)ᐛᑇإ֏ऄऱய౨ΖڇรԲຝΔݺଚঞਢᖞٽՂ૪հᒘڤ (codebook-based)ፖᖞڤ(utterance-based)ᣊऄࢬհᐛอૠᇷಛΔ୶ࢬנᘯऱڤٽ(associative)ᐛᑇإ֏ऄΖڼᣊڤٽऱᄅऄઌለᖞڤፖᒘڤऱऄΔ౨ړޓऱய
Δڶޓயࢤګף༽چᠧಛᛩቼՀଃऱᙃᢝ壄ᒔ৫Ζ
Abstract
The noise robustness of an automatic speech recognition system is one of the most important factors determining its recognition accuracy in a noise-corrupted environment. Among the various approaches, normalizing the statistical quantities of speech features is a very promising direction for creating more noise-robust features. The related feature normalization approaches include cepstral mean subtraction (CMS), cepstral mean and variance normalization (CMVN), histogram equalization (HEQ), etc. In addition, the statistical quantities used in these techniques can be obtained in an utterance-wise manner or a codebook-wise manner. It has been shown that in most cases the latter behaves better than the former. In this paper we mainly focus on two issues. First, we develop a new procedure for constructing the pseudo-stereo codebook used in the codebook-based feature normalization approaches. The resulting codebook is shown to provide a better estimate of the feature statistics, which enhances the performance of the codebook-based approaches. Second, we propose a series of new feature normalization approaches, including associative CMS (A-CMS), associative CMVN (A-CMVN), and associative HEQ (A-HEQ). In these approaches, two sources of statistical information for the features, one from the utterance and the other from the codebook, are properly integrated. Experimental results show that these new feature normalization approaches perform significantly better than the conventional utterance-based and codebook-based ones. As a result, the proposed methods effectively improve the noise robustness of speech features.
ᣂΚ೯ଃᙃᢝΕᒘΕൎࢤଃᐛ
Keywords: automatic speech recognition, codebook, robust speech feature
ԫΕፃᓵ
ᠧಛᛩቼՀΔኙಝᒭፖᇢԲऱࢤګףڇਢΔݾڤऱൎנ༽ಘᓵࢬᓵء
ଃᐛᑇऱอૠإאףࢤ֏ΔאᛩቼऱլΖխݺଚشܓමዿଙ
ᢜএᑇ(mel-frequency cepstral coefficients MFCC)ଃᐛΔٽଃೠݾ(voice activity detection VAD)[1]ፖᐛอૠଖإ֏ऱ壆ݾڍΔࠐ༼ଃᐛףڇ ֏ऄΚإಘᓵऱᐛᑇࢬᓵխءΖࢤᠧಛᛩቼՀऱൎࢤګ
ΰԫαᖞڤ(utterance-based)ᐛᑇإ֏ऄ
ऄװଙᢜڤႚอऱᖞܛ (utterance-based cepstral mean subtraction U-CMS)[2]Εᖞڤଙᢜଖፖฆᑇإ֏ऄ(utterance-based cepstral mean and variance normalization U-CMVN)[3]ፖᖞڤอૠቹ֏ऄ(utterance-based histogram equalization U-HEQ)[4]Ζڼᣊऄਢאԫᖞഗᄷ۷װጩޢԫፂᐛᑇऱอૠࢤΔࠀചᐛᑇإ֏Ζ
ΰԲαᒘڤ(codebook-based)ᐛᑇإ֏ऄ
ಝᒭଃᐛፖᇢଃᐛऱอૠזנଚ۷ጩݺܗᚥࠐᒘط៶ᣊऄਢڼ ଖΔ៶ڼചଃᐛإ֏Ζڇመװऱઔߒᇙ[5][6][7]ΔڼᣊऱऄΔץਔᒘଙᢜڤऄ(codebook-based cepstral mean subtraction C-CMS)ፖᒘװଙᢜڤଖፖฆᑇإ֏ऄ(codebook-based cepstral mean and variance normalization C-CMVN)Δயຟছԫᣊհᖞڤᐛإ֏ऄࠐऱړΖ
ՀΚڕΔ૪ݾऱޏ٨ߓԫנ༽૪ऱԲᣊऄࢬՂאᓵᖕء
ਢല٤ຝऱಝᒭறངऱᐛڤ֏ऄխ[5-7]Δᒘإᐛڤᒘװመڇ ᑇٻၦၦ֏Δຍᑌऱױڤ౨ᄎࠌխڍᒘڗਢኙᚨॺଃऱᙩଃ(silence)
ऱᦞૹڗଡᒘޢழΔٵΔࢤזଃᐛऱለڗᒘࠄຍࠌΔګᠧಛࢨ
ઌΔຍᑌױ౨ᄎࠌհ৵ࢬૠጩऱᐛอૠଖለլ壄ᒔΖءڇᓵխΔݺଚᚨ
ፖګԫಛᇆऱଃ(speech)נೠݾጤរೠ(voice activity detection VAD)شॺଃ(silence)ګΔ৵شࠌଃګऱᐛװ፹ᒘΔٵழΔլٵऱᒘڗᖕኙᚨऱࡨᐛᑇڍؾኒᦞૹ(weight)Δຍጟᄅऱᒘ৬ዌݧᚨޏאױՂ૪հរΔ༼ٺጟᒘڤᐛإ֏ऄऱய౨Ζ
ଚᖞݺਢݧ֏ऄΔإᐛ(associative)ڤٽԱԫᄅऄΔጠנ༽ଚݺ ڼ៶ૠጩᐛऱอૠଖΔࠐऱᐛอૠᇷಛΔشࠌࢬڤፖᖞڤছ૪հᒘٽ
ऱᣊڤፖᖞڤऱऄᒘڤٽᣊڼ֏Ζኔإചᐛऱࠐ
ऄΔ౨ሒࠋޓऱயΖױ౨ڇڂΔڤٽऱऄԱᒘڤऄխޢ
ಛᇆছଡଃᠧಛ۷ऱլᄷᒔயᚨΔ 壄ᒔΖޓऱᐛอૠଖࢬࠌ
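In Section 2 that follows, we briefly introduce the utterance-based feature normalization techniques. Section 3 describes the new construction procedure for the pseudo stereo codebooks, which improves the performance of the codebook-based feature normalization methods. In Section 4, we describe the associative feature normalization methods newly proposed in this paper. Section 5 introduces the speech corpus used in our experiments and presents the experiments and related discussion for the various feature normalization techniques. Finally, Section 6 gives the conclusions and future work.

2. Utterance-based feature normalization techniques

In this section we briefly introduce three feature normalization techniques applied in robust speech recognition: utterance-based cepstral mean subtraction (U-CMS) [2], utterance-based cepstral mean and variance normalization (U-CMVN) [3], and utterance-based cepstral histogram equalization (U-HEQ) [4].

(1) Utterance-based cepstral mean subtraction (U-CMS)

The goal of cepstral mean subtraction (CMS) is for the long-term mean of each dimension of the cepstral coefficients in a speech feature sequence to be 0. When the mean is not 0, we treat it as channel noise and remove it. This is a simple and useful technique against channel effects, and it sometimes also has some effect on additive noise. In most implementations, we first compute the mean of the cepstral coefficients of each dimension over the whole utterance and then subtract that mean from the coefficients of that dimension, which yields the compensated new features. This is called utterance-based cepstral mean subtraction (U-CMS). Following this principle, let $\{X[n],\ n = 1, 2, \ldots, N\}$ be the sequence of cepstral features of one dimension extracted from an utterance. After U-CMS, the new compensated feature sequence $\{X_{\text{U-CMS}}[n],\ n = 1, 2, \ldots, N\}$ is given by

$$X_{\text{U-CMS}}[n] = X[n] - \mu_X, \quad n = 1, 2, \ldots, N \qquad (2.1)$$

where $\mu_X = \frac{1}{N}\sum_{n=1}^{N} X[n]$ and $N$ is the number of frames in the whole utterance.

Therefore, in U-CMS the mean $\mu_X$ used for normalization is obtained from the feature sequence of the original whole utterance.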
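A minimal sketch of utterance-based CMS as in Eq. (2.1) is given below, written in plain Python for illustration. The function name `u_cms` and the toy two-dimensional frames are assumptions made for this example; real MFCC features would have more dimensions and many more frames.

```python
def u_cms(frames):
    """Subtract the per-dimension utterance mean from every frame (Eq. 2.1)."""
    n_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n_frames for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

# Hypothetical utterance: 3 frames, 2 cepstral dimensions.
utterance = [[1.0, 10.0], [2.0, 14.0], [3.0, 12.0]]
normalized = u_cms(utterance)
```

After the call, each dimension of `normalized` has zero mean over the utterance, which is the property CMS aims for.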
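(2) Utterance-based cepstral mean and variance normalization (U-CMVN)

After a speech signal is corrupted by additive noise, there is usually a bias between its cepstral mean and the cepstral mean of the original clean speech, and its variance is usually smaller than the variance of the clean speech cepstral parameters. This causes a mismatch between training and testing features and degrades recognition performance. Cepstral mean and variance normalization (CMVN) normalizes the mean of each dimension of the cepstral features to 0 and the variance to 1, which reduces the distortion described above and improves the robustness of the cepstral features. In CMVN, we first apply cepstral mean subtraction (CMS), so that the mean of each dimension of the cepstral coefficients becomes 0, and then divide each dimension of the resulting coefficients by its standard deviation to obtain the new feature sequence. In U-CMVN (utterance-based cepstral mean and variance normalization), let $\{X[n],\ n = 1, 2, \ldots, N\}$ be the sequence of cepstral features of one dimension of an utterance. After U-CMVN, the new features $\{X_{\text{U-CMVN}}[n],\ n = 1, 2, \ldots, N\}$ are given by

$$X_{\text{U-CMVN}}[n] = \frac{X[n] - \mu_X}{\sigma_X}, \quad n = 1, 2, \ldots, N \qquad (2.2)$$

where

$$\mu_X = \frac{1}{N}\sum_{n=1}^{N} X[n], \qquad \sigma_X = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(X[n] - \mu_X\right)^2}$$

Therefore, in U-CMVN the mean $\mu_X$ and the standard deviation $\sigma_X$ are both obtained from the feature sequence of the whole utterance.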
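The following sketch illustrates Eq. (2.2) for a single cepstral dimension. The name `u_cmvn` and the sample values are hypothetical, and the guard for a constant sequence is an addition made here for numerical safety rather than something stated in the text.

```python
import math

def u_cmvn(seq):
    """Normalize one cepstral dimension to zero mean and unit variance (Eq. 2.2)."""
    n = len(seq)
    mu = sum(seq) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in seq) / n)  # population standard deviation
    if sigma == 0.0:
        # Guard for a constant sequence (an assumption of this sketch, not from the paper).
        return [0.0] * n
    return [(v - mu) / sigma for v in seq]

# Hypothetical one-dimensional feature sequence with mean 5 and standard deviation 2.
seq = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
out = u_cmvn(seq)
```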
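(3) Utterance-based histogram equalization (U-HEQ)

The goal of histogram equalization (HEQ) is for the training and testing speech features to have the same statistical distribution. Through this transformation, the mismatch between testing and training features caused by noise is reduced. The method makes the probability distributions of both the testing and the training speech features approach a common reference probability distribution. The reference distribution used in this paper is the standard normal distribution.

Accordingly, let $\{X[n],\ n = 1, 2, \ldots, N\}$ be the sequence of cepstral features of one dimension of an utterance, let $F_X(x) = P(X \le x)$ be the probability distribution of $X[n]$, obtained from the features $\{X[n],\ n = 1, 2, \ldots, N\}$ of the whole utterance, and let $F_N(x)$ be the reference probability distribution. The transformation of utterance-based histogram equalization (U-HEQ) is then

$$X_{\text{U-HEQ}}[n] = F_N^{-1}\left(F_X\left(X[n]\right)\right) \qquad (2.3)$$

where $X_{\text{U-HEQ}}[n]$ is the new feature after utterance-based histogram equalization.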
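Eq. (2.3) can be sketched with an empirical CDF and the inverse CDF of a standard normal from Python's `statistics.NormalDist`. Two points below are assumptions of this sketch and not taken from the paper: the function name `u_heq`, and the rank offset `(rank - 0.5) / N`, which keeps the CDF value strictly inside (0, 1) so that the inverse normal CDF stays finite.

```python
from bisect import bisect_right
from statistics import NormalDist

def u_heq(seq):
    """Map one feature dimension onto a standard normal via its empirical CDF (Eq. 2.3)."""
    n = len(seq)
    ordered = sorted(seq)
    reference = NormalDist(0.0, 1.0)
    out = []
    for v in seq:
        rank = bisect_right(ordered, v)      # number of samples <= v, i.e. N * F_X(v)
        # Assumed offset: (rank - 0.5) / N avoids evaluating the inverse CDF at 0 or 1.
        out.append(reference.inv_cdf((rank - 0.5) / n))
    return out

# Hypothetical one-dimensional feature sequence.
seq = [0.3, -1.2, 2.5, 0.9, 0.1]
out = u_heq(seq)
```

The mapping is monotonic, so the ordering of the frames is preserved while their histogram is pulled toward the reference distribution.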
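3. Improved codebook-based feature normalization techniques

Using so-called pseudo stereo codebooks to estimate the feature statistics of clean speech and noisy speech, and then applying feature normalization, can effectively improve speech recognition accuracy in noisy environments. The cepstral statistics compensation method proposed in earlier work [5-7] transforms the cepstral coefficients of the noise-corrupted speech so that the statistics of the transformed cepstral features become closer to those of the training speech. That approach compensates only the noisy speech features. The approach proposed in this paper, in contrast, normalizes the cepstral features of both the clean speech and the noisy speech. In addition, in the earlier cepstral statistics compensation method, the codewords are trained from the unprocessed speech features and every codeword has the same weight. In our improved method, we apply the voice activity detection (VAD) technique [1] to the speech signal to separate the speech and non-speech portions, and then use the features of the speech portion to train the codewords. Furthermore, the codewords are assigned different weights according to the number of feature vectors they contain. The speech feature statistics computed from these codewords should therefore be more accurate and more representative of the speech features. Experiments confirm that this modification brings better recognition accuracy.

In the previous section we introduced three utterance-based feature normalization techniques: U-CMS, U-CMVN, and U-HEQ. Here, we use the newly revised codebook construction method to build the pseudo stereo codebooks and carry out a series of improved codebook-based feature normalization techniques.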
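(1) Construction of the pseudo stereo codebooks

In the original codebook-based feature normalization methods [5-7], the codebook is constructed as follows. All speech signals in the training corpus are converted toward mel-frequency cepstral features, and along the way the intermediate features, in which clean speech and noise are linearly additive, are retained. These intermediate features of the clean speech are trained into a codebook, which roughly represents the characteristics of clean speech in the intermediate feature domain. For the test speech, the leading frames of each noisy test utterance are taken as noise and converted into the same intermediate features. Because clean speech and noise are linearly additive in the intermediate feature domain, these noise intermediate features are added directly to every codeword of the previously trained clean speech codebook, which gives a codebook representing noisy speech in the intermediate feature domain. Finally, the codewords of both codebooks are converted to the cepstral domain, and the resulting cepstral codebooks are called pseudo stereo codebooks.

The improved codebook construction proposed in this paper differs from the original one in two respects:
(1) All speech signals in the training corpus are first processed by the voice activity detection (VAD) technique of [1] to detect the speech and silence portions, and only the intermediate features of the speech portion are used to train the clean speech codebook. The original method trains the codebook on the intermediate features of the whole signal, without this step.
(2) Different codewords are assigned different weights according to the number of features they contain: a codeword containing more features receives a larger weight, which implies that the codewords do not occur with equal probability. These weights help the subsequent feature statistics normalization methods estimate more accurate feature statistics. In the original method, the codewords carry no weights, which implicitly assumes that their occurrence probabilities are uniform.

The construction of the pseudo stereo codebooks is detailed below.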
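We first use the voice activity detection technique [1] to extract the speech portion of every utterance in the training corpus. The first half of the mel-frequency cepstral coefficient (MFCC) extraction procedure then converts this speech portion into a sequence of intermediate feature vectors, namely the outputs of the mel filter bank, which lie in the linear spectral domain. The intermediate feature vectors obtained from the corpus are vector quantized (VQ) to build a set of $M$ codewords, denoted $\{\tilde{x}[n] \mid 1 \le n \le M\}$, together with the corresponding weights $\{w_n \mid 1 \le n \le M\}$. All codewords of this clean speech codebook in the intermediate feature domain are then converted to the cepstral domain by the second half of the MFCC extraction procedure:

$$x[n] = f\left(\tilde{x}[n]\right) \qquad (3.1)$$

where $f(\cdot)$ denotes the conversion procedure. Thus $\{x[n], w_n \mid 1 \le n \le M\}$ are the codewords and weights converted to the cepstral domain, that is, the cepstral codebook and weights of the clean speech.

For the noisy speech, we use the clean speech codewords in the intermediate feature domain to build the codebook of the test speech corrupted by the corresponding noise. The noise estimated from each test utterance is represented in the intermediate feature domain (linear spectrum) by a set of vectors $\{\tilde{n}[p] \mid 1 \le p \le P\}$. Because clean speech and noise are linearly additive in the intermediate feature domain, the codewords of the noisy speech can be written as

$$\tilde{y}[m]\Big|_{m=(n-1)P+p} = \tilde{x}[n] + \tilde{n}[p] \qquad (3.2)$$

As in Eq. (3.1), we then convert $\tilde{y}[m]$ to the cepstral domain with the second half of the MFCC extraction procedure:

$$y[m] = f\left(\tilde{y}[m]\right) \qquad (3.3)$$

In addition, the weight $v_m$ of each $y[m]$ is

$$v_m\Big|_{m=(n-1)P+p} = \frac{w_n}{P} \qquad (3.4)$$

Hence the weight of $y[m]$, namely $v_m$, is $\frac{1}{P}$ of the weight $w_n$ of the corresponding clean speech codeword $x[n]$, where $P$ is the number of noise vectors $\tilde{n}[p]$. The set $\{y[m], v_m \mid 1 \le m \le MP\}$ thus gives the codewords and weights representing this noisy speech in the cepstral domain. The two sets $\{x[n], w_n\}$ and $\{y[m], v_m\}$ represent the training speech and the noisy test speech, respectively, and we call them pseudo stereo codebooks. They are called pseudo because the codebook of the noisy speech is not obtained directly from the noisy speech but indirectly from the clean speech codebook and the noise estimates.

(2) Codebook-based feature normalization techniques

In this subsection we introduce the codebook-based feature normalization techniques. As mentioned earlier, these techniques process the cepstral features of both the clean speech and the noisy speech. They use the pseudo stereo codebooks described in the previous subsection to establish the feature statistics and then normalize the features. The three techniques are cepstral mean subtraction (CMS), cepstral mean and variance normalization (CMVN), and cepstral histogram equalization (HEQ). For CMS and CMVN, we use the codebooks and weights $\{x[n], w_n\}$ and $\{y[m], v_m\}$ of the previous subsection to compute approximate statistics of the clean speech and noisy speech features:

$$\mu_{X,i} = \sum_{n=1}^{M} w_n \left(x[n]\right)_i, \qquad \sigma_{X,i}^2 = \sum_{n=1}^{M} w_n \left(x[n]\right)_i^2 - \mu_{X,i}^2 \qquad (3.5)$$

$$\mu_{Y,i} = \sum_{m=1}^{MP} v_m \left(y[m]\right)_i, \qquad \sigma_{Y,i}^2 = \sum_{m=1}^{MP} v_m \left(y[m]\right)_i^2 - \mu_{Y,i}^2 \qquad (3.6)$$

where $(u)_i$ denotes the $i$-th dimension of an arbitrary vector $u$; $\mu_{X,i}$ and $\sigma_{X,i}^2$ are the mean and variance of the $i$-th dimension of the clean speech feature vector $x$; and $\mu_{Y,i}$ and $\sigma_{Y,i}^2$ are the mean and variance of the $i$-th dimension of the noisy speech feature vector $y$. The difference from the earlier methods in [5-7] is that the statistics used here (mean and variance) are obtained as weighted averages, rather than the uniform averages used in [5-7].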
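The construction in Eqs. (3.2) and (3.4) and the weighted statistics in Eqs. (3.5) and (3.6) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the toy codewords are made up, the weights are assumed to sum to 1, and `to_cepstral_like` uses an element-wise logarithm as a stand-in for the conversion f(.), whereas the actual second half of MFCC extraction also applies a DCT.

```python
import math

def build_noisy_codebook(clean_codewords, clean_weights, noise_vectors):
    """Eqs. (3.2) and (3.4): pair every clean codeword with every noise estimate."""
    P = len(noise_vectors)
    noisy, weights = [], []
    for x_n, w_n in zip(clean_codewords, clean_weights):   # n = 1..M (outer index)
        for n_p in noise_vectors:                          # p = 1..P (inner index)
            noisy.append([a + b for a, b in zip(x_n, n_p)])  # additive in the linear spectrum
            weights.append(w_n / P)
    return noisy, weights

def to_cepstral_like(codewords):
    """Stand-in for f(.): element-wise log only (the real conversion also applies a DCT)."""
    return [[math.log(v) for v in cw] for cw in codewords]

def weighted_stats(codewords, weights):
    """Eqs. (3.5)/(3.6): weighted mean and variance for each dimension."""
    dim = len(codewords[0])
    mean = [sum(w * cw[i] for cw, w in zip(codewords, weights)) for i in range(dim)]
    var = [sum(w * cw[i] ** 2 for cw, w in zip(codewords, weights)) - mean[i] ** 2
           for i in range(dim)]
    return mean, var

# Hypothetical toy data: M = 2 clean codewords, P = 2 noise frames, 2 filter-bank channels.
clean = [[1.0, 2.0], [3.0, 4.0]]
w = [0.25, 0.75]                      # assumed normalized so that they sum to 1
noise = [[0.5, 0.5], [1.5, 0.5]]

noisy, v = build_noisy_codebook(clean, w, noise)
mu_x, var_x = weighted_stats(to_cepstral_like(clean), w)
mu_y, var_y = weighted_stats(to_cepstral_like(noisy), v)
```

Because every clean codeword spawns P noisy codewords that each carry 1/P of its weight, the noisy weights still sum to 1, so the same weighted-average formulas apply to both codebooks.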
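Codebook-based cepstral mean subtraction (C-CMS) normalizes the mean of the cepstral features:

$$(\hat{x})_i = (x)_i - \mu_{X,i}, \qquad (\hat{y})_i = (y)_i - \mu_{Y,i} \qquad (3.7)$$

where $\hat{x}$ and $\hat{y}$ are the new feature values of the clean speech feature $x$ and the noisy speech feature $y$ after C-CMS.

Codebook-based cepstral mean and variance normalization (C-CMVN) normalizes both the mean and the variance of the cepstral features:

$$(\hat{x})_i = \frac{(x)_i - \mu_{X,i}}{\sigma_{X,i}}, \qquad (\hat{y})_i = \frac{(y)_i - \mu_{Y,i}}{\sigma_{Y,i}} \qquad (3.8)$$

where $\hat{x}$ and $\hat{y}$ are the new feature values of the clean speech feature $x$ and the noisy speech feature $y$ after C-CMVN.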
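Finally, codebook-based cepstral histogram equalization (C-HEQ) uses the two codebooks $\{x[n], w_n\}$ and $\{y[m], v_m\}$ to compute an approximate probability distribution for each dimension of the clean speech features and of the noisy speech features. It then finds a transformation that makes the probability distribution of each feature dimension of both approach a predefined reference probability distribution. The procedure is as follows.

We use the codebook $\{x[n], w_n\}$ to build the cumulative distribution function of the $i$-th dimension of the clean speech feature, $(x)_i$. Because a codebook is inherently discrete, if $X_i$ denotes the random variable corresponding to $(x)_i$, the probability mass function of $X_i$ can be written as

$$P\left(X_i = (x[n])_i\right) = w_n \qquad (3.9)$$

and the probability density function (pdf) of $X_i$ can be written as

$$f_{X_i}(x) = \sum_{n=1}^{M} w_n\, \delta\left(x - (x[n])_i\right) \qquad (3.10)$$

where $\delta(\cdot)$ is the unit impulse function. The probability distribution of $X_i$, also called the cumulative density function, is the integral of $f_{X_i}(x)$:

$$F_{X_i}(x) = P\left(X_i \le x\right) = \sum_{n=1}^{M} w_n\, u\left(x - (x[n])_i\right) \qquad (3.11)$$

where $u(x)$ is the unit step function, defined as

$$u(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (3.12)$$

The probability distribution of the $i$-th dimension of the speech feature $(x)_i$ is therefore given by $F_{X_i}(x)$ in Eq. (3.11). Likewise, the probability distribution of the $i$-th dimension of the noisy speech feature $(y)_i$, built from the codebook $\{y[m], v_m\}$, is

$$F_{Y_i}(y) = P\left(Y_i \le y\right) = \sum_{m=1}^{MP} v_m\, u\left(y - (y[m])_i\right) \qquad (3.13)$$

Once $F_{X_i}(x)$ and $F_{Y_i}(y)$ have been obtained, following the principle of histogram equalization (HEQ), we normalize the $i$-th dimension of the training speech feature $(x)_i$ and of the noisy test speech feature $(y)_i$ by

$$(\hat{x})_i = F_N^{-1}\left(F_{X_i}\left((x)_i\right)\right) \qquad (3.14)$$

$$(\hat{y})_i = F_N^{-1}\left(F_{Y_i}\left((y)_i\right)\right) \qquad (3.15)$$

where $F_N$ is a reference probability distribution (usually the standard normal distribution), $F_N^{-1}$ is the inverse function of $F_N$, and $\hat{x}$ and $\hat{y}$ are the new feature values after C-HEQ normalization.

To summarize, in the earlier codebook-based feature normalization techniques the codewords are trained from the features of the original unprocessed signal, and all codewords carry the same weight. In the improved codebook construction proposed here, we first apply voice activity detection to separate the speech and non-speech portions of the speech signal and train the codewords on the features of the speech portion only. Each codeword is then given a weight according to the number of features it contains. The speech feature statistics or probability distributions computed from these codewords should therefore be more accurate and more representative. The experiments in Section 5 verify that the codebook-based feature normalization methods built on this improved codebook achieve better recognition results.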
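The weighted step-function CDF of Eqs. (3.11) and (3.13) and the mapping of Eqs. (3.14) and (3.15) can be sketched for one feature dimension as below. The names `weighted_cdf` and `c_heq`, the toy codeword values, and the clipping parameter `eps` are assumptions of this sketch; the clipping is added only so that the inverse normal CDF is never evaluated at 0 or 1, and is not described in the text.

```python
from statistics import NormalDist

def weighted_cdf(x, codeword_values, weights):
    """Eq. (3.11)/(3.13): total weight of the codewords whose value does not exceed x."""
    return sum(w for c, w in zip(codeword_values, weights) if c <= x)

def c_heq(x, codeword_values, weights, eps=1e-3):
    """Eq. (3.14)/(3.15) for one dimension, with clipping added for numerical safety."""
    p = weighted_cdf(x, codeword_values, weights)
    p = min(max(p, eps), 1.0 - eps)     # assumed clipping, keeps inv_cdf finite
    return NormalDist(0.0, 1.0).inv_cdf(p)

# Hypothetical i-th components of three codewords and their weights (summing to 1).
values = [-1.0, 0.0, 2.0]
weights = [0.2, 0.3, 0.5]

low, mid, high = c_heq(-5.0, values, weights), c_heq(0.0, values, weights), c_heq(10.0, values, weights)
```

Because the CDF is a staircase with only M steps, all feature values falling between two adjacent codewords map to the same output; larger codebooks give a finer staircase.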
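4. Associative feature normalization techniques

As mentioned in the previous section, the codebook-based feature normalization methods generally perform better than the utterance-based ones and also have the advantage of on-line computation. Their possible weakness is insufficient noise information, which makes the resulting noisy speech codebook imprecise. In this section we address this weakness by proposing associative feature normalization techniques. In short, these methods integrate the feature statistics obtained by the codebook-based and the utterance-based methods introduced earlier, in the hope of obtaining more accurate statistics for the various feature normalization methods. We refer to them collectively as associative feature normalization methods. The following subsections introduce associative cepstral mean subtraction (associative CMS, A-CMS), associative cepstral mean and variance normalization (associative CMVN, A-CMVN), and associative cepstral histogram equalization (associative HEQ, A-HEQ).

(1) Associative cepstral mean subtraction (A-CMS) and associative cepstral mean and variance normalization (A-CMVN)

This subsection introduces the two feature normalization methods A-CMS and A-CMVN. By adjusting a parameter $\alpha$, we suitably integrate the statistical information of the codebook and of the whole-utterance features, in the hope of achieving better recognition results.

For the features of the whole utterance, let $X = \{X_1, X_2, \ldots, X_N\}$ be the sequence of cepstral features of one dimension extracted from a training or testing utterance. The utterance-based mean and variance of the features are

$$\mu_u = \frac{1}{N}\sum_{i=1}^{N} X_i \qquad (4.1)$$

$$\sigma_u^2 = \frac{1}{N}\sum_{i=1}^{N}\left(X_i - \mu_u\right)^2 \qquad (4.2)$$

where $\mu_u$ is the utterance-based feature mean, $\sigma_u^2$ is the utterance-based feature variance, and $N$ is the number of frames in the utterance.

For the features in the codebook, let $C = \{C_1, C_2, \ldots, C_M\}$ be the set of values, in the same dimension as above, of the codewords corresponding to the same utterance. The codebook-based mean and variance of the features are

$$\mu_c = \sum_{j=1}^{M} w_j C_j \qquad (4.3)$$

$$\sigma_c^2 = \sum_{j=1}^{M} w_j C_j^2 - \mu_c^2 \qquad (4.4)$$

where $\mu_c$ is the codebook-based feature mean, $\sigma_c^2$ is the codebook-based feature variance, $w_j$ is the weight of each codeword, and $M$ is the number of codewords.

In associative cepstral mean subtraction (A-CMS), the feature mean $\mu_a$ is computed as

$$\mu_a = \alpha\,\mu_c + (1 - \alpha)\,\mu_u \qquad (4.5)$$

where $\mu_u$ and $\mu_c$ are given by Eqs. (4.1) and (4.3), and $\alpha$ is a weighting value with $0 \le \alpha \le 1$. The new features after A-CMS are therefore

$$X_i^{\text{A-CMS}} = X_i - \mu_a, \quad 1 \le i \le N \qquad (4.6)$$

In associative cepstral mean and variance normalization (A-CMVN), the feature mean $\mu_a$ and variance $\sigma_a^2$ are computed as

$$\mu_a = \alpha\,\mu_c + (1 - \alpha)\,\mu_u \qquad (4.7)$$

$$\sigma_a^2 = \alpha\left(\sigma_c^2 + \mu_c^2\right) + (1 - \alpha)\left(\sigma_u^2 + \mu_u^2\right) - \mu_a^2 \qquad (4.8)$$

where $\mu_u$, $\mu_c$, $\sigma_u^2$, and $\sigma_c^2$ are given by Eqs. (4.1), (4.3), (4.2), and (4.4), and $\alpha$ is a weighting value with $0 \le \alpha \le 1$. The new features after A-CMVN are

$$X_i^{\text{A-CMVN}} = \frac{X_i - \mu_a}{\sigma_a} \qquad (4.9)$$

Eqs. (4.5), (4.7), and (4.8) show that the value of $\alpha$ determines the proportion in which the codebook-based statistics and the utterance-based statistics are used in the associative methods. When $\alpha = 1$, A-CMS or A-CMVN reduces to the original codebook-based CMS (C-CMS) or codebook-based CMVN (C-CMVN). Conversely, when $\alpha = 0$, A-CMS or A-CMVN reduces to the original utterance-based CMS (U-CMS) or utterance-based CMVN (U-CMVN).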
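The interpolation of utterance-based and codebook-based statistics in Eqs. (4.1) to (4.9) can be sketched for one feature dimension as follows. The function names and the toy numbers are hypothetical, and the codeword weights are assumed to sum to 1. Note that Eq. (4.8) interpolates second moments rather than variances, which is what keeps the combined variance non-negative.

```python
import math

def associative_stats(frames, codewords, weights, alpha):
    """Eqs. (4.1)-(4.5), (4.7), (4.8): combined mean and variance for one dimension."""
    n = len(frames)
    mu_u = sum(frames) / n
    var_u = sum((x - mu_u) ** 2 for x in frames) / n
    mu_c = sum(w * c for c, w in zip(codewords, weights))
    var_c = sum(w * c * c for c, w in zip(codewords, weights)) - mu_c ** 2
    mu_a = alpha * mu_c + (1.0 - alpha) * mu_u
    var_a = (alpha * (var_c + mu_c ** 2)
             + (1.0 - alpha) * (var_u + mu_u ** 2)
             - mu_a ** 2)
    return mu_a, var_a

def a_cms(frames, codewords, weights, alpha):
    """Eq. (4.6): subtract the associative mean."""
    mu_a, _ = associative_stats(frames, codewords, weights, alpha)
    return [x - mu_a for x in frames]

def a_cmvn(frames, codewords, weights, alpha):
    """Eq. (4.9): subtract the associative mean and divide by the associative std."""
    mu_a, var_a = associative_stats(frames, codewords, weights, alpha)
    sigma_a = math.sqrt(var_a)
    return [(x - mu_a) / sigma_a for x in frames]

# Hypothetical data for one dimension: 4 frames, 2 codewords with equal weights.
frames = [1.0, 2.0, 3.0, 6.0]
codewords = [2.0, 4.0]
weights = [0.5, 0.5]

mu_a, var_a = associative_stats(frames, codewords, weights, alpha=0.5)
normalized = a_cmvn(frames, codewords, weights, alpha=0.5)
```

Setting `alpha=0.0` reproduces the utterance-based statistics and `alpha=1.0` the codebook-based ones, matching the two limiting cases described above.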
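(2) Associative cepstral histogram equalization (associative HEQ, A-HEQ)

In this subsection we introduce associative histogram equalization (A-HEQ). Following the same idea as before, we try to integrate the statistical information of the features of an utterance with that of its corresponding set of codewords (the codebook), and then construct a probability distribution $F_X(x) = P(X \le x)$ representing the features of the utterance, to be used by HEQ for normalizing the features. The steps of A-HEQ are as follows.

Let the feature sequence of one dimension of the utterance to be normalized be $\{X_1, X_2, \ldots, X_N\}$, where $N$ is the number of features in the sequence. Let the codewords of the same dimension be $\{C_1, C_2, \ldots, C_M\}$ with weights $\{w_1, w_2, \ldots, w_M\}$, where $M$ is the number of codewords. First, we set a parameter $\beta$ ($\beta \ge 0$), which represents the proportion of codebook-based information relative to utterance-based information. Next, we generate a set of $\beta N$ new features $\{\tilde{C}_k\}$ from the codewords $\{C_m\}$ according to their weights $\{w_m\}$: the new set $\{\tilde{C}_k\}$ contains $[\beta N \times w_m]$ features whose value is exactly $C_m$, where $[\beta N \times w_m]$ denotes $\beta N \times w_m$ rounded to the nearest integer. In other words, the new features $\{\tilde{C}_k\}$ form a new codeword set that incorporates the weights: the larger the weight $w_m$ of a codeword $C_m$, the more often it appears, $[\beta N \times w_m]$ times, in $\{\tilde{C}_k\}$. For example, if the codeword set is $\{3, 5, 7\}$ with weights $\{0.2, 0.5, 0.3\}$ and the total number of new features is 20, then $\{\tilde{C}_k\}$ contains four 3s ($20 \times 0.2 = 4$), ten 5s ($20 \times 0.5 = 10$), and six 7s ($20 \times 0.3 = 6$), so that

$$\{\tilde{C}_k\} = \{\underbrace{3, 3, 3, 3}_{4},\ \underbrace{5, 5, \ldots, 5}_{10},\ \underbrace{7, 7, \ldots, 7}_{6}\}$$

(In practice, because of rounding, the final number of new features $\{\tilde{C}_k\}$ may not be exactly $\beta N$, that is, exactly $\beta$ times the number of features $N$.)

We then pool the features $\{X_1, X_2, \ldots, X_N\}$ with the new features $\{\tilde{C}_1, \tilde{C}_2, \ldots, \tilde{C}_{\beta N}\}$ that represent the codewords, and jointly obtain a probability distribution representing the features of the utterance:

$$F_X(x) = \frac{1}{(1 + \beta)N}\left[\sum_{n=1}^{N} u\left(x - X_n\right) + \sum_{k=1}^{\beta N} u\left(x - \tilde{C}_k\right)\right] \qquad (4.10)$$

Finally, following the principle of HEQ, we normalize the features as

$$x^{\text{A-HEQ}} = F_N^{-1}\left(F_X(x)\right) \qquad (4.11)$$

where $F_N$ is the reference probability distribution, $x$ is an original feature (one of the $\{X_1, X_2, \ldots, X_N\}$ above), and $x^{\text{A-HEQ}}$ is the new feature obtained by A-HEQ.

Eq. (4.10) shows that the probability distribution $F_X(x)$ of the speech features is determined jointly by the utterance features $X_n$ and the new codeword features $\tilde{C}_k$; there are $N$ of the former and about $\beta N$ of the latter. The parameter $\beta$ therefore determines how strongly the new codeword features $\tilde{C}_k$ influence $F_X(x)$ in A-HEQ. When $\beta = 0$, the codeword information is ignored entirely and A-HEQ reduces to the utterance-based HEQ (U-HEQ) introduced earlier. When $\beta$ is very large ($\beta \to \infty$), the information of the original features $X_n$ is almost entirely neglected, and A-HEQ approaches the codebook-based HEQ (C-HEQ) introduced earlier.

In this section we have introduced the associative feature normalization techniques, which integrate the feature statistics used by both the utterance-based and the codebook-based techniques described earlier. By adjusting the parameters $\alpha$ and $\beta$ in Eqs. (4.5), (4.7), (4.8), and (4.10), we can flexibly choose the proportion of the statistical information that is used. The experiments in the next section show that these associative feature normalization techniques bring better speech recognition accuracy.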
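The A-HEQ steps (replicating codewords by weight, pooling them with the utterance features, and mapping through the inverse reference CDF) can be sketched as below. Several details are assumptions of this sketch rather than statements from the paper: the function names, the half-up rounding via `int(v + 0.5)`, the normalization by the actual pooled count instead of the nominal (1 + beta) N, and the `(rank - 0.5)` offset that keeps the CDF value strictly inside (0, 1).

```python
from bisect import bisect_right
from statistics import NormalDist

def expand_codebook(codewords, weights, beta, n_frames):
    """Repeat each codeword about beta * N * w_m times (rounded to the nearest integer)."""
    expanded = []
    for c, w in zip(codewords, weights):
        expanded.extend([c] * int(beta * n_frames * w + 0.5))
    return expanded

def a_heq(frames, codewords, weights, beta):
    """Eqs. (4.10)-(4.11) for one dimension, using the pooled empirical CDF."""
    pooled = sorted(list(frames) + expand_codebook(codewords, weights, beta, len(frames)))
    total = len(pooled)                       # actual count; may differ slightly from (1 + beta) * N
    reference = NormalDist(0.0, 1.0)
    out = []
    for x in frames:
        rank = bisect_right(pooled, x)        # number of pooled samples <= x
        out.append(reference.inv_cdf((rank - 0.5) / total))  # assumed offset, avoids p = 1
    return out

# The codeword example from the text: 20 replicated features in total.
replicated = expand_codebook([3, 5, 7], [0.2, 0.5, 0.3], beta=1.0, n_frames=20)

# Hypothetical utterance of 4 frames in one dimension.
frames = [1.0, 4.0, 6.0, 8.0]
utterance_only = a_heq(frames, [3, 5, 7], [0.2, 0.5, 0.3], beta=0.0)
combined = a_heq(frames, [3, 5, 7], [0.2, 0.5, 0.3], beta=5.0)
```

With `beta=0.0` no codewords are added and the result depends only on the utterance, as in U-HEQ; increasing `beta` lets the codebook dominate the pooled distribution.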
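5. Recognition experiments and related discussion

This section first introduces the speech database used in this paper and the way system performance is evaluated. The rest of the section presents the recognition experiments, results, and discussion for the various robust speech feature techniques proposed in this paper.

(1) Speech database

This paper uses the AURORA 2 speech database [8] of the European Telecommunication Standard Institute (ETSI). It consists of connected English digit strings recorded by adult male and female American speakers. The clean speech is connected digit speech recorded in a clean environment, to which eight different kinds of additive noise and channel effects are then added. The additive noises are subway, babble, car, exhibition, restaurant, street, airport, and train station, eight environmental noises in total. There are two channel effects, G.712 and MIRS [9]. The AURORA 2 database provides two training conditions and three test conditions. Because this paper discusses additive noise only, we use one of the training conditions and two of the test conditions.

(2) Experimental setup

The features used in this paper are 13-dimensional mel-frequency cepstral coefficients (MFCC), from dimension 0 to dimension 12, plus their first-order and second-order delta coefficients, for a total of 39 feature dimensions. The models are trained with the Hidden Markov Model Toolkit (HTK) [10], producing 11 digit models (oh, zero, one, ..., nine) and one silence model. Each digit model contains 16 states, and each state contains 20 Gaussian mixture components.

(3) Recognition results and discussion for the robustness techniques

1. Recognition results of the improved codebook-based feature normalization methods

In this subsection we present the recognition results of codebook-based cepstral mean subtraction (C-CMS), codebook-based cepstral mean and variance normalization (C-CMVN), and codebook-based cepstral histogram equalization (C-HEQ) when the new codebook construction procedure proposed in this paper is applied. We vary the number of codewords $M$ over 16, 64, and 256 to observe its effect. For the noise estimates $\{\tilde{n}[p],\ 1 \le p \le P\}$, we take the first 10 frames of each test utterance as representative of the noise, that is, $P = 10$. Tables 2, 3, and 4 give the recognition accuracy obtained by C-CMS, C-CMVN, and C-HEQ with the new codebook construction procedure for different numbers of codewords $M$ (averaged over the five signal-to-noise ratios 20 dB, 15 dB, 10 dB, 5 dB, and 0 dB). AR and RR are the absolute error rate reduction and the relative error rate reduction with respect to the baseline experiment. In these tables, the C-CMS and C-CMVN entries designated as original correspond to the original codebook construction procedure [5-7], and U-CMS, U-CMVN, and U-HEQ are the utterance-based CMS, CMVN, and HEQ. Note that the literature on the original codebook-based feature normalization [5-7] covers only C-CMS and C-CMVN and does not introduce C-HEQ. In Table 4 we therefore compare the new C-HEQ only with the utterance-based HEQ (U-HEQ).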
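As a worked example of the two figures of merit, the sketch below computes AR and RR from recognition accuracies. The formulas are inferred from the reported numbers (AR as the accuracy gain in percentage points, RR as that gain divided by the baseline error rate) and the function name is hypothetical. The inputs are the Set A and Set B accuracies of the baseline and of U-CMS as reported in the results tables, with decimal points restored.

```python
def error_rate_reductions(baseline_acc, method_acc):
    """Return (AR, RR) in percent, given two recognition accuracies in percent."""
    ar = method_acc - baseline_acc                # absolute error rate reduction
    rr = 100.0 * ar / (100.0 - baseline_acc)      # relative to the baseline error rate
    return ar, rr

# Averages over Set A and Set B, computed before rounding.
baseline_avg = (71.92 + 67.79) / 2.0
u_cms_avg = (79.37 + 82.47) / 2.0

ar, rr = error_rate_reductions(baseline_avg, u_cms_avg)
```

Working from the unrounded averages gives values that round to the 11.07 and 36.71 listed for U-CMS; starting from the rounded averages instead would shift the last digit slightly.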
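Table 1. Information on the Aurora-2 corpus used in the experiments

AURORA 2 speech database
Sampling rate: 8 kHz
Speech content: the digits 0 to 9 (zero, one, two, three, four, five, six, seven, eight, nine, oh), 11 pronunciations in total
Utterance length: each utterance contains no more than seven digits
Training data: 8440 utterances; convolutional noise: G.712 channel; additive noise: none
Test data, noise environment A: 28028 utterances; convolutional noise: G.712 channel; additive noise: subway, babble, car, exhibition; signal-to-noise ratio (SNR): clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB
Test data, noise environment B: 28028 utterances; convolutional noise: G.712 channel; additive noise: restaurant, street, airport, train station; signal-to-noise ratio (SNR): clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB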
ൕຍԿଡऱΔݺଚױᨠኘՀ٨រΚ
Ϥ1 CMS ऄߢΔࡨհ C-CMS(C-CMS)ઌኙഗኔޡለ(ڇڕN =256ՀΔڇ Set AՀ༼Ա 600Δڇ Set BՀ༼Ա 741)ΔயᖞڤCMS(U-CMS)ࠐऱΔΔݺଚנ༽ࢬऱᄅ C-CMSΔঞࠐထऱڇڕ)ޡN =256ՀΔڇ Set AՀ༼Ա 954Δڇ Set BՀ༼Ա 1370)ΔڼطᢞኔΔݺଚشࢬऱᄅऱᒘ৬ዌݧᒔኔ౨ڶய༼ C-CMS ऱயΔயࠀլᄎᙟထᒘڗᑇؾऱՕΔڶऱ֏Ζயڇ Set AՀᚌ U-CMSΔڇ Set BՀঞฃ U-CMSΔຍױ౨ڇڂΔC-CMS ڇᠧಛ۷Δຍԫଃছଡଃشࠌ Set B ॺڼ(non-stationary)ᠧಛᛩቼխਢለլ壄ᒔऱΖ
Ϥ2 CMVN ऄߢΔࡨհ C-CMVN(ܛ C-CMVN)ઌኙഗኔឈբڶԱլᙑऱᙃᢝ༼(ڇڕM =256 ՀΔڇ Set A Հ༼Ա 1475Δڇ Set B Հ༼Ա1846)Δਢઌለᖞڤ CMVN(U-CMVN)ߢΔڇM =16ፖM =64ՀΔயຟ U-CMVN ᝫΔΔݺଚנ༽ࢬհᄅऱ C-CMVNΔঞڶऱޡΔᓵڇM =16ΕM =64 Mࢨ =256 ՀΔயຟࡨऱ C-CMVN ᝫړΔຟᚌU-CMVN(ႛڇM =16ழΔSet Bհᙃᢝฃ U-CMVN)ΔڼطᢞኔΔݺଚشࢬऱᄅऱᒘ৬ዌݧᒔኔ౨ڶய༼ C-CMVN ऱயΔயࠀլᄎᙟထᒘڗऱՕΔڶऱ֏Ζ
Ϥ3 HEQऄߢΔC-HEQٵᑌՈ౨ڶய༼ᙃᢝΔᓵڇ AᠧಛᛩቼՀࢨ
BᠧಛᛩቼՀΔ ᙃᢝຟ U-HEQࠐΔݺଚංױڂ౨ڇΔC-HEQທڂΔזᠧಛଃऱԫᇢଃऱছଡଃޢאᠧಛऱ۷ՂΔਢڇ
ګऱᠧಛଃᒘլജ壄ᄷΔທࢬΔᖄીߩᠧಛᇷಛլګ C-HEQ ᙃᢝU-HEQᝫऱΖ
Table 2. Recognition accuracy (%) of U-CMS, the original C-CMS, and the new C-CMS

Method                      Set A   Set B   Average   AR      RR
Baseline                    71.92   67.79   69.86     -       -
U-CMS                       79.37   82.47   80.92     11.07   36.71
Original C-CMS (M=16)       74.21   70.81   72.51     2.65    8.81
Original C-CMS (M=64)       74.03   70.74   72.39     2.53    8.39
Original C-CMS (M=256)      77.92   75.20   76.56     6.71    22.24
New C-CMS (M=16)            79.04   79.56   79.30     9.45    31.33
New C-CMS (M=64)            80.79   80.19   80.49     10.64   35.28
New C-CMS (M=256)           81.46   81.49   81.48     11.62   38.55
Table 3. Recognition accuracy (%) of U-CMVN, the original C-CMVN, and the new C-CMVN

Method                      Set A   Set B   Average   AR      RR
Baseline                    71.92   67.79   69.86     -       -
U-CMVN                      85.03   85.56   85.30     15.44   51.22
Original C-CMVN (M=16)      84.44   82.40   83.42     13.57   45.00
Original C-CMVN (M=64)      84.13   81.53   82.83     12.98   43.04
Original C-CMVN (M=256)     86.67   86.25   86.46     16.61   55.08
New C-CMVN (M=16)           85.41   85.21   85.31     15.46   51.27
New C-CMVN (M=64)           86.92   86.81   86.87     17.01   56.43
New C-CMVN (M=256)          87.10   87.32   87.21     17.36   57.57
Table 4. Recognition accuracy (%) of U-HEQ and the new C-HEQ

Method                      Set A   Set B   Average   AR      RR
Baseline                    71.92   67.79   69.86     -       -
U-HEQ                       87.00   88.33   87.67     17.81   59.08
C-HEQ (M=16)                84.03   84.46   84.25     14.39   47.74
C-HEQ (M=64)                86.32   85.90   86.11     16.26   53.92
C-HEQ (M=256)               86.22   86.07   86.15     16.29   54.04
֏ऄհᙃᢝإᐛᑇڤٽ2հݾ֏إᐛᑇ(associative)ڤٽհנ༽ࢬᓵءଚലտฯݺຍԫᆏխΔڇ
ᙃᢝΔຍԿጟݾڤٽଙᢜװऄ(associative CMS A-CMS)Εٽ֏อૠቹڤٽ֏ऄ(associative CMVN A-CMVN)ፖإଙᢜଖፖฆᑇڤऄ(associative histogram equalization A-HEQ)Ζڇ A-CMSΕA-CMVNፖ A-HEQԿጟإ
֏ݾխΔڇطլٵऱᒘڗᑇؾN ՀΔขࠋسᙃᢝऱ ଖ((45)ڤڕΕ(4-7)ፖ(48)խࢨ(قࢬ ଖ((410)ڤڕխقࢬ)լጐઌٵΔאڇڼڂՀऱኔᙃᢝխΔݺଚऱNٵլڇܧ ଖழΔࢬขࠋسᙃᢝհ ଖࢨ ଖհΖ
ଈ٣Δն A-CMSڇᒘڗᑇؾN 16Ε64ፖ 256ՀΔࢬऱࠋᙃᢝΔԱለದߠΔݺଚՈലԲխऱഗءኔΕC-CMS(M =256)ፖ U-CMSऱᙃᢝڇ٨խΖൕڼխΔݺଚאױᨠኘאՀጟݮΚ
Ϥ1 ᑇMڗᒘڇΔᓵߢኔءഗऄ(A-CMS)ઌለװଙᢜڤٽ =16Ε64ፖ 256ՀΔᙃᢝઃڶՕऱޡΔԿڇ AᠧಛᛩቼՀڶ 1186Ε1130ፖ 1098ऱᙃᢝ༼Δڇ B ᠧಛᛩቼՀڶ 1776Ε1682ፖ 1683ऱᙃᢝ༼Δױڼطנ A-CMSڶլᙑհᐛൎ֏யΖ
Ϥ2 A-CMSٺڇጟլٵऱᒘڗᑇN հՀΔᙃᢝઃ C-CMSፖ U-CMSࠐړΔխڇN =16ழ౨ࠋڶऱயΔڇ Aᠧಛᛩቼፖ BᠧಛᛩቼՀհᙃᢝ 8378ፖ 8555Δઌለ C-CMS M =256 ᙃᢝΔA-CMSࠋհࢬڇ A ᠧಛᛩቼፖ B ᠧಛᛩቼՀޡԱ ࡉ232 406ΔຍޡࠄຟقԱA-CMSᚌ C-CMSΖ৵ઌለ U-CMSΔA-CMSڇ Aᠧಛᛩቼፖ BᠧಛᛩቼՀᙃᢝאױ༼ ࡉ441 308ΖطڼڂኔᑇᖕխאױᢞΔઌኙ C-CMSፖ U-CMSߢΔA-CMSຟאױለړऱᙃᢝΔຍױ౨ਢڂ A-CMSٵழᖞٽԱ C-CMSፖ U-CMSشࢬऱอૠᇷಛΔޓאࢬ౨ڶயޏଃڇᠧಛՀऱൎࢤΖ
Table 5. Recognition accuracy (%) of U-CMS, the new C-CMS, and A-CMS

Method                        Set A   Set B   Average   AR      RR
Baseline                      71.92   67.79   69.86     -       -
U-CMS                         79.37   82.47   80.92     11.07   36.71
C-CMS (M=256)                 81.46   81.49   81.48     11.62   38.55
A-CMS (M=16, alpha=0.5)       83.78   85.55   84.67     14.81   49.13
A-CMS (M=64, alpha=0.6)       83.22   84.61   83.92     14.06   46.64
A-CMS (M=256, alpha=0.6)      82.90   84.62   83.76     13.91   46.13
൷ထΔքA-CMVNڇᒘڗᑇؾM 16Ε64ፖ256ՀΔࢬऱࠋᙃᢝ
ΔڇխΔݺଚՈנ٨ԿխऱഗءኔΕC-CMVN(M =256)ፖU-CMVNऱᙃᢝࠎאለΖൕڼխΔݺଚאױᨠኘאՀጟݮΚ
Ϥ1 Mؾᑇڗᒘڇ֏ऄ(A-CMVN)إଙᢜଖፖฆᑇڤٽ =16Ε64ፖ 256ՀΔઌለഗءኔߢΔᙃᢝઃڶՕऱޏΔຍԿጟ A-CMVNڇ AᠧಛᛩቼՀڶ 1619Ε1608ፖ 1543ऱᙃᢝ༼Δڇ B ᠧಛᛩቼՀڶ2118Ε2077ፖ 2026ऱᙃᢝ༼Δאױڼط A-CMVNᒔኔ౨ࢤګףᠧಛኙଃᐛऱեឫΔ༼ᙃᢝ壄ᒔ৫Ζ Ϥ2 A-CMVNٺڇጟᒘڗᑇNऱݮՀΔᙃᢝઃC-CMVNΕU-CMVNࠐړΔխאN =16ழࠋΔڇAᠧಛᛩቼፖBᠧಛᛩቼՀհᙃᢝ88118897ࡉΔઌለC-CMVNM AڇᙃᢝΔA-CMVNࠋհࢬ256=ᠧಛᛩቼፖBᠧಛᛩቼঞޡԱ101ፖ165ΔຍޡࠄຟقԱA-CMVNᚌC-CMVNΙᇿU-CMVNለழΔA-CMVNڇAᠧಛᛩቼፖBᠧಛᛩቼՀΔᙃᢝ341ࡉ308༽אױΔઌኙޏ2055ፖ2362Ζᣊۿհছऱ
A-CMSΔA-CMVNٵழᖞٽԱC-CMVNፖU-CMVNشࢬऱอૠᇷಛΔݺڼڂଚቃཚໂԱࠋޓऱଃᐛൎ֏ऱய ΔኔᑇᖕՈᒔኔᢞԱA-CMVNऱᚌC-CMVNፖU-CMVNΖ
Table 6. Recognition accuracy (%) of U-CMVN, the new C-CMVN, and A-CMVN

Method                        Set A   Set B   Average   AR      RR
Baseline                      71.92   67.79   69.86     -       -
U-CMVN                        85.03   85.56   85.30     15.44   51.22
C-CMVN (M=256)                87.10   87.32   87.21     17.36   57.57
A-CMVN (M=16, alpha=0.7)      88.11   88.97   88.54     18.69   61.98
A-CMVN (M=64, alpha=0.8)      88.00   88.56   88.28     18.43   61.12
A-CMVN (M=256, alpha=0.8)     87.35   88.05   87.70     17.85   59.20
৵ΔԮA-HEQڇᒘڗᑇؾM 16Ε64ፖ256ՀΔࢬऱࠋᙃᢝΔԱለದߠΔݺଚՈലխऱഗءኔΕC-HEQ(M=256)ፖU-HEQऱᙃᢝ٨ ΚݮՀጟאᨠኘאױଚݺխΔڼխΖൕڇ
Ϥ1 ኙڤٽอૠቹ֏ऄ(A-HEQ)ߢΔᓵڇᒘڗᑇM =16Ε64ፖ256ՀΔᙃᢝઌለഗءኔߢΔຟڶՕऱޏΔԿڇAᠧಛᛩቼՀڶ1815Ε1728ፖ1576ऱᙃᢝ༼ΔڇBᠧಛᛩቼՀ2308ڶΕ2236ፖ2110ऱᙃᢝ༼ΔقԱA-HEQڇଃᐛൎࢤऱய౨Δઌለհছࢬ૪ऱጟڤٽᐛإ֏ऄA-CMSፖA-CMVNΔA-HEQऱޓᚌฆΖ Ϥ2 A-HEQٺڇጟᒘڗᑇMऱݮՀΔᙃᢝઃC-HEQፖU-HEQࠐړΔխאM=16ࢬऱᙃᢝࠋΔڇAᠧಛᛩቼፖBᠧಛᛩቼՀհᙃᢝ90079087ࡉΔઌለC-HEQM AᠧಛᛩڇᙃᢝΔA-HEQࠋհࢬ256=ቼፖBᠧಛᛩቼՀᙃᢝঞޡԱ385ፖ480ΔຍޡࠄຟقԱA-HEQᚌAᠧಛᛩቼፖBᠧಛᛩቼՀᙃᢝڇC-HEQΙᇿU-HEQለழΔA-HEQ༼Ա307ፖ254Δઌኙޏ2362ፖ2176ΖᣊۿհছऱΔຍᇙA-HEQC-HEQፖܛऱऄΔڤፖᖞڤᒘऱऄᚌڤٽᢞԱڻଚ٦ݺU-HEQޓ౨༼ᠧಛᛩቼՀଃᙃᢝऱ壄ᒔ৫Ζ
Table 7. Recognition accuracy (%) of U-HEQ, the new C-HEQ, and A-HEQ

Method                        Set A   Set B   Average   AR      RR
Baseline                      71.92   67.79   69.86     -       -
U-HEQ                         87.00   88.33   87.67     17.81   59.08
C-HEQ (M=256)                 86.22   86.07   86.15     16.29   54.04
A-HEQ (M=16, beta=0.9)        90.07   90.87   90.47     20.62   68.39
A-HEQ (M=64, beta=0.9)        89.20   90.15   89.68     19.82   65.75
A-HEQ (M=256, beta=1)         87.68   88.89   88.29     18.43   61.14
քΕᓵፖࠐآ୶ඨ ऄװΔଙᢜݾ֏إಘᓵऱᐛᑇଚݺᓵխΔءڇ
(CMS)Εଙᢜଖፖฆᑇإ֏ऄ(CMVN)ፖଙᢜอૠቹ֏ऄ(HEQ)ΔຍԿጟݾઃႊشࠌᐛऱอૠၦΖႚอՂΔຍࠄอૠၦਢᆖطԫᖞऱଃᐛ۷
ΖڼڂΔኙᚨऱݾΔݺଚอጠᖞڤ(utterance-based)ᐛᑇإ֏ݾΖڇ
२ࠐڣΔءኔ୶Աᒘڤ(codebook-based)ᐛᑇإ֏ݾΔC-CMSፖC-CMVNΖټ৸ᆠΔڇຍࠄऄխΔشࠌࢬऱᐛอૠၦਢطᒘૠጩΔኔᢞኔຍࠄᒘڤᐛᑇإ֏ݾՕીՂઃᚌᖞڤᐛᑇإ֏ݾ
ΖݺଚΔଚսڶԫޡऱޏΖڼڂΔءᓵխݺଚ༼נԱԫޏ
ݾԱଃೠشଚᚨݺڇհΔٵऱլݧΔઌኙݧمऱᒘ৬ڤߜ
ଃಛᇆΔ৵شܓଃऱଃᐛࠐಝᒭᒘڗΙڼ؆Δຍࠄᒘڗᖕො
ऱᐛᑇؾᓿղլٵऱᦞૹ(weight)ΔߜޏڼऄڇรԿڶᇡาऱᎅΖ ೈԱ༼נՂ૪ڤߜޏऱᒘ৬مݧհ؆ΔءᓵԫૹរڇΔݺଚ༼נԱԫߓࠄΔA-CMSΕA-CMVNፖA-HEQΔຍݾ֏إᐛᑇ(associative)ڤٽ٨৵հอૠٽᖞڼشऱᐛอૠᇷಛΔشࢬݾڤፖᒘݾڤԱᖞٽଚᖞݺխΔݾ
ၦࠐചCMSΔCMVNࢨHEQΔᇡ૪รխΔຍᑌऱݾڶאױயچᇖᚍᒘإऱᐛᑇڤٽऱរΔรնխऱኔᢞኔΔߩխΔᠧಛᇷಛլݾڤ
ᙃᢝ壄ᒔ৫Ζ༽چޓΔ౨ݾ֏إᐛᑇڤፖᒘڤᖞݾ֏ ឈڤٽᐛᑇإ֏ݾயԼထΔࠋڶᘸਬطࠄᑇխऱ(48)ڤΕ(45)ڤܛ) խऱ(410)ڤ )ऱ೯ᓳᖞࠐᖞٽᒘڤፖᖞڤհอૠᇷಛΔࠐآڇڼڂऱ୶ՂΔݺଚݦඨ౨೯ࠋנޣچऱ ፖ ᑇଖΔࠐኙ
ऱอૠᇷಛޓ壄ᒔऱᖞٽΔٵழΔڇ৬ዌᠧಛଃᒘऱݧՂΔݺଚՈݦඨ౨ە
ڍᠧಛ۷ऱऄΔޓ壄ᒔԫଃխᠧಛऱอૠࢤΔཚৱڶޓயچ༼ᒘ
ڤᐛᑇإ֏ݾऱய౨Ζ
References
[1] Chung-fu Tai and Jeih-weih Hung, "Silence Energy Normalization for Robust Speech Recognition in Additive Noise Environments," 2006 International Conference on Spoken Language Processing (Interspeech 2006-ICSLP).
[2] S. Furui, "Cepstral Analysis Technique for Automatic Speaker Verification," IEEE Trans. on Acoustics, Speech and Signal Processing, 1981.
[3] S. Tibrewala and H. Hermansky, "Multiband and Adaptation Approaches to Robust Speech Recognition," 1997 European Conference on Speech Communication and Technology (Eurospeech 1997).
[4] A. Torre, J. Segura, C. Benitez, A. M. Peinado, and A. J. Rubio, "Non-Linear Transformations of the Feature Space for Robust Speech Recognition," 2002 International Conference on Acoustics, Speech and Signal Processing (ICASSP 2002).
[5] Tsung-hsueh Hsieh, "Feature Statistics Compensation for Robust Speech Recognition in Additive Noise Environments," M.S. thesis, National Chi Nan University, Taiwan, 2007.
[6] Tsung-hsueh Hsieh and Jeih-weih Hung, "Speech Feature Compensation Based on Pseudo Stereo Codebooks for Robust Speech Recognition in Additive Noise Environments," 2007 European Conference on Speech Communication and Technology (Interspeech 2007-Eurospeech).
[7] Jeih-weih Hung, "Cepstral Statistics Compensation and Normalization Using Online Pseudo Stereo Codebooks for Robust Speech Recognition in Additive Noise Environments," IEICE Transactions on Information and Systems, 2008.
[8] H. G. Hirsch and D. Pearce, "The AURORA Experimental Framework for the Performance Evaluations of Speech Recognition Systems under Noisy Conditions," Proceedings of ISCA IIWR ASR2000, Paris, France, 2000.
[9] ITU recommendation G.712, "Transmission Performance Characteristics of Pulse Code Modulation Channels," Nov. 1996.
[10] http://htk.eng.cam.ac.uk