Software performance enhancement using multithreading, SIMD and architectural considerations
Speex
Speex is an Open Source/Free Software audio compression format designed for speech.
Speex is designed for packet networks and voice over IP (VoIP) applications. File-based compression is, of course, also supported.
Threading, SSE and compiling with the Intel compiler
Threading by data decomposition – trial 1
Create a couple of threads and give each one of them a frame to work on.
A main thread reads frames from the input file and creates a "worker" thread for each frame. Each worker thread encodes the given frame and writes it to the output file.
Problem &amp; Solution
Problem: creating a new thread for every new frame takes a lot of time.
Solution: create a fixed number of threads (equal to the number of cores in the CPU) once, at the beginning. For example, with 2 cores we have a main thread and 2 worker threads: one handles all the even frames and one handles all the odd frames.
[Diagram: the main thread reads frames from the input wav file and hands them to encoding threads, which write the encoded frames to the output spx file]
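The even/odd split described above can be sketched as follows. This is a minimal pthread sketch, not the actual Speex code: `encode_frame()` is a hypothetical stand-in for the real `speex_encode()` call, and the frame I/O is omitted.

```c
#include <pthread.h>

/* Hypothetical sketch: a fixed pool of NUM_WORKERS threads, created once.
 * Worker 0 encodes the even frames, worker 1 the odd frames. */
#define NUM_WORKERS 2

typedef struct {
    int worker_id;   /* 0 .. NUM_WORKERS-1 */
    int num_frames;
    int frames_done; /* how many frames this worker encoded */
} worker_arg_t;

/* stand-in for speex_encode() on one frame */
static void encode_frame(int frame_idx) { (void)frame_idx; }

static void *worker(void *p) {
    worker_arg_t *a = (worker_arg_t *)p;
    /* each worker strides through the frames with step NUM_WORKERS */
    for (int i = a->worker_id; i < a->num_frames; i += NUM_WORKERS) {
        encode_frame(i);
        a->frames_done++;
    }
    return NULL;
}

/* Launch the pool once, up front, instead of one thread per frame. */
void run_parity_encoding(int num_frames, int frames_done[NUM_WORKERS]) {
    pthread_t tid[NUM_WORKERS];
    worker_arg_t args[NUM_WORKERS];
    for (int w = 0; w < NUM_WORKERS; w++) {
        args[w].worker_id = w;
        args[w].num_frames = num_frames;
        args[w].frames_done = 0;
        pthread_create(&tid[w], NULL, worker, &args[w]);
    }
    for (int w = 0; w < NUM_WORKERS; w++) {
        pthread_join(tid[w], NULL);
        frames_done[w] = args[w].frames_done;
    }
}
```

In the real encoder each worker would also need the inter-frame state discussed on the next slide, which is exactly what this scheme breaks.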
Threading by data decomposition – trial 1
65% speedup! But…
– although Speex is robust to packet loss, each frame does depend on the previous packet when it is available → a definite reduction in sound quality.
Threading by data decomposition – trial 2
Split the file into two. One thread handles one half of the file and the other thread handles the other half.
[Diagram: thread A encodes the first half of the input wav file and thread B encodes the second half; both write to the output spx file]
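The splitting itself is just a matter of assigning each thread a contiguous range of frames. A minimal sketch of that range computation (my own helper, not Speex code; it generalizes the two-way split to any thread count):

```c
/* Divide num_frames frames into num_threads contiguous chunks.
 * Thread t gets the half-open range [*start, *end); leftover frames
 * go one-each to the first (num_frames % num_threads) threads. */
void frame_range(int num_frames, int num_threads, int t,
                 int *start, int *end) {
    int base = num_frames / num_threads;
    int rem  = num_frames % num_threads;
    *start = t * base + (t < rem ? t : rem);
    *end   = *start + base + (t < rem ? 1 : 0);
}
```

Because each chunk is contiguous, every frame inside a chunk still sees its predecessor; the inter-frame dependency is broken only once per boundary, which is why this split hurts quality far less than the even/odd split of trial 1.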
Threading by data decomposition – trial 2
87% speedup for 2 threads
93% speedup for 22 threads
Threading by data decomposition – trial 2
Passed the Intel VTune thread checker:
vq_nbest
vq_nbest finds the n best vectors in the codebook: it checks each codebook vector against the given vector and compares it against the rest of the vectors it has already found to be best.
In a regular run, the loop is performed 256 times on average – a big waste of time.
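A scalar sketch of the search described above, written from the description rather than copied from Speex (the real `vq_nbest` maximizes a score built from precomputed codebook energies; here "best" is simply the smallest squared distance, kept in a small insertion-sorted list):

```c
#include <float.h>

/* Find the N codebook entries closest to target (squared distance).
 * codebook holds nb_entries vectors of length len, row-major.
 * nbest/best_dist are outputs of length N, sorted best-first. */
void vq_nbest_ref(const float *target, const float *codebook,
                  int nb_entries, int len, int N,
                  int *nbest, float *best_dist) {
    for (int n = 0; n < N; n++) {
        nbest[n] = -1;
        best_dist[n] = FLT_MAX;
    }
    for (int i = 0; i < nb_entries; i++) {
        float dist = 0.f;
        for (int j = 0; j < len; j++) {
            float d = target[j] - codebook[i * len + j];
            dist += d * d;
        }
        /* compare against the candidates already found to be best */
        for (int n = 0; n < N; n++) {
            if (dist < best_dist[n]) {
                /* shift worse candidates down, insert here */
                for (int m = N - 1; m > n; m--) {
                    best_dist[m] = best_dist[m - 1];
                    nbest[m]     = nbest[m - 1];
                }
                best_dist[n] = dist;
                nbest[n]     = i;
                break;
            }
        }
    }
}
```

The outer loop over `nb_entries` (256 on average, per the slide) is the part the following trials attempt to split across threads.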
Threading by functional decomposition – Trial 1
We split the search in the codebook between a couple of threads.
Each thread received a portion of the codebook and was supposed to find the n best vectors in it. Then some of the threads would find the n best among their own found vectors and the ones a different thread found, until we get one set of n best vectors.
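The reduction step described above – combining one thread's n-best list with another's – can be sketched as a merge of two sorted lists (a hypothetical helper, assuming both partial lists are already sorted best-first by distance):

```c
/* Merge two sorted n-best lists (ascending distance) into one n-best list,
 * as one thread would do with its own results and another thread's. */
void merge_nbest(const int *idx_a, const float *dist_a,
                 const int *idx_b, const float *dist_b,
                 int n, int *idx_out, float *dist_out) {
    int ia = 0, ib = 0;
    for (int k = 0; k < n; k++) {
        /* take from list A when B is exhausted or A's next entry is better */
        if (ib >= n || (ia < n && dist_a[ia] <= dist_b[ib])) {
            idx_out[k]  = idx_a[ia];
            dist_out[k] = dist_a[ia];
            ia++;
        } else {
            idx_out[k]  = idx_b[ib];
            dist_out[k] = dist_b[ib];
            ib++;
        }
    }
}
```

The merge itself is cheap (O(n)); as the next slide shows, the cost that kills this trial is creating and synchronizing the threads, not combining their results.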
Threading by functional decomposition – Trial 1
Because of the overhead of creating all the threads, we actually got worse performance.
It takes approximately 72 µs to create a thread, while the original vq_nbest runtime is 13 µs. Even if we create only 1 thread each time – and that is before counting the overhead of synchronizing all the threads – we still get a slowdown of ~5.5× in the vq_nbest runtime.
As a result, the total runtime of Speex was 1.45 minutes!
Threading by functional decomposition – Trial 2
A "main" thread and a couple of "worker" threads that live throughout the whole run of Speex.
The main thread does all the routines that must not be done in parallel, and every time it encounters a function that can be split between threads, it hands the work to the existing threads.
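The persistent-worker scheme can be sketched with condition variables: workers sleep until the main thread publishes a job, run their share, and the last one to finish wakes the main thread. This is my own minimal sketch, not the project's code; the toy "job" (computing `id + generation`) stands in for a vq_nbest sub-search, and the merged result is just the sum of the partial results.

```c
#include <pthread.h>

#define WORKERS 2

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  go   = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int generation = 0;      /* bumped once per published job     */
static int pending    = 0;      /* workers still running current job */
static int quit       = 0;
static int partial[WORKERS];    /* per-worker result of the toy job  */

static void *worker(void *p) {
    int id = (int)(long)p;
    int seen = 0;               /* last generation this worker ran   */
    pthread_mutex_lock(&lock);
    for (;;) {
        while (generation == seen && !quit)
            pthread_cond_wait(&go, &lock);
        if (quit) break;
        seen = generation;
        pthread_mutex_unlock(&lock);
        partial[id] = id + seen;          /* stand-in for a sub-search */
        pthread_mutex_lock(&lock);
        if (--pending == 0)
            pthread_cond_signal(&done);   /* last worker wakes main    */
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Called by the main thread each time it reaches a splittable function. */
int dispatch_job(void) {
    pthread_mutex_lock(&lock);
    generation++;
    pending = WORKERS;
    pthread_cond_broadcast(&go);
    while (pending != 0)
        pthread_cond_wait(&done, &lock);
    pthread_mutex_unlock(&lock);
    int sum = 0;                          /* merge the partial results */
    for (int i = 0; i < WORKERS; i++)
        sum += partial[i];
    return sum;
}

void start_pool(pthread_t tid[WORKERS]) {
    for (long i = 0; i < WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
}

void stop_pool(pthread_t tid[WORKERS]) {
    pthread_mutex_lock(&lock);
    quit = 1;
    pthread_cond_broadcast(&go);
    pthread_mutex_unlock(&lock);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(tid[i], NULL);
}
```

The thread-creation cost is paid once at `start_pool()`; each `dispatch_job()` pays only the wake/wait synchronization, which is the "overhead of synchronization" estimated on the next slide.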
Threading by functional decomposition – Trial 2
The original vq_nbest runs for a total of 10.5 sec; after the SSE improvement it runs for a total of 8 sec.
Perfect parallelization would bring us to 4 sec; adding the synchronization overhead, ~5 sec → the total Speex runtime would drop to 33.14 sec.
In the most optimistic prediction, the speedup of the new method including SSE would be 16.5% → a 10% improvement on top of the SSE improvement.
Threading by Pipelined Data Domain Decomposition
Each frame's encoding depends on the previous frame's gain, excitation and adaptive codebook.
One main object holds all the parameters for the encoding process.
[Diagram: wideband encoding data flow –
qmf_decomp: compute the two sub-bands from the input wav using h0, h1 (filters computed in the previous frame);
speex_encode: encode the narrowband part using the qmf_decomp output and h0; high-band buffering / sync with the low band;
spx_autocorr: start encoding the high band – compute the auto-correlation of the high band mixed with the narrow band; filtering in the power-spectrum domain using the auto-correlation;
WLD: Levinson-Durbin using the auto-correlation and the LPC;
lpc_to_lsp: LPC to LSPs (x-domain) transform; x-domain to angle domain of the LSPs;
LSP quantization and LSP interpolation (quantized and unquantized), using the current LSPs and the previous frame's LSPs;
compute the mid-band (4000 Hz for wideband) response of the low-band and high-band filters;
iir_mem2, fir_mem_up: apply the h0, h1 filters and the gain on the full frame; final signal synthesis from the excitation, using the excitation, gain and LSPs of the wideband and the high-band part;
finally, move the high-band bits in the time domain and insert the narrow-band bits in the correct place]
Threading by Pipelined Data Domain Decomposition
Unfortunately, we could not perform pipelined data-domain decomposition on Speex.
Using intrinsic SSE commands we have rewritten the following functions:
– inner_prod: function speedup 52%, total speedup 1.31%
– vq_nbest: function speedup 31.25%, total speedup 7%
– vq_nbest_sign: function speedup 40%, total speedup 3%
– split_cb_search_shape_sign: function speedup 2%, total speedup 0.5%
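To illustrate the kind of rewrite involved, here is a hedged sketch of an SSE inner product in the spirit of the rewritten `inner_prod` (not the actual Speex code): four floats are multiplied and accumulated per iteration, followed by a horizontal sum. It assumes `len` is a multiple of 4; the real routine must also handle remainder elements.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Dot product of two float vectors, 4 lanes at a time. */
float inner_prod_sse(const float *a, const float *b, int len) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < len; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads for safety */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb)); /* 4 partial sums */
    }
    /* horizontal sum of the 4 lanes of acc */
    __m128 hi = _mm_movehl_ps(acc, acc);        /* lanes [2,3] -> [0,1] */
    __m128 s  = _mm_add_ps(acc, hi);            /* lane0+2, lane1+3     */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); /* + lane 1             */
    float result;
    _mm_store_ss(&result, s);
    return result;
}
```

The per-function gain is large (52% here), but because `inner_prod` is a small share of total runtime, the whole-program gain is only 1.31%, per the figures above – a direct illustration of Amdahl's law.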
[Charts: SSE-tuned functions – original vs. new function time (seconds, 0–12 scale), per-function speedup and total Speex speedup]

Function          Function speedup (%)   Total Speex speedup (%)
inner_prod              52.08                  1.31
vq_nbest                31.25                  6.92
vq_nbest_sign           39.62                  2.79
cb_search                2.11                  0.52
Compiling with Intel® C Compiler
We used the following flags for the compilation of the final version of Speex with the SSE and the threading improvements:
Intel specific:
– Global optimization – Yes (/Og)
– Floating point precision improvement – None
– Floating point speculation – Fast (/Qfp-speculation:fast)
– Use Intel® processor extensions – Intel Core™ 2 Duo Processor (/QaxT)
– Parallelization – Enable Parallelization (/Qparallel)
[Chart: Speex total runtime (seconds) per configuration]

Configuration                        Runtime (sec)
Original                               38.646
Data decomposition, 2 threads          20.624
Data decomposition, 22 threads         20.053
Functional decomposition               87
SSE inner_prod                         38.146
SSE vq_nbest                           36.146
SSE vq_nbest_sign                      37.596
SSE cb_search                          38.446
Everything                             18.438
Everything + Intel compiler            16.984
[Chart: Speex total speedup (%) per configuration, relative to the original]

Configuration                        Speedup (%)
Data decomposition, 2 threads           87.38
Data decomposition, 22 threads          92.72
Functional decomposition               -55.58
SSE inner_prod                           1.31
SSE vq_nbest                             6.92
SSE vq_nbest_sign                        2.79
SSE cb_search                            0.52
Everything                             109.60
Everything + Intel compiler            127.54
Summary
We got a 127.54% speedup using threading, intrinsic SSE functions and the Intel compiler.
The project goal was achieved!!