Software performance enhancement using multithreading, SIMD and architectural considerations
Speex
Speex is an Open Source/Free Software audio compression format designed for speech.
Speex is designed for packet networks and voice over IP (VoIP) applications. File-based compression is, of course, also supported.
Threading, SSE and compiling with the Intel compiler
Threading by data decomposition – trial 1
Create a couple of threads and give each one of them a frame to work on.
A main thread reads frames from the input file and creates a "worker" thread for each frame. Each worker thread encodes the given frame and writes it to the output file.
Problem &amp; Solution
Problem: creating a new thread for every new frame takes a lot of time.
Solution: create a fixed number of threads (equal to the number of cores in the CPU) once, at the beginning. For example, with 2 cores we have a main thread and 2 worker threads: one handles all the even frames and one handles all the odd frames.
[Diagram: the main thread reads frames from the input wav file and hands them to encoding threads, which write the encoded frames to the output spx file]
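The even/odd split described above can be sketched as follows. This is a minimal pthread sketch, not the actual Speex code: `encode_frame()` is a hypothetical stand-in for the real `speex_encode()` call, and the frame I/O is omitted.

```c
#include <pthread.h>

/* Hypothetical sketch: a fixed pool of NUM_WORKERS threads, created once.
 * Worker 0 encodes the even frames, worker 1 the odd frames. */
#define NUM_WORKERS 2

typedef struct {
    int worker_id;   /* 0 .. NUM_WORKERS-1 */
    int num_frames;
    int frames_done; /* how many frames this worker encoded */
} worker_arg_t;

/* stand-in for speex_encode() on one frame */
static void encode_frame(int frame_idx) { (void)frame_idx; }

static void *worker(void *p) {
    worker_arg_t *a = (worker_arg_t *)p;
    /* each worker strides through the frames with step NUM_WORKERS */
    for (int i = a->worker_id; i < a->num_frames; i += NUM_WORKERS) {
        encode_frame(i);
        a->frames_done++;
    }
    return NULL;
}

/* Launch the pool once, up front, instead of one thread per frame. */
void run_parity_encoding(int num_frames, int frames_done[NUM_WORKERS]) {
    pthread_t tid[NUM_WORKERS];
    worker_arg_t args[NUM_WORKERS];
    for (int w = 0; w < NUM_WORKERS; w++) {
        args[w].worker_id = w;
        args[w].num_frames = num_frames;
        args[w].frames_done = 0;
        pthread_create(&tid[w], NULL, worker, &args[w]);
    }
    for (int w = 0; w < NUM_WORKERS; w++) {
        pthread_join(tid[w], NULL);
        frames_done[w] = args[w].frames_done;
    }
}
```

In the real encoder each worker would also need the inter-frame state discussed on the next slide, which is exactly what this scheme breaks.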
Threading by data decomposition – trial 1
65% speedup! But…
– although Speex is robust to packet loss, each frame does depend on the previous packet when it is available → a definite reduction in sound quality.
Threading by data decomposition – trial 2
Split the file into two. One thread handles one half of the file and the other thread handles the other half.
[Diagram: thread A encodes the first half of the input wav file and thread B encodes the second half; both write to the output spx file]
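The splitting itself is just a matter of assigning each thread a contiguous range of frames. A minimal sketch of that range computation (my own helper, not Speex code; it generalizes the two-way split to any thread count):

```c
/* Divide num_frames frames into num_threads contiguous chunks.
 * Thread t gets the half-open range [*start, *end); leftover frames
 * go one-each to the first (num_frames % num_threads) threads. */
void frame_range(int num_frames, int num_threads, int t,
                 int *start, int *end) {
    int base = num_frames / num_threads;
    int rem  = num_frames % num_threads;
    *start = t * base + (t < rem ? t : rem);
    *end   = *start + base + (t < rem ? 1 : 0);
}
```

Because each chunk is contiguous, every frame inside a chunk still sees its predecessor; the inter-frame dependency is broken only once per boundary, which is why this split hurts quality far less than the even/odd split of trial 1.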
Threading by data decomposition – trial 2
87% speedup for 2 threads
93% speedup for 22 threads
Threading by data decomposition – trial 2
Passed the Intel VTune thread checker:
vq_nbest
vq_nbest finds the n best vectors in the codebook: it checks each codebook vector against the given vector and compares it against the rest of the vectors it has already found to be best.
In a regular run, the loop is performed 256 times on average – a big waste of time.
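A scalar sketch of the search described above, written from the description rather than copied from Speex (the real `vq_nbest` maximizes a score built from precomputed codebook energies; here "best" is simply the smallest squared distance, kept in a small insertion-sorted list):

```c
#include <float.h>

/* Find the N codebook entries closest to target (squared distance).
 * codebook holds nb_entries vectors of length len, row-major.
 * nbest/best_dist are outputs of length N, sorted best-first. */
void vq_nbest_ref(const float *target, const float *codebook,
                  int nb_entries, int len, int N,
                  int *nbest, float *best_dist) {
    for (int n = 0; n < N; n++) {
        nbest[n] = -1;
        best_dist[n] = FLT_MAX;
    }
    for (int i = 0; i < nb_entries; i++) {
        float dist = 0.f;
        for (int j = 0; j < len; j++) {
            float d = target[j] - codebook[i * len + j];
            dist += d * d;
        }
        /* compare against the candidates already found to be best */
        for (int n = 0; n < N; n++) {
            if (dist < best_dist[n]) {
                /* shift worse candidates down, insert here */
                for (int m = N - 1; m > n; m--) {
                    best_dist[m] = best_dist[m - 1];
                    nbest[m]     = nbest[m - 1];
                }
                best_dist[n] = dist;
                nbest[n]     = i;
                break;
            }
        }
    }
}
```

The outer loop over `nb_entries` (256 on average, per the slide) is the part the following trials attempt to split across threads.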
Threading by functional decomposition – Trial 1
We split the search in the codebook between a couple of threads.
Each thread received a portion of the codebook and was supposed to find the n best vectors in it. Then some of the threads would find the n best among their own found vectors and the ones a different thread found, until we get one set of n best vectors.
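The reduction step described above – combining one thread's n-best list with another's – can be sketched as a merge of two sorted lists (a hypothetical helper, assuming both partial lists are already sorted best-first by distance):

```c
/* Merge two sorted n-best lists (ascending distance) into one n-best list,
 * as one thread would do with its own results and another thread's. */
void merge_nbest(const int *idx_a, const float *dist_a,
                 const int *idx_b, const float *dist_b,
                 int n, int *idx_out, float *dist_out) {
    int ia = 0, ib = 0;
    for (int k = 0; k < n; k++) {
        /* take from list A when B is exhausted or A's next entry is better */
        if (ib >= n || (ia < n && dist_a[ia] <= dist_b[ib])) {
            idx_out[k]  = idx_a[ia];
            dist_out[k] = dist_a[ia];
            ia++;
        } else {
            idx_out[k]  = idx_b[ib];
            dist_out[k] = dist_b[ib];
            ib++;
        }
    }
}
```

The merge itself is cheap (O(n)); as the next slide shows, the cost that kills this trial is creating and synchronizing the threads, not combining their results.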
Threading by functional decomposition – Trial 1
Because of the overhead of creating all the threads, we actually got worse performance.
It takes approximately 72 µs to create a thread, while the original vq_nbest runtime is 13 µs. Even if we create only 1 thread each time – and that is before counting the overhead of synchronizing all the threads – we still get a slowdown of ~5.5× in the vq_nbest runtime.
As a result, the total runtime of Speex was 1.45 minutes!
Threading by functional decomposition – Trial 2
A "main" thread and a couple of "worker" threads that live throughout the whole run of Speex.
The main thread does all the routines that must not be done in parallel, and every time it encounters a function that can be split between threads, it hands the work to the existing threads.
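The persistent-worker scheme can be sketched with condition variables: workers sleep until the main thread publishes a job, run their share, and the last one to finish wakes the main thread. This is my own minimal sketch, not the project's code; the toy "job" (computing `id + generation`) stands in for a vq_nbest sub-search, and the merged result is just the sum of the partial results.

```c
#include <pthread.h>

#define WORKERS 2

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  go   = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int generation = 0;      /* bumped once per published job     */
static int pending    = 0;      /* workers still running current job */
static int quit       = 0;
static int partial[WORKERS];    /* per-worker result of the toy job  */

static void *worker(void *p) {
    int id = (int)(long)p;
    int seen = 0;               /* last generation this worker ran   */
    pthread_mutex_lock(&lock);
    for (;;) {
        while (generation == seen && !quit)
            pthread_cond_wait(&go, &lock);
        if (quit) break;
        seen = generation;
        pthread_mutex_unlock(&lock);
        partial[id] = id + seen;          /* stand-in for a sub-search */
        pthread_mutex_lock(&lock);
        if (--pending == 0)
            pthread_cond_signal(&done);   /* last worker wakes main    */
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Called by the main thread each time it reaches a splittable function. */
int dispatch_job(void) {
    pthread_mutex_lock(&lock);
    generation++;
    pending = WORKERS;
    pthread_cond_broadcast(&go);
    while (pending != 0)
        pthread_cond_wait(&done, &lock);
    pthread_mutex_unlock(&lock);
    int sum = 0;                          /* merge the partial results */
    for (int i = 0; i < WORKERS; i++)
        sum += partial[i];
    return sum;
}

void start_pool(pthread_t tid[WORKERS]) {
    for (long i = 0; i < WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
}

void stop_pool(pthread_t tid[WORKERS]) {
    pthread_mutex_lock(&lock);
    quit = 1;
    pthread_cond_broadcast(&go);
    pthread_mutex_unlock(&lock);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(tid[i], NULL);
}
```

The thread-creation cost is paid once at `start_pool()`; each `dispatch_job()` pays only the wake/wait synchronization, which is the "overhead of synchronization" estimated on the next slide.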
Threading by functional decomposition – Trial 2
The original vq_nbest runs for a total of 10.5 sec; after the SSE improvement it runs for a total of 8 sec.
Perfect parallelization would bring us to 4 sec; adding the synchronization overhead, ~5 sec → the total Speex runtime would drop to 33.14 sec.
In the most optimistic prediction, the speedup of the new method including SSE would be 16.5% → a 10% improvement on top of the SSE improvement.
Threading by Pipelined Data Domain Decomposition
Each frame's encoding depends on the previous frame's gain, excitation and adaptive codebook.
One main object holds all the parameters for the encoding process.
[Diagram: wideband encoding data flow –
qmf_decomp: compute the two sub-bands from the input wav using h0, h1 (filters computed in the previous frame);
speex_encode: encode the narrowband part using the qmf_decomp output and h0; high-band buffering / sync with the low band;
spx_autocorr: start encoding the high band – compute the auto-correlation of the high band mixed with the narrow band; filtering in the power-spectrum domain using the auto-correlation;
WLD: Levinson-Durbin using the auto-correlation and the LPC;
lpc_to_lsp: LPC to LSPs (x-domain) transform; x-domain to angle domain of the LSPs;
LSP quantization and LSP interpolation (quantized and unquantized), using the current LSPs and the previous frame's LSPs;
compute the mid-band (4000 Hz for wideband) response of the low-band and high-band filters;
iir_mem2, fir_mem_up: apply the h0, h1 filters and the gain on the full frame; final signal synthesis from the excitation, using the excitation, gain and LSPs of the wideband and the high-band part;
finally, move the high-band bits in the time domain and insert the narrow-band bits in the correct place]
Threading by Pipelined Data Domain Decomposition
Unfortunately, we could not perform pipelined data-domain decomposition on Speex.
Using intrinsic SSE commands we have rewritten the following functions:
– inner_prod: function speedup 52%, total speedup 1.31%
– vq_nbest: function speedup 31.25%, total speedup 7%
– vq_nbest_sign: function speedup 40%, total speedup 3%
– split_cb_search_shape_sign: function speedup 2%, total speedup 0.5%
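To illustrate the kind of rewrite involved, here is a hedged sketch of an SSE inner product in the spirit of the rewritten `inner_prod` (not the actual Speex code): four floats are multiplied and accumulated per iteration, followed by a horizontal sum. It assumes `len` is a multiple of 4; the real routine must also handle remainder elements.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Dot product of two float vectors, 4 lanes at a time. */
float inner_prod_sse(const float *a, const float *b, int len) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < len; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads for safety */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb)); /* 4 partial sums */
    }
    /* horizontal sum of the 4 lanes of acc */
    __m128 hi = _mm_movehl_ps(acc, acc);        /* lanes [2,3] -> [0,1] */
    __m128 s  = _mm_add_ps(acc, hi);            /* lane0+2, lane1+3     */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); /* + lane 1             */
    float result;
    _mm_store_ss(&result, s);
    return result;
}
```

The per-function gain is large (52% here), but because `inner_prod` is a small share of total runtime, the whole-program gain is only 1.31%, per the figures above – a direct illustration of Amdahl's law.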
[Charts: SSE-tuned functions – original vs. new function time (seconds, 0–12 scale), per-function speedup and total Speex speedup]

Function          Function speedup (%)   Total Speex speedup (%)
inner_prod              52.08                  1.31
vq_nbest                31.25                  6.92
vq_nbest_sign           39.62                  2.79
cb_search                2.11                  0.52
Compiling with Intel® C Compiler
We used the following flags for the compilation of the final version of Speex with the SSE and the threading improvements:
Intel specific:
– Global optimization – Yes (/Og)
– Floating point precision improvement – None
– Floating point speculation – Fast (/Qfp-speculation:fast)
– Use Intel® processor extensions – Intel Core™ 2 Duo Processor (/QaxT)
– Parallelization – Enable Parallelization (/Qparallel)
[Chart: Speex total runtime (seconds) per configuration]

Configuration                        Runtime (sec)
Original                               38.646
Data decomposition, 2 threads          20.624
Data decomposition, 22 threads         20.053
Functional decomposition               87
SSE inner_prod                         38.146
SSE vq_nbest                           36.146
SSE vq_nbest_sign                      37.596
SSE cb_search                          38.446
Everything                             18.438
Everything + Intel compiler            16.984
[Chart: Speex total speedup (%) per configuration, relative to the original]

Configuration                        Speedup (%)
Data decomposition, 2 threads           87.38
Data decomposition, 22 threads          92.72
Functional decomposition               -55.58
SSE inner_prod                           1.31
SSE vq_nbest                             6.92
SSE vq_nbest_sign                        2.79
SSE cb_search                            0.52
Everything                             109.60
Everything + Intel compiler            127.54
Summary
We got a 127.54% speedup using threading, intrinsic SSE functions and the Intel compiler.
The project goal was achieved!!