
Real-Time Graphics Processing Unit Implementation of Whitening Filters for Audio Signals

by Omer A.S. Osman

B.S. in Electrical Engineering, May 2010, The George Washington University

A Thesis submitted to

The Faculty of The School of Engineering and Applied Science

of The George Washington University in partial satisfaction of the requirements

for the degree of Master of Science

August 31, 2011

Thesis directed by

Miloš Doroslovački, Associate Professor of Engineering and Applied Science


© Copyright 2011 by Omer A.S. Osman

All Rights Reserved.


Abstract

Real-Time Graphics Processing Unit Implementation

of Whitening Filters for Audio Signals

This work investigates a real-time implementation of autoregressive and pitch-prediction whitening filters for use in audio feedback suppression. The work begins by analyzing whitening filter performance for synthesized and recorded test audio signals. A MATLAB simulation of the acoustic feedback cancellation (AFC) algorithm shows pitch prediction to be the most computationally intensive aspect of the feedback cancellation algorithm. A DSP processor implementation is demonstrated in which the autoregressive filter outperforms the MATLAB implementation in computation time, while the pitch-prediction implementation fails to meet real-time requirements. A successful real-time implementation of the pitch-prediction algorithm is demonstrated on an NVIDIA graphics processing unit (GPU), with substantial speed gains compared to the MATLAB implementation.


Table of Contents

Abstract ..... iii
Table of Contents ..... iv
List of Figures ..... vii
List of Tables ..... viii
Glossary of Terms and Acronyms ..... ix
Chapter 1 – Introduction ..... 1
1.1. Research Problem ..... 1
1.2. Autoregressive Modeling ..... 3
1.3. Pitch Linear Prediction Modeling ..... 3
1.4. Contributions ..... 4
Chapter 2 – Theoretical Background ..... 5
2.1. Autoregressive Modeling using the Autocorrelation Method ..... 5
2.2. 3-Tap Pitch Prediction Model ..... 6
2.3. Issues in Real-Time Implementation ..... 8
Chapter 3 – Autoregressive and Pitch Prediction Filters Performance ..... 9
3.1. Test Filters Conditions ..... 10
3.2. Test Metrics ..... 11
3.3. Synthesized Test Signals ..... 13
3.3.1. Colored Noise ..... 13
3.3.1.1. Autoregressive Filter Response to Synthesized Colored Noise ..... 15
3.3.2. Synthesized Ab Note ..... 18
3.3.2.1. Cascade Filter Response ..... 19
3.4. Recorded Test Signals ..... 20
3.4.1. Recorded Speech Signals ..... 21
3.4.1.1. Speech Sibilance Signal ..... 21
3.4.1.1.1. Autoregressive Filter Response to Recorded 's' Sound ..... 22
3.4.1.2. Speech Vowel Signal ..... 24
3.4.1.2.1. Autoregressive Filter Response to Recorded 'ah' Sound ..... 25
3.4.2. Recorded Musical Notes ..... 29
3.4.2.1. Monophonic Audio Signal – Piano Note ..... 29
3.4.2.1.1. Cascade Pitch and Autoregressive Response – Monophonic Input Signal ..... 30
3.4.2.2. Polyphonic Audio Signal – Piano Chord ..... 33
3.4.2.2.1. Cascade Pitch and Autoregressive Filters Response – Polyphonic Input ..... 34
3.4.2.3. Polyphonic Audio Signal – Piano Chord with Bass Note ..... 35
3.4.2.3.1. Cascade Pitch and Autoregressive Filters – Polyphonic Input with Bass Note ..... 36
3.5. Discussion ..... 38
Chapter 4 – DSP Implementation ..... 40
4.1. Challenges in Implementation ..... 40
4.2. Target Architecture ..... 41
4.3. DSP Processor vs FPGA ..... 41
4.4. DSP Processor Performance Results ..... 42
4.4.1. Recorded Sibilance – AR Filter Testing ..... 42
4.4.2. Recorded Ab Piano Note ..... 44
4.4.3. Processor Profiling ..... 44
4.5. Problems Encountered ..... 46
4.5.1. Memory Segmentation ..... 46
4.5.2. Stack Overflow ..... 46
4.5.3. Hardware Division ..... 47
4.6. Discussion ..... 47
Chapter 5 – GPU Implementation ..... 48
5.1. Target Architecture ..... 48
5.2. Algorithm Implementation ..... 49
5.2.1. PLP Filter Implementation ..... 49
5.2.2. AR Filter Implementation ..... 51
5.3. Numerical Accuracy in CUDA Implementation ..... 52
5.4. Problems Encountered ..... 53
Chapter 6 – Conclusions ..... 55
6.1. Filters Performance ..... 55
6.2. Development Cost ..... 56
6.3. Final Remarks ..... 56
References ..... 57
Appendix A – DSP Implementation Code Listing ..... 59
Appendix B – GPU Implementation Code Listing ..... 74
Header File with Algorithm Definitions ..... 74
Main Driver File ..... 74
CUDA GPU Driver File ..... 85


List of Figures

Figure 1-1 MATLAB Profiling of AFC Algorithm Showing Linear Prediction Time Consumption ..... 2
Figure 3-1 Colored Noise Source using a Butterworth Filter fc = 3 kHz ..... 14
Figure 3-2 Frequency Response of the AR Filter with Colored Noise Input ..... 15
Figure 3-3 Pole-Zero Plot Demonstrating AR Filter Response ..... 16
Figure 3-4 AR Filter Residual Signal Showing a Flattened Spectrum ..... 17
Figure 3-5 Synthesized Ab Note Frequency Spectrum ..... 18
Figure 3-6 Residual Spectrum after Cascaded PLP and AR Filters Whitening ..... 20
Figure 3-7 Recorded 's' Sound from a Male Voice ..... 22
Figure 3-8 Whitened Recorded 's' Signal Filtered using Autoregressive Filter ..... 24
Figure 3-9 Recorded 'ah' Sound from a Male Voice Input Signal Spectrum ..... 25
Figure 3-10 Cascade PLP and Autoregressive Filters Structure ..... 25
Figure 3-11 Cascade Filter Structure with Pre-Whitening Filter ..... 26
Figure 3-12 Residual Spectrum of 'ah' Vocalization after AR Filtering ..... 27
Figure 3-13 Residual Spectrum of Cascade Filters of Recorded 'ah' Sound ..... 28
Figure 3-14 Recorded Ab Piano Note ..... 30
Figure 3-15 Cascade PLP Filter Output Residual with Pre-Whitening ..... 32
Figure 3-16 Polyphonic Test Signal with an Ab7 Piano Chord ..... 33
Figure 3-17 Cascade PLP and AR Filters Residual for Ab7 Chord Polyphonic Input ..... 35
Figure 3-18 Recorded Ab7 Chord with Ab Bass Note ..... 36
Figure 3-19 Residual of Cascade PLP and AR Filter for Polyphonic Input with Bass Note ..... 38
Figure 5-1 Comparison of Cascade Residual Spectrum using MATLAB and CUDA ..... 53


List of Tables

Table 3-1 Autoregressive Filter Response to Colored Noise ..... 17
Table 3-2 Signal Whitening of Synthesized Ab Note using PLP and Cascade PLP – AR Structure ..... 19
Table 3-3 Autoregressive Filter Response to Recorded Sibilance ..... 23
Table 3-4 Recorded 'Ah' Sound Cascade Filters Residual ..... 26
Table 3-5 Cascaded PLP and AR Filter Residual with and without Pre-Whitening ..... 31
Table 3-6 Polyphonic Signal Filtering ..... 34
Table 3-7 Cascade Filters Response to Polyphonic Input Signal with Bass Note ..... 37
Table 4-1 Simulation Results Comparison for the Autoregressive Filter Residual ..... 43
Table 4-2 Residual Spectrum Kurtosis for the DSP Implementation of the 3-Tap PLP Filter ..... 44
Table 4-3 DSP Processor Profiling Results Comparison ..... 45
Table 5-1 GPU Implementation of PLP Processing Time ..... 50
Table 5-2 Comparison of Signal Whitening using MATLAB and CUDA ..... 52


Glossary of Terms and Acronyms

Autoregressive model (AR) – an all-pole model for random processes

Compute Unified Device Architecture (CUDA) – parallel computing architecture developed by NVIDIA; basis of the architecture of the GPU used in this work

Graphics Processing Unit (GPU) – specialized processor for high-speed image processing, with emerging general-purpose uses that exploit its parallel architecture

MATLAB – numerical computing software package used to develop and verify algorithms

Monophonic signal – signal with a single fundamental frequency

Pitch Linear Prediction (PLP, 1-tap PLP, 3-tap PLP) – a form of modeling that depends on harmonic frequencies in the modeled spectrum; used in this work for either 1-tap or 3-tap modeling based on a suboptimal search

Polyphonic signal – signal with multiple fundamental frequencies (e.g. a piano chord)

Sibilance – unvoiced speech similar to producing the letter 's'

Whitening – filtering to produce a flattened spectrum (i.e. a spectrum similar to that of white noise)


Chapter 1 – Introduction

Linear prediction is a technique for the mathematical modeling of dynamic, time-varying systems. It has wide applications, including neurophysics (modeling of brain activity) [1], geophysics (analysis of seismic traces for oil exploration) [2], and speech applications (speech coding and audio compression) [3]. The strength of the technique lies in its simplicity under a wide range of situations. The focus of this work is the real-time application of two variants of linear prediction in an audio application.

1.1. Research Problem

The motivation for this work comes from current research in acoustic feedback cancellation (AFC) [4]. A recent survey of adaptive acoustic feedback suppression techniques from the past fifty years found that AFC produced the most promising results, in terms of maximum stable gain and sound quality, for both hearing aid and sound reinforcement systems [5]. The greatest challenge in AFC is reducing the computational complexity inherent in the high sampling rates used in audio applications [5]. This work aims to tackle the most computationally intensive aspect of the real-time implementation of the AFC algorithm.

Linear prediction models are used for closed-loop decorrelation of the audio signal in the AFC algorithm [4]. A comparison of AFC performance with various decorrelation techniques found the use of decorrelating (whitening) pre-filters to be the preferred method from both sound quality and maximum stable gain points of view [6]. A MATLAB simulation of the complete AFC algorithm found the whitening pre-filters to be the most computationally intensive aspect of the implementation of the AFC algorithm.

Figure 1-1 MATLAB Profiling of AFC Algorithm Showing Linear Prediction Time Consumption

In sound reinforcement applications, high audio quality and real-time operation are necessities. Therefore, to guarantee real-time operation of the AFC algorithm in sound reinforcement, real-time implementation of the whitening pre-filters must be resolved before the other components of the AFC algorithm. This is the focus of this work.

1.2. Autoregressive Modeling

The whitening pre-filters used in the AFC algorithm are represented by a cascade of an autoregressive (AR) filter and a pitch linear prediction (PLP) filter. In Figure 1-1, the computation time for the pitch linear prediction filter appears under 'pitch_prediction', while the autoregressive filter appears under 'autocorr' and 'levinsondurbin'. Although the autoregressive filter is not the most time consuming, it is implemented in this work due to its close relationship to the pitch prediction filter.

More generally, autoregressive modeling is the simpler of the two techniques discussed in this work and has wide use in speech coding [3,7]. Two methods are commonly used to generate the filter coefficients; the autocorrelation method is used in this work because it guarantees a stable filter [7].

1.3. Pitch Linear Prediction Modeling

The PLP filter also has wide use in speech coding applications [8]. The PLP filter models the quasi-periodicity of the tonal component of speech or audio signals. It is used in the cascade AR – PLP or PLP – AR structure to remove the quasi-periodicity of the tonal component of the signal and enhance the overall whitening of the residual spectrum.


1.4. Contributions

This work discusses the implementation of the autoregressive and pitch linear

prediction filters for audio signals for application in acoustic feedback cancellation.

The goal of this work is to present a practical implementation of the filters and to

present the applicability of these filters to real audio signals. In Chapter 2, a brief

summary of the theoretical background relating to the two filters is discussed. In

Chapter 3, three test metrics are presented, which are used to analyze the performance

of the filters against synthesized and recorded samples of speech and audio signals.

The responses of both filters are discussed in detail. In Chapter 4, a DSP processor implementation is discussed, along with the applicability of the DSP processor architecture to this algorithm. Performance results are demonstrated and discussed.

In Chapter 5, a massively parallel implementation on NVIDIA graphics processing units (GPUs) is discussed. Performance gains are demonstrated by exploiting the parallelization inherent in the PLP algorithm.

In Chapter 6, the significance of the performance gains achieved in the massively parallel implementation is discussed. The chapter concludes with a brief discussion of the complete implementation of the AFC algorithm and points for further research.


Chapter 2 – Theoretical Background

Both autoregressive (AR) and pitch linear prediction (PLP) modeling had early success in speech applications. In this chapter, the published literature detailing the two methods is summarized. The chapter concludes with considerations involving real-time operation of the filters.

2.1. Autoregressive Modeling using the Autocorrelation Method

Autoregressive modeling is a form of linear prediction that uses an all-pole system model. The first published use of this model is attributed to Yule [9], in a paper on sunspot analysis, followed by independent work by Kolmogorov and Wiener.

A more comprehensive derivation of linear prediction is included in [10]. Below is a summary of a few important practical points.

Autoregressive modeling assumes that the input signal can be modeled as a linear combination of previous outputs. The signal is assumed to be locally stationary relative to the analysis window.

Several techniques exist for computing the AR coefficients. The Yule-Walker equations compute the coefficients based on a biased estimate of the autocorrelation function [11]. The following system of equations is solved:

\[
\begin{bmatrix}
r_0 & \cdots & r_{p-1} \\
\vdots & \ddots & \vdots \\
r_{p-1} & \cdots & r_0
\end{bmatrix}
\begin{bmatrix} a_1 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} r_1 \\ \vdots \\ r_p \end{bmatrix}
\]

using the biased estimate of the autocorrelation function

\[
r_k = \frac{1}{N} \sum_{n=0}^{N-1-k} x_n\, x_{n+k}
\]

Following the AFC algorithm paper [4] and decorrelation techniques paper [6],

the AR filter order is set to nc = 30.
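For concreteness, the autocorrelation method can be sketched in a few lines. This is not the thesis's DSP or CUDA implementation (those appear in the appendices); it is a minimal pure-Python illustration of the biased autocorrelation estimate and the Levinson-Durbin recursion commonly used to solve the Toeplitz Yule-Walker system, applied to an assumed first-order synthetic test signal:

```python
import math
import random

def biased_autocorr(x, p):
    # r_k = (1/N) * sum_n x[n] * x[n+k], for k = 0..p (biased estimate)
    N = len(x)
    return [sum(x[n] * x[n + k] for n in range(N - k)) / N for k in range(p + 1)]

def levinson_durbin(r, p):
    # Solve the Yule-Walker system via the Levinson-Durbin recursion,
    # returning the prediction-error filter A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p
    a = [1.0] + [0.0] * p
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                  # reflection coefficient
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a = a_new
        err *= (1.0 - k * k)            # remaining prediction error power
    return a, err

# Example: estimate a first-order AR model from synthetic AR(1) data,
# x(n) = 0.9 x(n-1) + e(n); the error filter should approach 1 - 0.9 z^-1
random.seed(1)
x = [0.0]
for _ in range(4000):
    x.append(0.9 * x[-1] + random.gauss(0.0, 1.0))
a, err = levinson_durbin(biased_autocorr(x, 1), 1)
```

In practice the thesis uses order 30 rather than order 1; the recursion is identical, only the loop runs longer.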

2.2. 3-Tap Pitch Prediction Model

In the 3-tap pitch prediction model, we wish to model the input signal using a set of three fractionally delayed coefficients that best fit the input signal. The transfer function of the prediction error filter is given by

\[
H(z) = 1 + \beta_{-1}\, z^{-(k-1)} + \beta_0\, z^{-k} + \beta_1\, z^{-(k+1)}
\]

where k is a bulk and fractionally delayed lag parameter.

As noted in [8], the spectrum of the derived filter will have a decreasing notch-filtering depth at increasing frequency when $-1 \le a_k < (a_{k-1} + a_{k+1}) < 0$. The prediction error filter magnitude response is given by [12],

\[
\left| H(e^{j\omega}) \right|^2 =
\left[ \cos(\omega k) + \beta_0 + (\beta_{-1} + \beta_1)\cos\omega \right]^2
+ \left[ \sin(\omega k) + (\beta_{-1} - \beta_1)\sin\omega \right]^2
\]
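The 3-tap magnitude response can be checked numerically against a direct evaluation of the transfer function. The sketch below uses hypothetical coefficient values chosen only for illustration:

```python
import math
import cmath

def h_mag2(w, k, bm1, b0, b1):
    # Closed-form |H(e^{jw})|^2 for the 3-tap prediction error filter
    # H(z) = 1 + bm1*z^-(k-1) + b0*z^-k + b1*z^-(k+1)
    re = math.cos(w * k) + b0 + (bm1 + b1) * math.cos(w)
    im = math.sin(w * k) + (bm1 - b1) * math.sin(w)
    return re * re + im * im

def h_direct(w, k, bm1, b0, b1):
    # Direct evaluation of H(e^{jw}) from the transfer function
    z = cmath.exp(-1j * w)
    return 1 + bm1 * z ** (k - 1) + b0 * z ** k + b1 * z ** (k + 1)

# Spot check at an arbitrary frequency with illustrative coefficients
w, k, bm1, b0, b1 = 0.3, 5, 0.2, -0.8, 0.1
closed = h_mag2(w, k, bm1, b0, b1)
direct = abs(h_direct(w, k, bm1, b0, b1)) ** 2
```

With all taps zero the filter reduces to unity gain, which provides a second easy check.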

Page 16: Real-Time Graphics Processing Unit Implementation

7

The bulk and fractional delay k represents a delay of $T_0/T_s$, which can be realized using any of the numerous fractional delay techniques available [13]. In the AFC paper [4], an interpolation order of 8 has been suggested, which yields a resolution of 7 fractional delays between each unit delay.

Several techniques have been proposed for choosing the optimum prediction coefficients [8], [3], [14]. The prediction error signal of the three-tap fractional delay predictor is expressed as [8]

\[
e(n) = s(n) - \sum_{i=-1}^{1} \beta_i \sum_{k} p(k)\, s(n - M + i - k)
\]

where p(k) represents the fractional delay filter and M is the bulk delay. The error signal is squared and summed to produce the mean square prediction error.

The best-fit lag M of the three-tap filter is chosen from the optimal lag for the one-tap pitch predictor [3]. The one-tap predictor features a similar error function (including fractional delay) but with one coefficient instead of three, as shown in the equation above. A practical technique for finding the best one-tap filter lag is to obtain the one-tap coefficient by $\sum e^2$ minimization for the one-tap case and to filter the input signal at each candidate lag [12]. The lag yielding the lowest mean square prediction error is chosen for the three-tap coefficient derivation.
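The practical one-tap lag search described above can be sketched as follows. This is an illustrative pure-Python version over integer lags only; the fractional delay refinement through interpolation is omitted for brevity:

```python
import math

def one_tap_search(x, lag_min, lag_max):
    # For each candidate lag M, fit the single coefficient b minimizing
    # sum (x[n] - b*x[n-M])^2, then keep the lag with the lowest residual energy.
    best_lag, best_err, best_b = None, float("inf"), 0.0
    for M in range(lag_min, lag_max + 1):
        num = sum(x[n] * x[n - M] for n in range(M, len(x)))
        den = sum(x[n - M] ** 2 for n in range(M, len(x)))
        if den == 0.0:
            continue
        b = num / den                   # least-squares one-tap coefficient
        err = sum((x[n] - b * x[n - M]) ** 2 for n in range(M, len(x)))
        if err < best_err:
            best_lag, best_err, best_b = M, err, b
    return best_lag, best_b

# Example: a two-harmonic periodic test signal with a 50-sample pitch period
x = [math.sin(2 * math.pi * n / 50) + 0.5 * math.sin(4 * math.pi * n / 50)
     for n in range(500)]
lag, b = one_tap_search(x, 20, 80)
```

The harmonic term is included so that sub-multiples of the period do not also cancel the signal; a pure sinusoid would be predicted equally well at half its period with a negated coefficient.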

Minimization of $\sum e^2$ for the three-tap predictor yields three linear equations, which can be solved trivially. It should be noted that a notch-filtering depth decreasing with increasing frequency is guaranteed when the center coefficient is larger than the side coefficients ($\beta_{-1}$ and $\beta_1$) [8]. This condition can be forced by setting the $\beta_0$ coefficient to the one-tap predictor coefficient [3].
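The three linear equations are the normal equations of the least-squares fit. The sketch below is an illustrative unconstrained solve with an integer bulk delay (fractional delay omitted), not the thesis code; the 3x3 system is solved by plain Gaussian elimination:

```python
import math

def solve3(A, b):
    # Gaussian elimination with partial pivoting for a 3x3 system A x = b
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def three_tap_coeffs(x, M):
    # Normal equations for beta = (b_-1, b_0, b_1) minimizing
    # sum (x[n] - sum_i b_i * x[n - M - i])^2 over i in {-1, 0, 1}
    idx = range(M + 1, len(x) - 1)
    R = [[sum(x[n - M - i] * x[n - M - j] for n in idx) for j in (-1, 0, 1)]
         for i in (-1, 0, 1)]
    p = [sum(x[n] * x[n - M - i] for n in idx) for i in (-1, 0, 1)]
    return solve3(R, p)

# Example: for a signal exactly periodic with period M, the least-squares
# solution should approach (0, 1, 0)
x = [math.sin(2 * math.pi * n / 40) + 0.5 * math.sin(4 * math.pi * n / 40)
     for n in range(600)]
beta = three_tap_coeffs(x, 40)
```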

2.3. Issues in Real-Time Implementation

Processing of the input signal must be window based and real-time. The minimum window length corresponds to at least two periods of the lowest expected fundamental frequency. This is necessary in order to identify the input fundamental based on the prediction error.

The AFC paper [4] suggests a pitch search range from 100 Hz to 1 kHz. At a 44.1 kHz sampling rate, this corresponds to a minimum window of 882 samples.

On the other hand, an upper limit on the window size is imposed by the assumed short-term stationarity of the signal. A window of 882 samples represents a 20 ms time frame.

In addition, the computational complexity of the algorithm as a function of window length should be considered. Using an interpolation rate of 8 and a search range of 100 Hz to 1 kHz at a 44.1 kHz sampling rate, 3176 total fractional delays are searched. Using the practical approach to identifying the best lag M, this results in 3176 filtering operations to determine the prediction error.

The AFC paper suggests a window size of 40 to 50 ms. A window size of 2048 samples, corresponding to 46.4 ms, was chosen for the massively parallel implementation.
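The window-length and search-space figures above follow from short arithmetic, sketched here:

```python
fs = 44100                    # sampling rate, Hz
f_lo, f_hi = 100.0, 1000.0    # pitch search range, Hz
R = 8                         # fractional delay (interpolation) rate

# Two periods of the lowest expected fundamental set the minimum window
min_window = int(2 * fs / f_lo)                    # 882 samples

# Candidate lags span fs/f_hi (~44 samples) to fs/f_lo (441 samples);
# at 8 fractional steps per unit delay, (441 - 44) * 8 delays are searched
n_delays = (int(fs / f_lo) - int(fs / f_hi)) * R   # 3176

# A 2048-sample window at 44.1 kHz spans about 46.4 ms
window_ms = 2048 / fs * 1000.0
```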

In the next chapter, simulation experiments are done to analyze the efficacy of the

whitening filters for a variety of expected input conditions.


Chapter 3 – Autoregressive and Pitch Prediction Filters Performance

In this chapter, the AR and PLP filters are implemented and tested against four

classes of input conditions. These input signals are meant to test the two filters against

various types of possible inputs. In a practical situation, the algorithm may receive an

infinite number of combinations of input conditions. Therefore, the discussion will focus

on a few important types.

The four classes of test signals consist of speech and audio signals. The goal of the

linear prediction filters is to model the input under diverse input conditions. The inverse

signal model is then used to suppress the dominant characteristics of the signal in order to

whiten the output. Most audio signals contain a periodic component in the spectrum but typically also contain a wideband aperiodic component. The ratio between the two may not be known beforehand and may change from one sample window to the next.

The first class of input signals is composed of synthesized test signals. Two

synthesized input signals are used to test the filters. The first input signal is classified as a

colored aperiodic signal and the second input signal is a synthesized Ab musical note. The

colored aperiodic signal will test the autoregressive filter independently, while the

synthesized musical note will test both independent and cascaded combinations of the AR

and PLP filters.

The second class of inputs consists of recorded speech signals. A recorded sibilance

signal of a male voice producing the sound ‘s’ is used to test the autoregressive filter. The

second speech signal is a recorded male voice producing the sound ‘ah’. The two speech


signals are tested with both independent AR and PLP filters and the cascaded

combination.

The third class of test inputs is the monophonic audio signal. This class of input

signals represents the target class of inputs in the AFC algorithm [4]. This test signal is a

recorded Ab piano note. The algorithm specifies a cascaded AR – PLP – AR filter

combination. Pre-whitening with the AR filter ahead of the cascade PLP and AR model filters will be compared against the non-pre-whitened cascade structure.

The fourth and final class of test signals is the polyphonic audio signal. Two test

signals are used to test the cascade PLP and AR structure. The first is an Ab piano chord

and the second is an Ab piano chord with a bass note. No mention of the applicability of

the AFC algorithm to polyphonic signals is made in published AFC literature. Only a

brief analysis is done in this work. However, the extension of the whitening filters to

polyphonic signals is necessary due to the prevalence of polyphony in contemporary

music.

3.1. Test Filters Conditions

The two linear prediction filters being analyzed consist of a short (30-tap) autoregressive filter and a pitch linear prediction filter. The autoregressive filter is implemented using the autocorrelation method [7]. The PLP filter is implemented using fractional delays (interpolation order = 8) with a pitch search range from 100 Hz to 1 kHz. Two types of PLP filter are compared, the 1-tap and 3-tap PLP filters, both of which are fractionally delayed. The simpler of the two is the 1-tap PLP filter, whose frequency response has a uniform comb filter structure across the Nyquist bandwidth. The second


filter is the 3-tap filter, which finds the optimal bulk and fractional delay based on the 1-tap PLP residual and then designs a 3-tap PLP filter at the identified bulk and fractional delay (using 3 degrees of freedom in the 3-tap coefficient least squares minimization) [8].

The input signal is fractionally delayed using a polyphase interpolation FIR filter structure with a 160th-order low-pass filter. The fractionally delayed 1-tap and 3-tap filter coefficients are derived using a 20th-order delayed sinc interpolation [13].
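A minimal sketch of delayed-sinc fractional delay coefficients is shown below. This is an illustration only: it uses a plain truncated sinc, whereas a practical design such as the 20th-order one used here would typically also apply a tapering window to reduce truncation ripple:

```python
import math

def sinc_frac_delay(d, order=20):
    # Coefficients of a truncated-sinc fractional delay filter whose
    # total delay is order/2 + d samples (0 <= d < 1)
    center = order / 2 + d
    h = []
    for n in range(order + 1):
        x = n - center
        if abs(x) < 1e-12:
            h.append(1.0)               # sinc(0) = 1
        else:
            h.append(math.sin(math.pi * x) / (math.pi * x))
    return h

# With d = 0 the filter degenerates to a pure integer delay of order/2
# samples: a single unit tap at the center, near-zero taps elsewhere
h0 = sinc_frac_delay(0.0)
```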

All signals are sampled at 44.1 kHz and are 1024 samples in length (representing a sample window of 23 ms). Stationarity of the signal is assumed at these window lengths, while no long-term stationarity is assumed. Therefore, each window's pitch identification search is done independently of previous iterations.

3.2. Test Metrics

Three primary test metrics are used to determine the efficacy of the whitening filters.

The first metric is the kurtosis of the residual signal spectrum. This measure determines the degree to which the probability mass is distributed between the shoulders of the distribution and its center [15]. Formally, it is defined as

\[
k = \frac{E\left[(x - \mu)^4\right]}{\sigma^4}
\]

It is also known as the standardized fourth moment. It is used here to measure how outlier-prone the distribution of the spectrum of the residual signal is. The fourth


power in the formula results in a wide variation in kurtosis across the test signals (from single digits to hundreds). The normal distribution has a kurtosis of 3. Lower values of kurtosis signify a whiter residual.
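A direct computation of this metric, as a sketch in pure Python (the thesis experiments themselves use MATLAB):

```python
def kurtosis(x):
    """Standardized fourth moment: E[(x - mu)^4] / sigma^4."""
    n = len(x)
    mu = sum(x) / n
    m2 = sum((v - mu) ** 2 for v in x) / n   # variance (sigma^2)
    m4 = sum((v - mu) ** 4 for v in x) / n   # fourth central moment
    return m4 / (m2 ** 2)
```

Applied to the magnitude spectrum of the residual, low values indicate that no spectral bins stand far out from the rest, i.e. a whiter residual.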

The second metric, the residual autocorrelation power weight (RAPW), measures the degree of aperiodicity of the residual in the autocorrelation power domain. It is the ratio of the power of the zero-lag autocorrelation to the mean power of all remaining autocorrelation lags. Higher values signify a whiter residual spectrum.

RAPW = |autocorr(x, 0)|^2 / ( (1/(N−1)) Σ_{k=1}^{N−1} |autocorr(x, k)|^2 )

where autocorr(x, k) is the autocorrelation of the residual x at lag k and N is the window length.
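A literal computation of RAPW from this definition can be sketched as follows (function name hypothetical; a one-sided, biased autocorrelation estimate is assumed):

```python
def rapw(x):
    """Residual autocorrelation power weight (sketch): power at lag 0
    over the mean power of all remaining autocorrelation lags."""
    n = len(x)
    # biased autocorrelation estimate for lags 0 .. n-1
    r = [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(n)]
    tail = sum(v * v for v in r[1:]) / (n - 1)   # mean power, lags >= 1
    return r[0] ** 2 / tail
```

A whiter residual concentrates its autocorrelation at lag zero, which raises the ratio.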

The third and final test metric is the residual spectral flatness measure (SFM). This measure was introduced by Gray and Markel [16] and is common in the audio-signal whitening literature [12]. It examines the average spread of the spectrum in the frequency domain.

SFM = exp( (1/N) Σ_{k=0}^{N−1} ln |X(e^{j2πk/N})|^2 ) / ( (1/N) Σ_{k=0}^{N−1} |X(e^{j2πk/N})|^2 )

It is normalized so that a white residual spectrum has an SFM of 1. Values of SFM

are always positive.
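A direct (if slow, O(N²)-DFT) sketch of the discrete SFM, the ratio of the geometric to the arithmetic mean of the power spectrum:

```python
import cmath
import math

def sfm(x):
    """Spectral flatness (sketch): geometric mean of the power spectrum
    over its arithmetic mean; equals 1 for a flat (white) spectrum.
    Assumes no spectral bin is exactly zero."""
    n = len(x)
    power = []
    for k in range(n):
        X = sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        power.append(abs(X) ** 2)
    geo = math.exp(sum(math.log(p) for p in power) / n)
    return geo / (sum(power) / n)
```

An impulse has a perfectly flat spectrum and therefore an SFM of 1; a strongly tonal signal concentrates power in a few bins and drives the SFM toward 0.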


3.3. Synthesized Test Signals

The AR and PLP filters are first tested using synthesized models of real inputs. If the filters perform well against these test signals, real input signals can then be used to test them. This ensures that the filters behave as expected against the modeled test signals. The first class of test signals consists of synthesized colored noise and a synthesized Ab musical note.

3.3.1. Colored Noise

The first category of input signals is colored noise. In a practical situation, the source of noise can be electrical or acoustic. In reality, the noise signal itself may be a desirable aperiodic aspect of the input signal (e.g., guitar distortion). In acoustic musical instruments, the presence of the aperiodic signal marks the difference in quality between two instruments of the same type, or between two different musicians. It may be helpful to identify colored noise in this context as the wideband aperiodic signal over which the pitch harmonics of the tone dominate when viewed in the frequency domain.


Figure 3-1 Colored Noise Source using a Butterworth Filter fc = 3 kHz

In speech applications, the noise signal can arise from physical characteristics of the vocal tract in addition to the desired aperiodic sound that the speaker is producing. An example of this type of sound is the sound produced by the letter ‘s’, which is referred to as sibilance. In the analysis that follows, the recorded sibilance sample contains a male voice producing the sound ‘s’ as part of the word ‘eins’ (German for ‘one’).


3.3.1.1. Autoregressive Filter Response to Synthesized Colored Noise

Using colored noise as an input, the AR filter was tested using a 30-tap filter as in [4]. The frequency response of the filter is shown below, followed by the pole-zero plot.

Figure 3-2 Frequency Response of the AR Filter with Colored Noise Input

The frequency response of the AR filter shows that the filter has correctly identified the envelope of the spectrum. The lower-frequency region of the input signal has higher power, while the higher-frequency region has lower power. The residual spectrum is therefore expected to be flat across all frequencies.


Figure 3-3 Pole-Zero Plot Demonstrating AR Filter Response

The distribution of the zeros on the unit circle demonstrates the wideband characteristic of the filter. However, the plot also shows that numerical accuracy is a critical issue, due to the proximity of the zeros to the unit circle. Numerical errors can cause instability in the practical implementation's filter response. The residual plot is shown below. The signal shows a peak suppression of 25 dB but, more importantly, a flattened overall residual.


Figure 3-4 AR Filter Residual Signal Showing a Flattened Spectrum

The residual signal spectrum is analyzed using the three metrics described above. All metrics agree that an improvement has been made in the whitened residual signal.

Table 3-1 Autoregressive Filter Response to Colored Noise

RAPW Kurtosis SFM

Input Signal 276 8.417 0.214

AR Output Signal 1065 3.049 0.494


3.3.2. Synthesized Ab Note

A synthesized musical note is modeled to represent an Ab note on a modern equal-tempered piano. Relative to middle A (A4, the fourth-octave note ‘A’), the note Ab (one semitone lower) has a fundamental frequency of 415 Hz [17]. The synthesized note thus has a fundamental in the middle range of the PLP search bandwidth. The synthesized note is designed with five harmonics in total (including the fundamental), each decreasing by 3 dB in amplitude.
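The synthesized note described above can be sketched as follows (the exact phases and normalization used in the thesis experiments are assumptions here):

```python
import math

def synth_ab(n_samples=1024, fs=44100.0, f0=415.0):
    """Synthesized Ab note (sketch): fundamental plus four harmonics,
    each 3 dB lower in amplitude than the previous one."""
    x = []
    for n in range(n_samples):
        t = n / fs
        s = 0.0
        for h in range(5):                     # harmonics 1..5
            amp = 10 ** (-3.0 * h / 20.0)      # -3 dB per harmonic step
            s += amp * math.sin(2 * math.pi * (h + 1) * f0 * t)
        x.append(s)
    return x
```

One 1024-sample window at 44.1 kHz holds roughly 9.6 periods of the 415 Hz fundamental.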

Figure 3-5 Synthesized Ab Note Frequency Spectrum


3.3.2.1. Cascade Filter Response

Due to the presence of tonal components in the synthesized audio signal, the signal is whitened using a cascade of the pitch and AR filters. The signal shows significant improvement in all test metrics, which confirms the efficacy of the PLP filter for the modeled tonal signal. The recorded Ab note example in the next section also compares the cascade structure with the addition of a pre-whitening AR filter. Nevertheless, the results below show significant improvement from the two-stage structure. Except for the residual spectrum kurtosis, the cascaded 3-tap PLP and AR filters show the best overall results.

Table 3-2 Signal Whitening of Synthesized Ab Note using PLP and Cascade PLP – AR Structure

RAPW Kurtosis SFM

Input Signal 10.86 198.9 0.115

1T PLP 53.94 44.34 0.116

3Ts PLP 159.0 16.87 0.474

1T PLP – AR 1362 4.245 0.799

3Ts PLP – AR 2256 4.972 0.879


Figure 3-6 Residual Spectrum after Cascaded PLP and AR Filters Whitening

Note that the strong peak in the residual spectrum shown above does not correspond

to the original pitch harmonics.

3.4. Recorded Test Signals

Tests with real input signals are done to ensure that the designed filters serve their intended real-world purpose. The comb-filter characteristic of the PLP filter demonstrated good performance when applied to the modeled tonal signal. However, deviations of real signals from the model determine the practical efficacy of the linear prediction filters.


A set of speech signals is tested first. Both the PLP and AR filters were first introduced for speech filtering, enjoy wide use in speech coding [8], and are expected to perform well with recorded speech signals. Recorded musical notes are then tested to determine the efficacy of the approach for music signals. The recorded music signals are all piano recordings and comprise both monophonic and polyphonic test signals.

3.4.1. Recorded Speech Signals

Two recorded male-voice speech signals are used. The first signal, a recorded sibilance, is used to test the autoregressive filter. The second signal, a vocalization of the vowel ‘ah’, is used to test the PLP filter as well.

3.4.1.1. Speech Sibilance Signal

This signal comprises the sound of the letter ‘s’ in the word ‘eins’ (German for ‘one’). The signal does not have a definite perceived pitch. Below is the spectrum of the input signal. This signal is filtered using the AR filter independently, in addition to the cascaded PLP and AR combination.


Figure 3-7 Recorded 's' Sound from a Male Voice

3.4.1.1.1. Autoregressive Filter Response to Recorded ‘s’ Sound

After AR filtering, the signal shows a large improvement in overall whitening. The 30-tap AR filter was able to model the envelope characteristics of the signal, which shows higher spectral complexity compared to the synthesized colored noise signal. The cascaded PLP and AR filters are also compared in order to determine whether the structure can remain fixed without consideration for the type of input signal received.


Table 3-3 Autoregressive Filter Response to Recorded Sibilance

RAPW Kurtosis SFM

Input Signal 37.76 65.18 0.268

AR Filter 1256 4.512 0.821

AR – AR 2100 3.475 0.851

1T PLP 77.03 31.49 0.245

1T PLP – AR 1411 4.178 0.817

3Ts PLP – AR 1365 3.640 0.786

The results above show that, due to the lack of a strong periodic component in the input signal, most of the signal whitening was done by the AR filter. Nonetheless, the addition of the PLP filter did not have a detrimental effect on the output residual. This test shows that the cascade structure intended for audio signals will also perform well with speech signals. Although the 3-tap PLP structure did not have the best results for all three measures, the results show some improvement over the single-filter case except in the spectral flatness measure. The cascaded two-stage autoregressive filter (filtering using the previous set of coefficients and then the current set) resulted in the best spectral flatness. This confirms that the filters operate according to the structure of the input signal.


Figure 3-8 Whitened Recorded 's' Signal Filtered using Autoregressive Filter

The figure above confirms the analysis: in the case of the recorded sibilance signal, no strongly periodic component appears to be present.

The next audio sample is a speech signal with a periodic component in its spectrum.

3.4.1.2. Speech Vowel Signal

The second recorded signal comprises the ‘ah’ vocalization at the beginning of the word ‘eins’ (IPA: aɪ̯ns [18]). This signal is recorded from a male voice and is used to test the cascaded PLP and AR filter structures.


Figure 3-9 Recorded 'ah' Sound from a Male Voice Input Signal Spectrum

3.4.1.2.1. Autoregressive Filter Response to Recorded ‘ah’ Sound

The cascade structure of the AR and PLP filters determines the overall response of the whitening filters.

Figure 3-10 Cascade PLP and Autoregressive Filters Structure



In this configuration, the 3-tap PLP filter residual is applied to the AR filter. However, pre-whitening is often applied in speech [3] and audio [3] applications to flatten the (often decaying) spectrum of the input signal. The pre-whitening filter is an AR filter with coefficients from the previous sample window.

Figure 3-11 Cascade Filter Structure with Pre-Whitening Filter

The following table shows signal whitening after single and multistage filtering.

Table 3-4 Recorded 'Ah' Sound Cascade Filters Residual

RAPW Kurtosis SFM

Input Signal 44.96 47.63 0.276

AR Filter 366.74 13.18 0.776

1T PLP 84.25 24.19 0.143

1T PLP – AR 2242 2.955 0.844

3Ts PLP – AR 2119 3.658 0.851

AR – 1T PLP – AR 2555 2.921 0.872

AR - 3Ts PLP - AR 2772 3.568 0.888



The table above shows that pre-whitening yields the best results for the speech signal with a tonal component. It is interesting to note that the 1-tap PLP filter achieved results comparable to the 3-tap PLP filter. However, pre-whitening had a positive effect on both the cascaded 1-tap and 3-tap PLP filters. It seems that the 3-tap suboptimal-search filter does not yield a substantial performance improvement in the case of a tonal speech signal. Below is the spectrum of the signal after AR filtering and also after the cascaded AR-PLP-AR structure.

Figure 3-12 Residual Spectrum of ‘ah’ Vocalization after AR Filtering


The figure above shows the persistence of the tonal components in the signal spectrum after the AR filter. This is expected, since the AR filter is meant to flatten the general envelope of the complete spectrum.

The figure below shows lower energy in the lower part of the frequency range. This reflects the action of the 3-tap PLP filter, which suppressed the voiced tonal component, followed by the final stage of AR filtering. In addition, the overall signal dynamic range is greatly reduced. This is the desired response from the cascade filter structure.

Figure 3-13 Residual Spectrum of Cascade Filters of Recorded 'ah' Sound


The mixture of strongly periodic characteristics and a decaying broadband aperiodic component is typical of many recorded signals, as will be evident from the following audio samples. In the example above, the final output of the cascaded filters shows a dynamic range of approximately 15 dB, compared to the input signal dynamic range of 55 dB.

3.4.2. Recorded Musical Notes

First, analysis of monophonic input signals is discussed, followed by polyphonic audio samples. Musical notes are expected to have a more complex spectrum than speech signals. The strength of the periodic components relative to the aperiodic components is another factor that should be identified.

3.4.2.1. Monophonic Audio Signal - Piano Note

An audio sample of an actual piano Ab note was recorded. The input signal spectrum is shown below. Note the mixture of odd and even harmonics along with their relative intensities.


Figure 3-14 Recorded Ab Piano Note

The spectrum shows a complex harmonic structure that extends beyond 15 harmonics. The fundamental is at approximately 415 Hz, which is within the search range of the PLP filter. The input signal dynamic range is approximately 70 dB.

3.4.2.1.1. Cascade Pitch and Autoregressive Response – Monophonic Input Signal

Below is a table showing the results of cascade filtering with and without the pre-whitening filter. The cascade structure with pre-whitening shows a slight improvement in the overall whitening of the residual spectrum. Nonetheless, the 3-tap PLP outperforms the 1-tap PLP in both the pre-whitened and non-pre-whitened cases.

Table 3-5 Cascaded PLP and AR Filter Residual with and without Pre-Whitening

RAPW Kurtosis SFM

Input Signal 24.62 87.52 0.257

AR Filter 90.502 32.01 0.670

1T PLP 83.95 25.69 0.115

3Ts PLP 110.3 19.15 0.107

1T PLP – AR 713.93 9.584 0.774

3Ts PLP – AR 1290 10.69 0.845

AR – 1T PLP – AR 982.5 9.481 0.828

AR - 3Ts PLP - AR 1550 9.250 0.865

In the figure below, the PLP filter output shows that the harmonics of the recorded Ab piano note are not exact integer harmonics, as modeled in the synthesized monophonic signal. This is an important realization: the filter behaves as designed, but the actual signal does not behave as the ideal model. Filtering of the first few harmonics appears to be effective; however, at higher frequencies the harmonics deviate from integer multiples of the fundamental.


Figure 3-15 Cascade PLP Filter Output Residual with Pre-Whitening

Error due to deviations in phase at higher frequencies, relative to the fundamental, results in a larger difference in the identified frequencies. This is why the first few harmonics have been effectively suppressed compared to the higher harmonics. Nonetheless, the 3-tap filter with its variable envelope appears to be the appropriate choice when compared to the constant-notch 1-tap filter.


3.4.2.2. Polyphonic Audio Signal – Piano Chord

Given the predominance of polyphony in music, it is important to test the filters with this most common audio signal type. The comb-filtering structure may be sufficient to suppress the strongest harmonics in the polyphonic signal.

Below is the spectrum of an Ab7 piano chord. The spectrum shows a mixture of harmonics. However, since the PLP filter converges on the lowest prediction-error estimate, it should be able to suppress the strongest harmonics of the polyphonic signal.

Figure 3-16 Polyphonic Test Signal with an Ab7 Piano Chord


3.4.2.2.1. Cascade Pitch and Autoregressive Filters Response – Polyphonic Input

The piano chord is filtered both with and without pre-whitening. Results are shown below.

Table 3-6 Polyphonic Signal Filtering

RAPW Kurtosis SFM

Input Signal 20.41 104.2 0.111

AR Filter 912.7 17.73 0.888

1T PLP 41.15 52.58 0.162

3Ts PLP 48.02 45.20 0.210

1T PLP – AR 999.6 8.369 0.824

3Ts PLP – AR 1388 7.817 0.859

AR – 1T PLP – AR 1560 12.48 0.859

AR - 3Ts PLP - AR 2048 9.180 0.888

A significant reduction in kurtosis is shown, comparable to the monophonic signal case. The residual spectrum still retains much of its harmonic quality, although the spectrum is significantly flattened. The final results using the pre-whitened 3-tap filter cascade structure are comparable to the monophonic case. This indicates that the AFC algorithm may perform well with polyphonic input signals.


Figure 3-17 Cascade PLP and AR Filters Residual for Ab7 Chord Polyphonic Input

3.4.2.3. Polyphonic Audio Signal – Piano Chord with Bass Note

In addition, a piano chord (the same as in the previous section) with a bass note is used to test whether the bass note helps the PLP algorithm's identification. The bass note is an Ab2 (second-octave Ab, with an approximately 103 Hz fundamental frequency [17]).


Figure 3-18 Recorded Ab7 Chord with Ab Bass Note

3.4.2.3.1. Cascade Pitch and Autoregressive Filters – Polyphonic Input with Bass Note

The polyphonic input signal is tested in the cascade configuration with and without

pre-whitening. The results show that pre-whitening had a small positive effect in RAPW

and spectral flatness but had a slight negative effect in terms of residual spectrum

kurtosis. However, overall results are comparable to the monophonic signal test case.


Table 3-7 Cascade Filters Response to Polyphonic Input Signal with Bass Note

RAPW Kurtosis SFM

Input Signal 45.22 47.38 0.218

AR Filter 123.2 21.71 0.614

1T PLP 63.97 32.83 0.059

3Ts PLP 63.00 33.41 0.055

1T PLP – AR 1273 6.392 0.810

3Ts PLP – AR 1799 5.060 0.845

AR – 1T PLP – AR 1889 7.094 0.879

AR - 3Ts PLP - AR 1864 5.160 0.849

The figure below shows that, similarly to the previous polyphonic test case, the harmonic content remained after overall filtering. Only a small difference between the pre-whitened and non-pre-whitened structures is seen in terms of RAPW. However, the overall dynamic range of the signal is greatly reduced.


Figure 3-19 Residual of Cascade PLP and AR Filter for Polyphonic Input with Bass Note

3.5. Discussion

In this chapter, a diverse range of inputs was applied to the PLP and the

autoregressive filters. In addition, efficacy of the pre-whitening filter was tested for the

tonal audio samples.

This chapter demonstrated that the pre-whitened cascade 3-tap PLP and

autoregressive filters had, in most cases, the best overall whitened spectrum with respect

to all three test metrics. Experimental results have shown that this structure had

comparable overall whitening for the monophonic and polyphonic test cases. This shows

that the cascade model can be used for both monophonic and polyphonic input signals.


One important finding is that the harmonics of recorded audio signals are not necessarily related by integer multiples. This degraded the performance of the comb filter at higher frequencies. One possible reason may be the frequency resolution of the interpolated 3-tap PLP filter. Karplus and Strong [19] have shown that musical instrument modeling can be achieved using fractional delays and 30 sine-wave generators to produce a realistic timbre. Modifying the comb filter to have a wider bandwidth and increased suppression would be desirable for real test signals, because the recorded monophonic and polyphonic signals have shown that the tonal components of the test signals are significantly stronger than the aperiodic components.


Chapter 4 – DSP Implementation

Migration of the linear prediction filters to an embedded DSP processor depends on the capability and resources available in the embedded architecture. The autoregressive filter requires solving the Yule-Walker equations using matrix inversion. The 3-tap PLP filter requires the calculation of the residual mean-square prediction error for each search interval in order to find the best-fit coefficients.

4.1. Challenges in Implementation

The autoregressive filter does not require as much memory as the PLP filter. However, it requires inverting a 30×30 reflection-coefficient matrix [19]. This can represent a significant amount of computation, which may prevent the algorithm from achieving real-time performance.

Two matrix inversion methods are investigated: the first is the Levinson-Durbin recursion, which requires O(n²) computations [3]; the second is the Gauss-Jordan method, which requires O(n³) computations [20].
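For reference, the O(n²) Levinson-Durbin recursion for the Yule-Walker equations can be sketched as follows (pure Python rather than the thesis's C implementation; the sign convention assumes the predictor x_hat[n] = Σ a[i]·x[n−i]):

```python
def levinson_durbin(r, order):
    """Levinson-Durbin recursion (sketch): solves the Yule-Walker
    equations for the AR predictor x_hat[n] = sum_i a[i]*x[n-i]
    in O(order^2) operations, given autocorrelations r[0..order]."""
    a = [0.0] * (order + 1)       # a[0] is implicitly 1 and unused
    e = r[0]                      # prediction error power
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / e               # reflection coefficient of stage m
        a_prev = a[:]
        a[m] = k
        for i in range(1, m):
            a[i] = a_prev[i] - k * a_prev[m - i]
        e *= 1.0 - k * k          # error power shrinks at each stage
    return a[1:], e
```

For an ideal AR(1) autocorrelation sequence r = [1, 0.5, 0.25], an order-2 fit recovers the single coefficient 0.5 (the second coefficient is zero) with error power 0.75.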

In the 3-tap PLP filter implementation, memory consumption is an important issue because of the large search region of the filter and the 8× interpolation that is required. Many implementations exist [21] for fractional interpolation. A polyphase FIR was chosen because one filter can be used to produce all eight fractional delays. It is important to note that the required FIR input and output signal length is eight times the original window size (a 1024-sample window in the DSP implementation results in an 8192-sample output signal). On the other hand, this implementation makes all eight


fractional delays available for the entire search region using one filter (through multiple starting positions in the output and incrementing the address by the interpolation order).
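This stride-addressing scheme can be sketched as follows; the buffer is assumed to already hold the 8× interpolated signal, and `fractional_branch` is a hypothetical name:

```python
def fractional_branch(upsampled, phase, interp=8):
    """After interp-times interpolation, reading every interp-th sample
    starting at offset `phase` yields the input offset by phase/interp
    of a sample -- one polyphase branch per fractional delay."""
    return upsampled[phase::interp]
```

All eight fractional-delay versions of the window are thus views into one interpolated buffer, selected by the starting offset alone.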

4.2. Target Architecture

The chosen processor for the hardware implementation is the Analog Devices SHARC 21369, a 400 MHz floating-point DSP processor. The processor uses a modified Harvard architecture with separate data and instruction buses. The processor supports SIMD (Single Instruction, Multiple Data), which is beneficial for fast FIR processing. It contains two computational units that allow simultaneous computation of an instruction on two sets of data. The combination of SIMD and the modified Harvard architecture allows four operand fetches and one instruction fetch in a single cycle.

On-chip memory is made up of 2 Mbit of shared program and data memory (a total of 65k 32-bit words). In order to exploit SIMD, data and program instructions must be located in their respective memory regions. The processor also contains two data address generators that support circular buffers in hardware.

4.3. DSP Processor vs FPGA

The floating-point DSP processor was chosen over an FPGA implementation. In addition to the difficulty presented by a fixed-point implementation, FPGAs tend to be slower than DSP processors in terms of core clock speeds, and parallel computation would not have been possible with a low-cost FPGA device. Xu et al. [13]


have shown that an implementation of the Levinson-Durbin algorithm for coefficient calculation in speech applications consumes 16,254 Configurable Logic Blocks (CLBs) on a Xilinx Virtex-E device. In their implementation, the maximum clock frequency was limited to 13.4 MHz. Moreover, the 3-tap PLP filter has significantly higher computational complexity than the autoregressive filter. Therefore, a DSP processor implementation is more likely to achieve real-time performance.

4.4. DSP Processor Performance Results

In the DSP processor implementation, the window length was limited to 1024 samples in order to reduce overall memory consumption. The window size allows holding in memory up to two full periods of the lowest frequency in the search range (100 Hz, i.e., 441 samples at a 44.1 kHz sample rate). The MATLAB comparison results are generated using the same 1024-sample window.

The polyphase interpolation ratio was kept at 8×, as in the MATLAB simulations of the previous chapter. The autoregressive filter was set to 30 taps, and the 3-tap PLP filter was set to search from 44 to 441 samples (with 8 fractional delays) using 3 degrees of freedom in the coefficient estimation.

4.4.1. Recorded Sibilance – AR Filter Testing

The recorded male-voice vocalization of ‘s’, as analyzed in the previous chapter, was used for the autoregressive filter testing. The filter was tested using a 1024-sample window.


Calculation of the autoregressive filter coefficients was done using Levinson-Durbin in the MATLAB implementation. In the DSP implementation, both the Levinson-Durbin and the Gauss-Jordan methods were implemented. Experimental tests have shown that the Levinson-Durbin method on the DSP processor was more susceptible to computational errors. This is most likely because the recursion refines the current estimate of the matrix inverse based on the previous estimate, so computational errors can accumulate across iterations. Nonetheless, the DSP implementation showed lower residual kurtosis for both the Levinson-Durbin and Gauss-Jordan methods, and identical residual mean-squared prediction error, compared to the MATLAB implementation.

Table 4-1 Simulation Results Comparison for the Autoregressive Filter Residual

Input Signal Kurtosis AR Residual Kurtosis

MATLAB 84.64 4.629

DSP Levinson-Durbin 84.64 4.628

DSP Gauss-Jordan 84.64 4.625

Kurtosis was the only test metric used during preliminary testing. Further testing was

not conducted after implementation of the PLP filter.


4.4.2. Recorded Ab Piano Note

The PLP search method used for the 3-tap PLP filter in the MATLAB simulation is based on choosing the lowest 1-tap residual prediction error over all fractional and integer delays. The lowest prediction-error lag is then used to find the 3-tap coefficients. Previous research [3,8] has shown that this suboptimal search method for the 3-tap coefficients yields acceptable results at lower computational cost. Nonetheless, it can present a significant computational load on an embedded processor. Below are results based on the residual spectrum of the 3-tap PLP filter.

Table 4-2 Residual Spectrum Kurtosis for the DSP Implementation of the 3-Tap PLP Filter

Input Signal Kurtosis Residual Signal Kurtosis

MATLAB 3-Tap PLP 99.64 20.52

DSP 3-Tap PLP 99.64 20.72

4.4.3. Processor Profiling

Processor profiling was done to determine the amount of time required to compute the coefficients for the AR and PLP filters. Since processing is window based, the filter coefficients must be computed before the next window is ready. At 44.1 kHz, 23.22 ms are available for computation of the coefficients. Below are results from the DSP implementation of the autoregressive and 3-tap PLP filters. MATLAB computation


results are tabulated for relative comparison (obtained on a 2008-model MacBook Pro with a 2.4 GHz Intel Core 2 Duo processor and 4 GB of RAM).

Table 4-3 DSP Processor Profiling Results Comparison

30-tap AR (Levinson-Durbin) 30-tap AR (Gauss-Jordan) 3-tap PLP (Suboptimal)

MATLAB 0.008 s 0.014 s 0.838 s

DSP 1.927e-4 s 8.423e-4 s 4.119 s

These results are at the 400 MHz DSP processor speed. Calculation of the 30-tap autoregressive coefficients completes well within the available processing time. The results show that the DSP processor is able to compute these coefficients much faster than the MATLAB implementation.

In the case of the PLP filter, the coefficient computation speed is very poor. This is because most of the DSP processor's features are severely degraded if all the required data are not available in on-chip RAM. The combination of the large data lengths and the on-chip RAM segmentation necessitated the use of external SDRAM memory, which resulted in a significant degradation in processor performance. These results show that the DSP processor may not be the best candidate for the PLP algorithm.


4.5. Problems Encountered

Most problems in the DSP implementation were related to memory constraints. Since

the processor is a floating-point processor, few numerical issues were encountered.

Nonetheless, hardware division accuracy was a problem when computing the

autoregressive filter coefficients.

4.5.1. Memory Segmentation

The processor specification sheet lists the DSP chip as having 2 Mbit of RAM, which should be sufficient for this algorithm. Unfortunately, the memory is segmented into four blocks. Two blocks hold 0.75 Mbit of RAM each, while the other two hold 0.25 Mbit each. The two 0.25 Mbit memory blocks hold the program stack and heap (separately). In addition, on-chip RAM is used to store the program itself: one of the 0.75 Mbit memory blocks holds the program code. This combination made memory management very challenging. The newer-generation SHARC processors, although they still run at a maximum of 400 MHz, contain 5 Mbit of on-board memory (separated into two memory blocks), in addition to having FIR, IIR, and FFT hardware accelerators.

4.5.2. Stack Overflow

The DSP processor experienced stack overflow only when a function containing large data vectors was called. The problem disappeared when the large data vectors were declared as global variables, even though they were nonetheless mapped to the same memory segment. This issue is not documented.


Data had to be expanded into SDRAM. The SDRAM clock runs at 133 MHz, and it takes multiple clock cycles to transfer data from external memory to the core.

4.5.3. Hardware Division

Hardware division is implemented on the DSP processor using the Newton-Raphson method. This allows successive approximation of the inverse of the divisor, which is then multiplied by the dividend. Sufficient numerical accuracy was achieved using one iteration of the loop (providing approximately 1e-10 precision).
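The iteration can be sketched as follows; the function name is hypothetical, and a software seed (a linear minimax fit over [0.5, 1)) stands in for the processor's hardware seed table, so a few more iterations are needed than on the DSP:

```python
import math

def nr_reciprocal(d, iterations=3):
    """Newton-Raphson reciprocal (sketch): x <- x*(2 - d*x) converges
    quadratically to 1/d; the quotient is then dividend * (1/d)."""
    m, e = math.frexp(d)                  # d = m * 2**e with 0.5 <= m < 1
    x = 48.0 / 17.0 - 32.0 / 17.0 * m     # linear seed estimate of 1/m
    for _ in range(iterations):
        x = x * (2.0 - m * x)             # Newton step; error squares
    return math.ldexp(x, -e)              # undo the scaling: 1/d
```

Each iteration roughly doubles the number of correct digits, which is why a good hardware seed reaches ~1e-10 precision in a single step.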

4.6. Discussion

This chapter demonstrated the DSP implementation of the autoregressive and PLP filters. Performance results of the AR and PLP filters are comparable to the MATLAB implementation, which verifies the implementation of the whitening filters on the DSP processor.

Although the autoregressive filter showed significant speed gains compared to the MATLAB implementation, the PLP filter implementation showed severely degraded computational speed. This is due to the use of external memory, which is necessitated by the large amount of data required in the computation of the algorithm. A different type of interpolation filter might be attempted to reduce the memory requirements of the algorithm, although the current processing time is far too high for a practical real-time implementation.


Chapter 5 – GPU Implementation

Following a recent paper on audio signal processing using graphics

processing units [23], implementation of the whitening filters on GPUs was investigated.

Conceptually, the PLP algorithm seems well suited to parallelization, because the

residual mean square error is computed independently for each bulk and fractional delay.

5.1. Target Architecture

Graphics processing units are a class of massively parallel computational machines

designed for high-throughput graphics applications. They have enjoyed wide use in

scientific applications that are not necessarily related to image processing. Differences

exist between GPUs from competing manufacturers. The chosen GPU is the NVIDIA

GeForce GTX 460 with 768 MB of GDDR5 onboard RAM. The GTX 460 has 336 cores

operating at 675 MHz, and its GDDR5 device memory bandwidth is 86.4 GB/sec.

NVIDIA GPUs are programmed using the CUDA (Compute Unified Device

Architecture) programming environment. NVIDIA GPU devices are listed in categories

identified by their compute capability. The GTX 460 has compute capability 2.1, which

supports 64-bit floating-point arithmetic. CUDA code is compiled using NVIDIA nvcc,

while runtime C code is compiled using Microsoft Visual Studio 2008. Runtime

breakpoints are available only when one video card is dedicated to algorithm development

and a second to video display.


5.2. Algorithm Implementation

GPU programming consists of transferring data from host memory to GPU

memory, followed by kernel execution on the GPU. Therefore, computation moved to

the GPU must yield a performance gain large enough to offset the cost of

transferring data to and from host memory. NVIDIA compute capability 2.1 devices

are capable of overlapping data transfer with kernel execution, although this feature

was not used in the implementation code.

Analysis of the PLP and AR whitening filters shows that substantially more

parallelism can be exploited in the PLP filter calculation than in the AR filter. The AR

filter is much simpler in terms of computation, as is evident in the SHARC DSP

processor implementation. Nonetheless, both algorithms were implemented.

5.2.1. PLP Filter Implementation

The PLP filter implementation is done in three stages. The first stage is the

calculation of the one-tap filter coefficient for each fractional and bulk delay in the

search range (100 Hz - 1 kHz). The second stage is the calculation of the residual

mean square error for all fractional and bulk delays. The final stage is the

identification of the minimum-error fractional delay and the generation of the 3-tap

filter coefficients. Since the 3-tap filter coefficients only require a 3x3 matrix

inversion, this calculation was done on the CPU.
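The 3-tap step reduces to a plain 3x3 linear solve, cheap enough to leave on the CPU. The sketch below is illustrative, not the thesis code: `solve3x3` is a hypothetical helper, and Gaussian elimination with partial pivoting is used in place of an explicit matrix inversion.

```c
#include <math.h>

/* Solve the 3x3 system A x = b by Gaussian elimination with partial
 * pivoting.  A and b are overwritten in place; returns -1 if singular. */
static int solve3x3(float A[3][3], float b[3], float x[3])
{
    for (int c = 0; c < 3; ++c) {
        int piv = c;                           /* pick largest pivot in column */
        for (int r = c + 1; r < 3; ++r)
            if (fabsf(A[r][c]) > fabsf(A[piv][c])) piv = r;
        if (A[piv][c] == 0.0f) return -1;      /* singular system */
        if (piv != c) {                        /* swap rows c and piv */
            for (int k = 0; k < 3; ++k) { float t = A[c][k]; A[c][k] = A[piv][k]; A[piv][k] = t; }
            float t = b[c]; b[c] = b[piv]; b[piv] = t;
        }
        for (int r = c + 1; r < 3; ++r) {      /* eliminate below the pivot */
            float f = A[r][c] / A[c][c];
            for (int k = c; k < 3; ++k) A[r][k] -= f * A[c][k];
            b[r] -= f * b[c];
        }
    }
    for (int r = 2; r >= 0; --r) {             /* back substitution */
        float s = b[r];
        for (int k = r + 1; k < 3; ++k) s -= A[r][k] * x[k];
        x[r] = s / A[r][r];
    }
    return 0;
}
```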

A polyphase fractional delay filter was used to fractionally delay the input signal.

The input window size was set to 2048 samples with 25% overlap. Filtering for the

mean square error computation was done using frequency-domain convolution for each

bulk and fractional delay. Below is a table with the average computation time for

each step. Note that computation time varies slightly from experiment to experiment

(within tens of milliseconds) due to operating system interrupts that affect program flow.

Table 5-1 GPU Implementation of PLP Processing Time

Method                                                 Computation Time
Polyphase FIR                                          0.6 ms
One Tap Filter Coefficients
  (for all bulk and fractional delays)                 0.8 ms
Residual Convolution
  (for all bulk and fractional delays)                 4.2 ms
Residual MSE Calculation                               8.5 ms
Total PLP Computation Time                             14.9 ms

Residual convolution is performed as a batched FFT convolution over 3176 windows

(the 100 Hz to 1 kHz search range with 8 fractional delays at 44.1 kHz) of 2048

samples each. This requires a large amount of memory (approximately 50 MB when

using a 32-bit float for the real and imaginary components of each frequency bin), which

is readily available in the GPU device memory (GDDR memory).

The long processing time associated with the residual MSE calculation is most likely

due to inefficient memory accesses. Although further optimization could have improved

the efficiency of the algorithm, the PLP implementation still met the real-time criterion

(34 ms available between sample windows).


5.2.2. AR Filter Implementation

The autoregressive filter was first implemented on the GPU. Preliminary results

showed an execution time of almost 20 ms, which was unexpected given the small

amount of computation necessary. The bottleneck turned out to be the numerous

memory transfers between host and device memory needed to regulate the algorithm

flow, showing that this particular algorithm is ill suited to GPU implementation. A

CPU implementation was therefore done instead, with considerable speed improvement.

Since the autoregressive filter relies on an estimate of the input signal autocorrelation,

fast computation of the autocorrelation lags was done on the GPU. By the

Wiener-Khinchin theorem, the autocorrelation of a discrete sequence can be computed by

multiplying the sequence's Fourier transform by its complex conjugate and inverse

transforming the result. This proved convenient, since the AR filter is computed after the

PLP filter, which already uses FFT convolution for output signal filtering.

Once the autocorrelation lags are available, the coefficients of the AR filter are

estimated using matrix inversion. The GNU Scientific Library, which provides optimized

linear algebra routines for general-purpose processors, was used for this step. Inversion

of the 30x30 matrix took 0.6 ms on a 3.0 GHz Intel Core i3 development computer with

8 GB of RAM.


5.3. Numerical Accuracy in CUDA Implementation

Small deviations in numerical accuracy were observed in the CUDA implementation.

A comparison of signal whitening using MATLAB and CUDA is given below.

Table 5-2 Comparison of Signal Whitening using MATLAB and CUDA

                 RAPW    Kurtosis   SFM
Input            12.7    244.6      0.024
MATLAB Output    605.4   21.4       0.798
CUDA Output      198.4   36.7       0.755

MATLAB computations were done in double-precision floating point, while the CUDA

implementation used single precision; double-precision capability is available on the

GPU at a higher cost in memory and computation speed. Nonetheless, the

implementation proved its efficacy when compared to the MATLAB implementation.

Further confirmation of the algorithm's efficacy is given in the figure below, which

compares the MATLAB and CUDA outputs of the cascaded 3-tap PLP and AR filters

using pre-whitening. Overall, the spectral peaks are very close to each other. On the

other hand, the CUDA implementation seems to have produced higher suppression of

some high-frequency periodic components and a slightly different AR spectrum in the

highest frequency range (8-16 kHz).


Figure 5-1 Comparison of Cascade Residual Spectrum using MATLAB and CUDA

5.4. Problems Encountered

A few problems were encountered in the GPU implementation, most of them

related to unfamiliarity with GPU programming. Runtime breakpoints were

unavailable because only a single GPU was present in the development computer.

Debugging therefore took a black-box approach, with numerous data sets output from

the program and imported into MATLAB for verification.

A few issues related to the CUDA language itself. The CUDA language

extensions do not provide sufficient protection against incorrect pointer dereferencing.

Since memory was allocated in both GPU memory and CPU memory, pointer

dereferencing was a sensitive operation: GPU code cannot access CPU memory

directly, and the same is true for CPU code that attempts to access GPU memory.

An attempt to incorrectly dereference memory halts program operation.

NVIDIA provides memory transfer functions that copy data to or from host and

GPU memory given the data size and type. Therefore, prefixes were used on pointer

variables to denote where the data resides, in order to prevent incorrect memory

dereferencing (e.g. D_PLPcoeff and H_PLPcoeff for pointers to data in device and

host memory, respectively).

The wide availability of NVIDIA GPUs made access to GPU programming trivial.

The algorithm was first prototyped on a 2008 model Apple MacBook Pro laptop.

Initial results showed that the PLP algorithm could execute in 80 ms, with the

bottleneck in device memory accesses. A low-cost (approximately $150 USD) video

card was then sourced locally which, as stated above, has an 86.4 GB/sec device

memory bandwidth. Bandwidth tests on the MacBook Pro's NVIDIA GPU showed a

memory bandwidth of only 1 GB/sec.

When transferring data between host and GPU memory, pinned memory

allocation was necessary. Pinned memory refers to a memory area allocated on the

host that the operating system cannot move to paged memory (located on the hard

drive). This is necessary in order to maximize throughput during memory transfers

between host and GPU memory. NVIDIA provides custom malloc() and free() style

functions for this purpose.


Chapter 6 - Conclusions

6.1. Filters Performance

Results from the MATLAB simulation have shown that the AR filter is effective

at overall whitening of the signal spectrum. The dynamic range of the spectrum is

greatly reduced in all input cases.

On the other hand, the PLP filter is not very effective at suppressing higher

harmonics in recorded signals. This is due to the high number of harmonics present in

recorded signals, which appear not to be exact integer multiples of the fundamental;

this is why the filter could not suppress them as well as in the synthesized test case.

Using the current configuration, comparable results are achieved on polyphonic

input signals. This is important due to the predominance of polyphony in music.

Real-time implementation of the AR filter is feasible for both DSP processors and

GPUs. However, the computational complexity of the PLP filter is too large for the

DSP processor. The GPU architecture proved to be well suited for the PLP filter

implementation.

Further improvements to audio signal whitening can be made in the pitch filter.

The computational power afforded by the GPU can accommodate a combined

search-and-adaptive technique for suppressing the tonal components of the input

signal. Nonetheless, this work showed that a real-time implementation of the AFC

algorithm is possible.


6.2. Development Cost

Overall, the GPU implementation is the most cost effective in terms of time and

hardware costs. The video card used in the implementation is a gaming-level GPU,

whose cost is, in a way, subsidized by the volume of sales in the PC gaming industry.

Concretely, the video card was sourced locally for approximately $150 USD, the PC

itself was purchased for about $600 USD, and the CUDA SDK is provided at no cost.

The main drawback of using GPUs is their high power consumption. The GPU

alone is rated for 150 W of thermal power dissipation, and although the CPU was not

used extensively during runtime, its power dissipation should be considered as well. It

is worth mentioning that Intel Core i-series processors include an integrated GPU;

however, no SDK for it is currently available.

Development took approximately three weeks. The author is an experienced

C/C++ programmer but had no previous parallel programming experience. The code

was written to mirror the DSP algorithm, with very few parallel-programming

optimizations.

6.3. Final Remarks

The availability of massively parallel architectures in GPUs provides a cost-effective

development environment for suitable algorithms. This implementation of whitening

filters was made possible in real time only through the use of a GPU. Therefore,

further use of GPUs in real-time DSP applications should be investigated.


References

[1] W. Gersch, “Spectral analysis of EEG's by autoregressive spectral decomposition of time series,” Mathematical Biosciences, vol. 7, 1970, pp. 205-222.

[2] E.A. Robinson, “Predictive decomposition of time series with application to seismic exploration,” Geophysics, vol. 32, 1967, p. 418.

[3] R.P. Ramachandran and P. Kabal, “Pitch prediction filters in speech coding,” IEEE Transactions on Acoustics Speech and Signal Processing, vol. 37, 1989, pp. 467-478.

[4] T. Van Waterschoot and M. Moonen, “Adaptive feedback cancellation for audio applications,” Signal Processing, vol. 89, 2009, pp. 2185-2201.

[5] T. Van Waterschoot and M. Moonen, “Fifty Years of Acoustic Feedback Control: State of the Art and Future Challenges,” Proceedings of the IEEE, vol. PP, 2010, pp. 1-40.

[6] T. Van Waterschoot and M. Moonen, “Assessing the acoustic feedback control performance of adaptive feedback cancellation in sound reinforcement systems,” Signal Processing, 2009, pp. 1997-2001.

[7] J. Makhoul, “Linear prediction: A tutorial review,” Proceedings of the IEEE, vol. 63, 1975, pp. 561-580.

[8] Y. Qian, G. Chahine, and P. Kabal, “Pseudo-multi-tap pitch filters in a low bit-rate CELP speech coder,” Speech Communication, vol. 14, 1994, pp. 339-358.

[9] G.U. Yule, “On a Method of Investigating Periodicities in Disturbed Series, with Special Reference to Wolfer's Sunspot Numbers,” Phil. Trans., vol. 226, 1927, pp. 267-298.

[10] G. Zelniker and F.J. Taylor, Advanced Digital Signal Processing: Theory and Applications (Electrical Engineering & Electronics), CRC Press, 1993.

[11] M. Dehoon, T. Vanderhagen, H. Schoonewelle, and H. Van Dam, “Why Yule-Walker should not be used for autoregressive modelling,” Annals of Nuclear Energy, vol. 23, 1996, pp. 1219-1228.


[12] T. Van Waterschoot and M. Moonen, “Comparison of Linear Prediction Models for Audio Signals,” EURASIP Journal on Audio Speech and Music Processing, vol. 2008, 2008, pp. 1-25.

[13] T.I. Laakso, V. Välimäki, M. Karjalainen, and U.K. Laine, “Splitting the Unit Delay: Tools for Fractional Delay Filter Design,” IEEE Signal Processing Magazine, vol. 13, 1996, pp. 30-60.

[14] J. Makhoul, “Linear prediction: A tutorial review,” Proceedings of the IEEE, vol. 63, 1975, pp. 561-580.

[15] K.P. Balanda and H.L. MacGillivray, “Kurtosis: A Critical Review,” American Statistician, vol. 42, 1988, p. 111.

[16] A. Gray and J. Markel, “A Spectral-Flatness Measure for Studying the Autocorrelation Method of Linear Predication of Speech Analysis,” IEEE Transactions on Acoustics Speech and Signal Processing, vol. 22, 1974, pp. 207-217.

[17] “Piano key frequencies - Wikipedia, the free encyclopedia.” [online] http://en.wikipedia.org/wiki/Piano_key_frequencies (accessed: June 20, 2011).

[18] “eins - Wiktionary.” [online] http://en.wiktionary.org/wiki/eins (accessed: June 20, 2011).

[19] K. Karplus and A. Strong, “Digital synthesis of plucked-string and drum timbres,” Computer Music Journal, vol. 7, 1983, pp. 43-55.

[20] C. Collomb, “Linear Prediction and the Levinson-Durbin algorithm,” 2009, pp. 1-7.

[21] “Gauss-Jordan Inversion of a Matrix,” October 1998, pp. 1-4.

[22] J. Xu, A. Ariyaeeinia, and R. Sotudeh, “Migrate Levinson-Durbin based Linear Predictive Coding algorithm into FPGAS,” 2005 12th IEEE International Conference on Electronics, Circuits and Systems, Dec. 2005, pp. 1-4.

[23] L. Savioja, V. Valimaki, and J. Smith, “Audio Signal Processing Using Graphics Processing Units,” submitted to Journal of the Audio Engineering Society, 2010, pp. 3-19.


Appendix A – DSP Implementation Code Listing

1. /////////////////////////////////////////////////////////////////////////////      2. //      3. //    Whitening  Filters  Implementation      4. //    on  Analog  Devices  SHARC  ADPS-­‐2169      5. //      6. //    Omer  Osman      7. //    June  2011      8. //      9. /////////////////////////////////////////////////////////////////////////////      10.      11.      12. #include  "btc.h"      13. #include  "signal.h"      14. #include  <cdef21369.h>      15. #include  <def21369.h>      16. #include  <signal.h>      17. asm("#include  <def21369.h>");      18. #include  <SRU.h>      19. #include  <sysreg.h>      20.      21. #define  DATA_BUF_SIZE              8      22. #if  __ADSP21369__      23.        #define  ARRAY_SIZE                    0x2000      24.        #define  DATA_ARRAY_STRING      "Data  Array  (8kw)"      25. #endif      26.      27. #include  <cycle_count.h>      28. #include  <cycles.h>      29. #include  <stdio.h>      30. #include  <vector.h>      31. #include  <stats.h>      32. #include  <matrix.h>      33. #include  <filter.h>      34. #include  "initPLL_SDRAM.c"      35.      36. ///  MATLAB  DATA  ///      37. #include  "note.dat"      38. //#include  "noise.dat"      39. //#include  "speech.dat"      40. #include  "IFw_taps.dat"      41. #include  "polyphase.dat"      42.      43. ///  DEFINITIONS  ///      44. #define  M                                              1024      45. #define  P                                              1024      46. #define  kmin                                        44      47. #define  kmax                                        441      48. #define  na                                            456      49. #define  polyphase_coefficients    161      50. #define  I                                              8      51. #define  AR_order                                30      52. #define  frac_d_window                      M      53. #define  interp_half_order              10      54. 
#define  interp_order                        2*interp_half_order      55.      56.      

Page 69: Real-Time Graphics Processing Unit Implementation

60

57. ///  extern  GLOBAL  VARIABLES  ///      58. extern  void  InitPLL_SDRAM(void);      59. extern  const  float  pm  note  [M];      60. //extern  const  float  dm  noise  [M];      61. //extern  const  float  dm  speech  [M];      62. extern  float  pm  polyphase_coeff  [161];      63.      64.      65.      66. ////////////////////////////      67. //  Variable  Definitions      68. ////////////////////////////      69. int  timerCounter  =  0;      70. int  dataVal  =  0x11223344;      71. int  dataBuf[DATA_BUF_SIZE]  =  {0x11223344,0x55667788,0x99aabbcc,0xddeeff00,      72.                                                            0x55555555,0x66666666,0x77777777,0x88888888};      73. int  array1[ARRAY_SIZE];      74.      75. float        76.        ARcoeff  [AR_order+1];      77. float      78.        PLPcoeff  [na+1];      79. float      80.        fracDelay[M];      81.      82. float      83.        fir_input[(frac_d_window+10)*I];                      84. float      85.        fir_output[(frac_d_window+10)*I];      86.      87.      88. //  PLP      89. float      90.        bestMSE[3];      91. float  pm      92.        prediction_error  [kmax+M+interp_half_order+1];      93. float        94.        response  [kmax+interp_half_order+2];              95. float  pm      96.        responseR  [kmax+interp_half_order+2];            97. float  pm      98.        PLPiR  [kmax+interp_half_order+2];      99.                      100.      101. float      102.        x_0  [M-­‐kmin];      103. float      104.        x_Mmin  [M-­‐kmin];              105. float      106.        x_M  [M-­‐kmin];            107. float      108.        x_Mplus  [M-­‐kmin];                            109. float      110.        a_0[3];      111. float      112.        a_M[3];      113. float      114.        taps[3];      115.                              116. float      117.        delta;      

Page 70: Real-Time Graphics Processing Unit Implementation

61

118. float      119.        Rxx    [AR_order+1];      120. float      121.        kappa;      122. float      123.        kappa2;      124. float      125.        sigma2;      126. double      127.        status  [2];      128.      129. ////////////////////////      130. //  Function  Prototypes      131. ////////////////////////      132. void  initInterrupts(void);      133. void  initTimer(void);      134.      135. void  GPTimer0_isr(int  signal);      136.      137. void  findARcoeff  (const  float*  X,  float*  ARcoeff);      138. void  findPLPcoeff  (const  float[],  float*  PLPcoeff);      139. void  frac_delay  (const  float*  X);      140. void  one_tap_polyphasePLP  (const  float*  X);      141. void  three_tap_polyphasePLP  (const  float*  X);      142.      143.      144. //  ASSEMBLY  ROUTINE  FOR  RECIPROCAL  DIVISION  //      145. //  FROM  SHARC  21369  PROGRAMMING  MANUAL  //      146. //  CAN  BE  EXPANDED  FOR  HIGHER  PRECISION  //      147. //  REF-­‐  NEWTON-­‐RAPHSON  METHOD  //      148. /*    149. .global  _fp_division;    150. _fp_division:    151.        F0=%1;"                                  //  numerator    152.        F12=%2;"                                        //  denominator    153.        F11=%3;"                                        //  2.0    154.        F0=RECIPS  F12,  F7=F0;"    //  {Get  8  bit  seed  R0=1/D}      155.        F12=F0*F12;"                                //  {D'  =  D*R0}      156.        F7=F0*F7,  F0=F11-­‐F12;"    //  {F0=R1=2-­‐D',  F7=N*R0}      157.        F12=F0*F12;"                                //  {F12=D'-­‐D'*R1}    158.        F7=F0*F7,  F0=F11-­‐F12;"    //  {F7=N*R0*R1,  F0=R2=2-­‐D'}      159.        F12=F0*F12;"                                //  {F12=D'=D'*R2}      160.        RTS(DB);    161.        F7=F0*F7,  F0=F11-­‐F12;"    //  {F7=N*R0*R1*R2,  F0=R3=2-­‐D'}      162.        F0=F0*F7;    163.        %0=F0*F7;"                            //  {F7=N*R0*R1*R2*R3}    164.        
"=F"  (sigma2)    165.                :  "F"  (x),  "F"  (y),  "F"  (z)    166.                :  "F0",  "F12",  "F11",  "F7");    167. */                  168.                      169. ////////////////////      170. //  BACKGROUND  TELEMETRY  CHANNEL  Definitions      171. ////////////////////      172. BTC_MAP_BEGIN      173. //                          Channel  Name,                          Starting  Address,        Length      174. BTC_MAP_ENTRY("Timer  Interrupt  Counter",  (long)&timerCounter,  sizeof(tim

erCounter))      175. BTC_MAP_ENTRY("Constant  Data  Value",          (long)&dataVal,            sizeof(dat

aVal))      

Page 71: Real-Time Graphics Processing Unit Implementation

62

176. BTC_MAP_ENTRY("Constant  Data  Buffer",        (long)dataBuf,              sizeof(dataBuf))      

177. BTC_MAP_ENTRY(DATA_ARRAY_STRING,                  (long)array1,                sizeof(array1))      

178. BTC_MAP_ENTRY("Delta",                                      (long)&delta,                sizeof(delta))      

179. BTC_MAP_ENTRY("Sigma2",                                    (long)&sigma2,              sizeof(sigma2))      

180. BTC_MAP_ENTRY("Rxx",                                          (long)Rxx,                      sizeof(Rxx))      

181. BTC_MAP_ENTRY("kappa",                                      (long)&kappa,                sizeof(kappa))      

182. BTC_MAP_ENTRY("kappa2",                                    (long)&kappa2,              sizeof(kappa2))      

183. BTC_MAP_ENTRY("ARCoefficients",                    (long)ARcoeff,              sizeof(ARcoeff))      

184. BTC_MAP_ENTRY("FracDelay",                              (long)fracDelay,          sizeof(fracDelay))      

185. BTC_MAP_ENTRY("ASTATx  ASTATy",                      (long)status,                sizeof(status))      

186. BTC_MAP_END      187.      188. ///////////////////      189. //    Main  Program      190. ///////////////////      191. int  main()      192. {      193.        InitPLL_SDRAM();      194.              195.        //sysreg_bit_clr(sysreg_MMASK,  PEYEN);  //  ENABLE  SECOND  ALU        196.        //sysreg_bit_set(sysreg_MODE1,  PEYEN);    //  set  Processor  Element  Y  (

SIMD)  enable      197.        //sysreg_bit_clr(sysreg_MODE1,  RND32);    //  set  IEEE-­‐754  32-­‐

bit  Floating  Point        198.        sysreg_bit_set(sysreg_MODE1,  CBUFEN);  //  set  hardware  circular  buffe

r  enable      199.        //sysreg_bit_clr(sysreg_MODE1,  TRUNC);    //  set  truncation  mode  to  ne

arest      200.        ///sysreg_bit_set(sysreg_MODE1,  NESTM);    //  set  nested  multiple  inte

rrupts  enable      201.        //sysreg_bit_set(sysreg_MODE1,  IRPTEN);  //  set  interrupt  enable      202.              203.      204.        int  addr,  len;      205.        addr  =  BTC_CHANNEL_ADDR(0);      206.        len    =  BTC_CHANNEL_LEN(0);      207.      208.        for(int  i  =  0;  i  <  ARRAY_SIZE;  ++i)      209.        {      210.                array1[i]  =  i;      211.        }      212.      213.        //  initialize        214.        btc_init();      215.      216.        initTimer();      217.        interrupt(SIG_EMUL,      btc_isr);      218.              219.        cycle_t        220.                start_count;        221.        cycle_t        

Page 72: Real-Time Graphics Processing Unit Implementation

63

222.                final_count;        223.                              224.        //  profiling  functionality  //      225.        //START_CYCLE_COUNT(start_count);      226.                      227.        findARcoeff  (speech,  ARcoeff);      228.              229.        //STOP_CYCLE_COUNT(final_count,start_count);        230.        //PRINT_CYCLES("Number  of  cycles  for  AR  Filter:  ",final_count);        231.              232.        findPLPcoeff  (note,  PLPcoeff);      233.              234.        initInterrupts();      235.              236.        //while(1);      237.      238. }      239.      240.      241. void  initInterrupts()      242. {      243.        interrupt(SIG_P2,  GPTimer0_isr);      244. }      245.      246.      247. void  initTimer()      248. {      249.      250.        *pTM0CTL  =  TIMODEPWM  |  PRDCNT  |  IRQEN;            //  configure  the  timer      251.        *pTM0PRD  =  0x00800000;                                            //  timer  period      252.        *pTM0W  =  1;                                                                  //  timer  width      253.        *pTMSTAT  =  BIT_8;                                                      //  enable  the  timer      254. }      255.      256.      257. void  GPTimer0_isr(int  signal)      258. {      259.        //  clear  timer  interrupt  status      260.        *pTMSTAT  =  TIM0IRQ;      261.      262.        ++timerCounter;                          //  count  number  of  timer  interrupts      263.        array1[0]  =  timerCounter;      //  reflect  count  in  first  location  of  ar

ray1      264.      265.        //  toggle  LED1  on  the  EZ-­‐Kit      266.        asm("bit  tgl  flags  FLG4;");  //light  LED  1      267.      268. }      269.      270. //  AR  coefficient  estimation  using  levinson  durbin      271. void  findARcoeff  (const  float*  X,  float*  Coeffs)      272. {      273.      274. //  moved  as  global  vars      275. //    float      276. //            Rxx  [AR_order+1];            277. //    float      278. //            alpha  [AR_order];      279. //    float      280. //            delta;      281. //    float      

Page 73: Real-Time Graphics Processing Unit Implementation

64

282. //            sigma2;      283. //    float        284. //            kappa2;      285.        float      286.                z  =  2.0;      287.      288.      289.        autocorrf(  Rxx,  X,  M,  AR_order+1  );          //  21369  library  function      

     290.              291.        sigma2  =  Rxx[0];      292.        Coeffs[0]  =  1;          293.      294.        for  (int  m=0;  m  <  AR_order;  m++)      295.        {      296.                delta  =  0;                  297.                for  (int  j=0;  j  <=  m;  j++)      298.                        delta  +=  Coeffs[j]  *  Rxx[(m-­‐j)+1];                  299.                kappa  =  -­‐(delta/sigma2);      300.                      301.                //  alternative  asm  func  for  kappa      302. //            asm  ("F0=%1;"                                      //  numerator      303. //                      "F12=%2;"                                    //  denominator      304. //                      "F11=%3;"                                    //  2.0      305. //                      "F0=RECIPS  F12,  F7=F0;"        //  {Get  8  bit  seed  R0=1/D}        306. //                      "F12=F0*F12;"                            //  {D'  =  D*R0}        307. //                      "F7=F0*F7,  F0=F11-­‐F12;"        //  {F0=R1=2-­‐D',  F7=N*R0}        308. //                      "F12=F0*F12;"                            //  {F12=D'-­‐D'*R1}      309. //                      "F7=F0*F7,  F0=F11-­‐F12;"        //  {F7=N*R0*R1,  F0=R2=2-­‐D'}        310. //                      "F12=F0*F12;"                            //  {F12=D'=D'*R2}        311. //                      "F7=F0*F7,  F0=F11-­‐F12;"        //  {F7=N*R0*R1*R2,  F0=R3=2-­‐

D'}        312. //                      "%0=F0*F7;"                                //  {F7=N*R0*R1*R2*R3}      313. //                    :  "=F"  (kappa)      314. //                    :  "F"  (delta),  "F"  (sigma2),  "F"  (z)      315. //                    :  "F0",  "F12",  "F11",  "F7");      316.                              317.                kappa2  =  kappa*kappa;      318.                sigma2  -­‐=  sigma2*kappa2;                      319.                status[0]  =  sysreg_read  (sysreg_STKY);      320.                status[1]  =  sysreg_read  (sysreg_STKYY);      321.                for  (int  k=0;  k  <=  m;  k++)      322.                        Coeffs[k+1]  =  Coeffs[k+1]  +  (kappa  *  Coeffs[m-­‐

k]);                                          323.        }            324.              325.              326.        return;      327. }      328.      329.      330. //  AR  coeff  estimation  using  gauss-­‐jordan  matrix  inversion      331. void  findARcoeff  (const  float*  X,  float*  Coeffs)      332. {      333.        float      334.                matrix  [AR_order][AR_order];      335.        float      336.                invmat  [AR_order][AR_order];      337.        float      338.                p[AR_order];      339.        int      

Page 74: Real-Time Graphics Processing Unit Implementation

65

        i, j, k;
    float
        x = 0.011304,
        y = 0.011595,
        z = 2.0;

    // 21369 library func for autocorr for lags AR_order+1 taps
    autocorrf (Rxx, X, M, AR_order+1);

    // Toeplitz matrix
    for (i = 0; i < AR_order; ++i)
    {
        for (j = i; j < AR_order; ++j)
        {
            matrix[i][j] = Rxx[j-i];
            matrix[j][i] = Rxx[j-i];
        }
    }

    // 21369 library function for Gauss-Jordan method
    matinvf ((float*)invmat, (float*)matrix, AR_order);

    for (i = 0; i < AR_order; ++i)
        p[i] = -Rxx[i+1];


    // 21369 library func for matrix multiplication;
    // result is written starting at Coeffs[1] (Coeffs+1, not
    // Coeffs+sizeof(float), which would skip four floats)
    matmmltf (Coeffs+1, (const float*)invmat, (const float*)p, AR_order, AR_order, 1);

    Coeffs[0] = 1.0;

    return;
}




void findPLPcoeff (const float X[], float* Coeff)
{
    int
        bestMSE[3];  // 0 = MSE, 1 = bulk delay, 2 = fractional phase
//  float
//      polyphase[8][M-10];
/*  float
        interp_output[M];
    float
        fir_input[4096];
*/
    int
        j = 0;

    // profiling //
    cycle_t
        start_count;
    cycle_t
        final_count;


    START_CYCLE_COUNT(start_count);


//  CYCLES_INIT(stats);

    one_tap_polyphasePLP (X);
    //three_tap_polyphasePLP (X);

//  CYCLES_PRINT(stats);
//  CYCLES_RESET(stats);

    STOP_CYCLE_COUNT(final_count, start_count);
    PRINT_CYCLES("Number of cycles for polyphase FIR+PLP: ", final_count);

    return;
}


void frac_delay (const float X[])
{
    float
        state[polyphase_coefficients];
//  float
//      fir_input[(frac_d_window+10)*I];
/*  float
        fir_output[(frac_d_window+10)*I];
    float
        interp_output[frac_d_window];
*/  int
        j = 0;

    for (j = 0; j < frac_d_window; ++j)
        fir_input[j*I] = X[j];

    for (j = 0; j < polyphase_coefficients; ++j)
        state[j] = 0.0f;


    // polyphase FIR
    fir (fir_input, fir_output, polyphase_coeff, state, (frac_d_window+10)*I, polyphase_coefficients-1);

    // output data into separate variables no longer used
/*
    for (j = 0; j < frac_d_window; ++j)
    {
        interp_output_1[j] = fir_output[(10+j)*I]*8;
        interp_output_2[j] = fir_output[(10+j)*I+1]*8;
        interp_output_3[j] = fir_output[(10+j)*I+2]*8;
        interp_output_4[j] = fir_output[(10+j)*I+3]*8;
        interp_output_5[j] = fir_output[(10+j)*I+4]*8;
        interp_output_6[j] = fir_output[(10+j)*I+5]*8;
        interp_output_7[j] = fir_output[(10+j)*I+6]*8;
    }
*/
    return;
}

// 3-tap PLP - suboptimal search based on 1-tap MSE
void one_tap_polyphasePLP (const float* X)
{
    int
        f, j, k, l;


    int
        size, MSE;
    float
        state[kmax+interp_half_order+2];

    float
        R_0M, R_MM, tap;

    float
        matrix[3][3];
    float
        invmat[3][3];

    float
        b[3];

    cycle_t
        start_count;
    cycle_t
        final_count;


//  START_CYCLE_COUNT(start_count);

    frac_delay (X);

//  STOP_CYCLE_COUNT(final_count, start_count);
//  PRINT_CYCLES("Number of cycles for polyphase FIR: ", final_count);


    for (k = kmin; k < kmax; ++k)    // bulk delay
    {
        size = M-k;

        for (j = 0; j < size; ++j)
            x_0[j] = X[j+k];


        for (f = 0; f < I; ++f)      // fractional phase
        {
            switch (f) {
                case 0:
                    for (j = 0; j < size; ++j)
                        x_M[j] = X[k+j];
                    break;
                case 1:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I]*8;    // frac delay 1
                    break;
                case 2:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+1]*8;  // frac delay 2
                    break;
                case 3:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+2]*8;  // etc
                    break;
                case 4:
                    for (j = 0; j < size; ++j)


                        x_M[j] = fir_output[(10+k+j)*I+3]*8;
                    break;
                case 5:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+4]*8;
                    break;
                case 6:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+5]*8;
                    break;
                case 7:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+6]*8;
                    break;
            }

            // one-tap PLP calculation
            R_0M = vecdotf (x_0, x_M, size);
            R_MM = vecdotf (x_M, x_M, size);

            tap = R_0M/R_MM;

            for (j = 0; j < k+interp_order+2; ++j)
                response[j] = 0.0;

            response[0] = 1;
            for (j = k-interp_half_order+1; j < k+interp_half_order+1; ++j)
                response[j] = -tap*IFw_taps[f][j-k+interp_half_order];

            for (j = 0; j < k+interp_order+1; ++j)
                responseR[j] = response[(k+interp_order)-j];

            for (j = 0; j < kmax+interp_half_order+2; ++j)
                state[j] = 0.0f;

            // MOST time consumption is in this line of code
            START_CYCLE_COUNT(start_count);
            fir (X, prediction_error, responseR, state, M, k+interp_order+1);
            STOP_CYCLE_COUNT(final_count, start_count);
            PRINT_CYCLES("Number of cycles : ", final_count);

            // MSE calc
            MSE = vecdotf(prediction_error, prediction_error, M);

            if (MSE <= 0)
                MSE = 10000;    // null condition

            if (bestMSE[0] > MSE || bestMSE[2] == 0)
            {
                bestMSE[0] = MSE;
                bestMSE[1] = k;
                bestMSE[2] = f;
            }
        }
    }


    // 3-tap coefficients calc based on best 1-tap PLP
    k = bestMSE[1];
    f = bestMSE[2];
    size = M-k;

    for (j = 0; j < size; ++j)
        x_0[j] = X[j+k+1];


    for (j = 0; j < size; ++j)
    {
        x_Mmin[j]  = fir_output[(10+k+j-1)*I+f-1]*8;
        x_M[j]     = fir_output[(10+k+j)*I+f-1]*8;
        x_Mplus[j] = fir_output[(10+k+j+1)*I+f-1]*8;
    }

    a_0[0] = vecdotf(x_0, x_Mmin, size);
    a_0[1] = vecdotf(x_0, x_M, size);
    a_0[2] = vecdotf(x_0, x_Mplus, size);
    a_M[0] = vecdotf(x_Mmin, x_Mmin, size);
    a_M[1] = vecdotf(x_Mmin, x_M, size);
    a_M[2] = vecdotf(x_Mmin, x_Mplus, size);


    for (l = 0; l < 3; ++l)
    {
        for (j = l; j < 3; ++j)
        {
            matrix[l][j] = a_M[j-l];
            matrix[j][l] = a_M[j-l];
        }
    }

    b[0] = a_0[0];
    b[1] = a_0[1];
    b[2] = a_0[2];

    matinvf ((float*)invmat, (float*)matrix, 3);

    matmmltf ((float*)taps, (const float*)invmat, (const float*)b, 3, 3, 1);

    for (j = 0; j < k+interp_order+2; ++j)
        PLPiR[j] = 0.0;

    PLPiR[0] = 1;
    for (j = k-interp_half_order; j < k+interp_half_order; ++j)    // first tap
        PLPiR[j] = -taps[0]*IFw_taps[f][j-k+interp_half_order];
    for (j = k-interp_half_order; j < k+interp_half_order; ++j)    // 2nd
        PLPiR[j+1] -= taps[1]*IFw_taps[f][j-k+interp_half_order];
    for (j = k-interp_half_order; j < k+interp_half_order; ++j)    // 3rd
        PLPiR[j+2] -= taps[2]*IFw_taps[f][j-k+interp_half_order];

    for (j = 0; j < k+interp_order+1; ++j)
        responseR[j] = PLPiR[(k+interp_order)-j];

    for (j = 0; j < kmax+interp_half_order+2; ++j)
        state[j] = 0.0f;


    fir (X, prediction_error, responseR, state, M, k+interp_order+1);

    sigma2 = vecdotf(prediction_error, prediction_error, M);

    return;
}


// exhaustive-search PLP
void three_tap_polyphasePLP (const float* X)
{
    int
        f, j, k, l;
    int
        size, MSE;
    float
        state[kmax+interp_half_order+2];

    float
        R_0M, R_MM, tap;

    float
        matrix[3][3];
    float
        invmat[3][3];

    float
        b[3];

    frac_delay (X);

    for (k = kmin; k < kmax; ++k)    // bulk delay
    {
        size = M-k;

        for (j = 0; j < size; ++j)
            x_0[j] = X[j+k];


        for (f = 0; f < I; ++f)      // fractional phase
        {
            switch (f) {
                case 0:
                    for (j = 0; j < size; ++j)
                        x_M[j] = X[k+j];
                    break;
                case 1:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I]*8;    // frac delay 1
                    break;
                case 2:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+1]*8;
                    break;
                case 3:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+2]*8;
                    break;
                case 4:


                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+3]*8;
                    break;
                case 5:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+4]*8;
                    break;
                case 6:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+5]*8;
                    break;
                case 7:
                    for (j = 0; j < size; ++j)
                        x_M[j] = fir_output[(10+k+j)*I+6]*8;
                    break;
            }

            R_0M = vecdotf (x_0, x_M, size);
            R_MM = vecdotf (x_M, x_M, size);

            tap = R_0M/R_MM;

            for (j = 0; j < k+interp_order+2; ++j)
                response[j] = 0.0;

            response[0] = 1;
            for (j = k-interp_half_order+1; j < k+interp_half_order+1; ++j)
                response[j] = -tap*IFw_taps[f][j-k+interp_half_order];

            for (j = 0; j < k+interp_order+1; ++j)
                responseR[j] = response[(k+interp_order)-j];

            for (j = 0; j < kmax+interp_half_order+2; ++j)
                state[j] = 0.0f;

            fir (X, prediction_error, responseR, state, M, k+interp_order+1);

            MSE = vecdotf(prediction_error, prediction_error, M);

            // 3-tap filter response
            size = M-k;

            for (j = 0; j < size; ++j)
                x_0[j] = X[j+k+1];


            for (j = 0; j < size; ++j)
            {
                x_Mmin[j]  = fir_output[(10+k+j-1)*I+f-1]*8;
                x_M[j]     = fir_output[(10+k+j)*I+f-1]*8;
                x_Mplus[j] = fir_output[(10+k+j+1)*I+f-1]*8;
            }

            a_0[0] = vecdotf(x_0, x_Mmin, size);
            a_0[1] = vecdotf(x_0, x_M, size);
            a_0[2] = vecdotf(x_0, x_Mplus, size);
            a_M[0] = vecdotf(x_Mmin, x_Mmin, size);


            a_M[1] = vecdotf(x_Mmin, x_M, size);
            a_M[2] = vecdotf(x_Mmin, x_Mplus, size);


            for (l = 0; l < 3; ++l)
            {
                for (j = l; j < 3; ++j)
                {
                    matrix[l][j] = a_M[j-l];
                    matrix[j][l] = a_M[j-l];
                }
            }

            b[0] = a_0[0];
            b[1] = a_0[1];
            b[2] = a_0[2];

            matinvf ((float*)invmat, (float*)matrix, 3);

            matmmltf ((float*)taps, (const float*)invmat, (const float*)b, 3, 3, 1);

            for (j = 0; j < k+interp_order+2; ++j)
                PLPiR[j] = 0.0;

            PLPiR[0] = 1;
            for (j = k-interp_half_order; j < k+interp_half_order; ++j)    // first tap
                PLPiR[j] = -taps[0]*IFw_taps[f][j-k+interp_half_order];
            for (j = k-interp_half_order; j < k+interp_half_order; ++j)    // 2nd
                PLPiR[j+1] -= taps[1]*IFw_taps[f][j-k+interp_half_order];
            for (j = k-interp_half_order; j < k+interp_half_order; ++j)    // 3rd
                PLPiR[j+2] -= taps[2]*IFw_taps[f][j-k+interp_half_order];

            for (j = 0; j < k+interp_order+1; ++j)
                responseR[j] = PLPiR[(k+interp_order)-j];

            for (j = 0; j < kmax+interp_half_order+2; ++j)
                state[j] = 0.0f;

            fir (X, prediction_error, responseR, state, M, k+interp_order+1);

            MSE = vecdotf(prediction_error, prediction_error, M);


            if (MSE <= 0)
                MSE = 10000;

            if (bestMSE[0] > MSE || bestMSE[2] == 0)
            {
                bestMSE[0] = MSE;
                bestMSE[1] = k;
                bestMSE[2] = f;
            }
        }


    }

    // 3-tap filter response
    k = bestMSE[1];
    f = bestMSE[2];
    size = M-k;

    for (j = 0; j < size; ++j)
        x_0[j] = X[j+k+1];


    for (j = 0; j < size; ++j)
    {
        x_Mmin[j]  = fir_output[(10+k+j-1)*I+f-1]*8;
        x_M[j]     = fir_output[(10+k+j)*I+f-1]*8;
        x_Mplus[j] = fir_output[(10+k+j+1)*I+f-1]*8;
    }

    a_0[0] = vecdotf(x_0, x_Mmin, size);
    a_0[1] = vecdotf(x_0, x_M, size);
    a_0[2] = vecdotf(x_0, x_Mplus, size);
    a_M[0] = vecdotf(x_Mmin, x_Mmin, size);
    a_M[1] = vecdotf(x_Mmin, x_M, size);
    a_M[2] = vecdotf(x_Mmin, x_Mplus, size);


    for (l = 0; l < 3; ++l)
    {
        for (j = l; j < 3; ++j)
        {
            matrix[l][j] = a_M[j-l];
            matrix[j][l] = a_M[j-l];
        }
    }

    b[0] = a_0[0];
    b[1] = a_0[1];
    b[2] = a_0[2];

    matinvf ((float*)invmat, (float*)matrix, 3);

    matmmltf ((float*)taps, (const float*)invmat, (const float*)b, 3, 3, 1);

    for (j = 0; j < k+interp_order+2; ++j)
        PLPiR[j] = 0.0;

    PLPiR[0] = 1;
    for (j = k-interp_half_order; j < k+interp_half_order; ++j)    // first tap
        PLPiR[j] = -taps[0]*IFw_taps[f][j-k+interp_half_order];
    for (j = k-interp_half_order; j < k+interp_half_order; ++j)    // 2nd
        PLPiR[j+1] -= taps[1]*IFw_taps[f][j-k+interp_half_order];
    for (j = k-interp_half_order; j < k+interp_half_order; ++j)    // 3rd
        PLPiR[j+2] -= taps[2]*IFw_taps[f][j-k+interp_half_order];


    return;
}


Appendix B – GPU Implementation Code Listing

Header File with Algorithm Definitions

/*
 *  Cascade_PLP.h
 *  Cascade_PLP
 *
 *  HEADER FILE WITH DEFINITIONS FOR PLP/AR ALGORITHM
 */

// DEFINITIONS

#ifndef CASCADE_PLP_H
#define CASCADE_PLP_H

#define     DEBUG_ON                0
#define     VERBOSE                 0
#define     TIMING                  0

#define     M                       2048
#define     FRAC_DELAYS             8
#define     INTERP_HALF_ORDER       10
#define     K_MIN                   44
#define     K_MAX                   441
#define     POLYPHASE_COEFFS        161
#define     POLYPHASE_PAD           1024
#define     POLYPHASE_BULK_DELAY    80
#define     FILTER_W_SIZE           2048
#define     SIGNAL_SIZE             FRAC_DELAYS*(M)
#define     LP_ORDER                30

// used in batch FFT data vector iteration //
#define     i_INC                   (K_MAX-K_MIN)*FILTER_W_SIZE
#define     j_INC                   FILTER_W_SIZE

#endif
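One hazard worth noting in these definitions: SIGNAL_SIZE and i_INC expand to unparenthesized products, which is safe in the contexts the code uses them in but changes meaning inside a larger expression. A small sketch of the pitfall, with illustrative stand-in macro names rather than the ones from the header:

```c
#include <assert.h>

/* Illustrative macros mirroring SIGNAL_SIZE above: the unparenthesized
 * form expands to 8*(2048) and can bind unexpectedly under precedence. */
#define DEMO_SIZE_UNPAREN  8*(2048)      /* like: FRAC_DELAYS*(M)  */
#define DEMO_SIZE_PAREN    (8*(2048))    /* fully parenthesized    */

int demo(void)
{
    /* '%' binds tighter than the '*' hidden in the macro, so the
     * unparenthesized version computes (1000000 % 8) * 2048. */
    int a = 1000000 % DEMO_SIZE_UNPAREN;
    int b = 1000000 % DEMO_SIZE_PAREN;
    return a == b;  /* 0: the two expansions disagree */
}
```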

Main Driver File

/////////////////////////////////////////////////////////////////////
//
//
//  Whitening Filters Implementation
//  on NVIDIA GPU CUDA FRAMEWORK
//
//  MAIN FILE
//  USES PORTAUDIO LIBRARY for ASIO AUDIO INPUT/OUTPUT
//  MODIFIED CODE BELOW FROM patest_wire.c EXAMPLE
//


//  Omer Osman
//  July 2011
//
//
/////////////////////////////////////////////////////////////////////

/** @file patest_wire.c
    @ingroup test_src
    @brief Pass input directly to output.

    Note that some HW devices, for example many ISA audio cards
    on PCs, do NOT support full duplex! For a PC, you normally need
    a PCI based audio card such as the SBLive.

    @author Phil Burk  http://www.softsynth.com

 While adapting to V19-API, I excluded configs with framesPerCallback=0
 because of an assert in file pa_common/pa_process.c. Pieter, Oct 9, 2003.

*/
/*
 * $Id: patest_wire.c 1368 2008-03-01 00:38:27Z rossb $
 *
 * This program uses the PortAudio Portable Audio Library.
 * For more information see: http://www.portaudio.com
 * Copyright (c) 1999-2000 Ross Bencina and Phil Burk
 *
 * Permission is hereby granted, free of charge, to any person obtaining
 * a copy of this software and associated documentation files
 * (the "Software"), to deal in the Software without restriction,
 * including without limitation the rights to use, copy, modify, merge,
 * publish, distribute, sublicense, and/or sell copies of the Software,
 * and to permit persons to whom the Software is furnished to do so,
 * subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be
 * included in all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
 * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR
 * ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
 * CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
 * WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */

/*
 * The text above constitutes the entire PortAudio license; however,
 * the PortAudio community also makes the following non-binding requests:
 *
 * Any person wishing to distribute modifications to the Software is
 * requested to send the modifications to the original developer so that
 * they can be incorporated into the canonical version. It is also
 * requested that these non-binding requests be included along with the
 * license above.
 */

#include <stdio.h>
#include <math.h>
#include <iostream>


#include <fstream>
#include "portaudio.h"
#include "Cascade.h"
#include <gsl/gsl_linalg.h>
#include <windows.h>

// used by portaudio //
#define SAMPLE_RATE        (44100)
#define HAVE_INLINE

typedef struct WireConfig_s
{
    int isInputInterleaved;
    int isOutputInterleaved;
    int numInputChannels;
    int numOutputChannels;
    int framesPerCallback;
} WireConfig_t;

#define USE_FLOAT_INPUT        (1)
#define USE_FLOAT_OUTPUT       (1)

/* Latencies set to defaults. */

#if USE_FLOAT_INPUT
    #define INPUT_FORMAT   paFloat32
    typedef float INPUT_SAMPLE;
#else
    #define INPUT_FORMAT   paInt16
    typedef short INPUT_SAMPLE;
#endif

#if USE_FLOAT_OUTPUT
    #define OUTPUT_FORMAT  paFloat32
    typedef float OUTPUT_SAMPLE;
#else
    #define OUTPUT_FORMAT  paInt16
    typedef short OUTPUT_SAMPLE;
#endif

double gInOutScaler = 1.0;
#define CONVERT_IN_TO_OUT(in)  ((OUTPUT_SAMPLE) ((in) * gInOutScaler))

#define INPUT_DEVICE       (Pa_GetDefaultInputDevice())
#define OUTPUT_DEVICE      (Pa_GetDefaultOutputDevice())


// semaphore
static volatile unsigned int RDY;
volatile int SINGULAR;

// PLP/AR DATA ARRAYS
volatile float INPUT_ARY [M];
volatile float OUTPUT_ARY [M];
volatile float PLP_INPUT_ARY [M];
volatile float AR_INPUT_ARY [M];
volatile int prev_framesperBuffer;
volatile int curr_framesperBuffer;
float
    AUTOCORR [LP_ORDER+1];
double

Page 86: Real-Time Graphics Processing Unit Implementation

77

133.        b_vect  [LP_ORDER];      134. float      135.        coeff  [LP_ORDER+1];      136. double      137.        matrix  [LP_ORDER*LP_ORDER];      138.      139.      140. //  CUDA  Runtime  Functions  //      141. extern  "C"  void  initPLP  ();      142. extern  "C"  void  initAR  ();      143. extern  "C"  void  delMEM  ();      144. extern  "C"  int  runKernels  (volatile  float*,  volatile  float*,  float*);      145. extern  "C"  void  printDeviceProperties  ();      146. extern  "C"  void  testMemTransferSpeed  ();      147. extern  "C"  void  runFFTConv  (volatile  float*,  volatile  float*,  float*);      148. extern  "C"  void  runPolyphaseFFTConv  (float*);      149.      150.      151. //  HELPERS  //      152. extern  "C"  void  ImportFromFile  (float*);      153. extern  "C"  void  writeToFileTaps  (float*);      154. extern  "C"  void  writeToFileTapVectors  (float*);      155. extern  "C"  void  writeToFileMatrix  (float*);      156. extern  "C"  void  writeToFileOutput  (float*);      157.      158.      159. //  AR  filter  //      160. extern  "C"  void  initCUBLASFunc  ();      161. extern  "C"  void  destroyCUBLASFunc  ();      162. extern  "C"  void  cudaInvertMatrix(unsigned  int,  float  *);      163.      164.      165. //  portaudio  routines  //      166. static  PaError  TestConfiguration(  WireConfig_t  *config  );      167.      168. static  int  wireCallback(  const  void  *inputBuffer,  void  *outputBuffer,      169.                                                  unsigned  long  framesPerBuffer,      170.                                                  const  PaStreamCallbackTimeInfo*  timeInfo,      171.                                                  PaStreamCallbackFlags  statusFlags,      172.                                                  void  *userData  );      173.      174. /*  This  routine  will  be  called  by  the  PortAudio  engine  when  audio  is  nee

ded.    175. **  It  may  be  called  at  interrupt  level  on  some  machines  so  don't  do  anyt

hing    176. **  that  could  mess  up  the  system  like  calling  malloc()  or  free().    177. */      178.      179. static  int  wireCallback(  const  void  *inputBuffer,  void  *outputBuffer,      180.                                                  unsigned  long  framesPerBuffer,      181.                                                  const  PaStreamCallbackTimeInfo*  timeInfo,      182.                                                  PaStreamCallbackFlags  statusFlags,      183.                                                  void  *userData  )      184. {      185.        INPUT_SAMPLE  *  in;      186.        OUTPUT_SAMPLE  *  out;      187.        int  inStride;      188.        int  outStride;      189.        int  inDone  =  0;      190.        int  outDone  =  0;      191.        WireConfig_t  *config  =  (WireConfig_t  *)  userData;      

    unsigned int i;
    int inChannel, outChannel;

    // update window buffer size
    prev_framesperBuffer = curr_framesperBuffer;
    curr_framesperBuffer = framesPerBuffer;

    /* This may get called with NULL inputBuffer during initial setup. */
    if( inputBuffer == NULL || prev_framesperBuffer == 0) return 0;

    for (int k=0; k < 512; ++k)
        INPUT_ARY[prev_framesperBuffer+k] = INPUT_ARY[k];

    inChannel=0, outChannel=0;

    while( !(inDone && outDone) )
    {
        if( config->isInputInterleaved )
        {
            in = ((INPUT_SAMPLE*)inputBuffer) + inChannel;
            inStride = config->numInputChannels;
        }
        else
        {
            in = ((INPUT_SAMPLE**)inputBuffer)[inChannel];
            inStride = 1;
        }

        if( config->isOutputInterleaved )
        {
            out = ((OUTPUT_SAMPLE*)outputBuffer) + outChannel;
            outStride = config->numOutputChannels;
        }
        else
        {
            out = ((OUTPUT_SAMPLE**)outputBuffer)[outChannel];
            outStride = 1;
        }

        for( i=0; i<framesPerBuffer; i++ )
        {
            *out = CONVERT_IN_TO_OUT( *in );
            if (!inDone)
            {
                INPUT_ARY[curr_framesperBuffer-i-1] = *in;
                *out = OUTPUT_ARY[prev_framesperBuffer-i-1];
            }
            out += outStride;
            in += inStride;
        }

        if(inChannel < (config->numInputChannels - 1)) inChannel++;
        else inDone = 1;
        if(outChannel < (config->numOutputChannels - 1)) outChannel++;
        else outDone = 1;
    }

    for (i=curr_framesperBuffer+512; i < M; ++i)
    {
        printf("\nmissing %i\n\n", i);
        INPUT_ARY [i] = 0.0f;
    }

    if (RDY == 0)
        printf("\n\nERROR! DROPPED DATA VECTOR!\n\n");

    RDY = 0;

//  RDY = runKernels (INPUT_ARY, OUTPUT_ARY);
//  this fails for some unidentified reason
//  not using ISR to run GPU code

    return paContinue;
}

/*******************************************************************/
int main(void)
{
    RDY = 1;
    PaError err = paNoError;
    WireConfig_t CONFIG;
    WireConfig_t *config = &CONFIG;
    int configIndex = 0;

    err = Pa_Initialize();
    if( err != paNoError ) goto error;

    // ALLOCATES DATA ON GPU
    initPLP ();
    initAR ();
    coeff [0] = 1.0;

    printf("Please connect audio signal to input and listen for it on output!\n");
    printf("input format = %lu\n", INPUT_FORMAT );
    printf("output format = %lu\n", OUTPUT_FORMAT );
    printf("input device ID  = %d\n", INPUT_DEVICE );
    printf("output device ID = %d\n", OUTPUT_DEVICE );

    if( INPUT_FORMAT == OUTPUT_FORMAT )
    {
        gInOutScaler = 1.0;
    }
    else if( (INPUT_FORMAT == paInt16) && (OUTPUT_FORMAT == paFloat32) )
    {
        gInOutScaler = 1.0/32768.0;
    }
    else if( (INPUT_FORMAT == paFloat32) && (OUTPUT_FORMAT == paInt16) )
    {
        gInOutScaler = 32768.0;
    }

    config->isInputInterleaved=0;
    config->isOutputInterleaved=0;
    config->numInputChannels=1;
    config->numOutputChannels=2;
    config->framesPerCallback=1536;

    printf("------------------------------------------------\n" );
    printf("Configuration #%d\n", configIndex++ );
    err = TestConfiguration( config );
    /* Give user a chance to bail out. */

    if( err == 1 )
    {
        err = paNoError;
        goto done;
    }
    else if( err != paNoError ) goto error;

done:
    Pa_Terminate();
    delMEM ();
//  destroyCUBLASFunc ();
//  free(matrix);
    printf("\naudio streaming complete.\n"); fflush(stdout);
    return 0;

error:
    Pa_Terminate();
    fprintf( stderr, "An error occurred while using the portaudio stream\n" );
    fprintf( stderr, "Error number: %d\n", err );
    fprintf( stderr, "Error message: %s\n", Pa_GetErrorText( err ) );
    printf("Hit ENTER to quit.\n"); fflush(stdout);
    getchar();
    return -1;
}

static PaError TestConfiguration( WireConfig_t *config )
{
    int c;
    PaError err = paNoError;
    PaStream *stream;
    PaStreamParameters inputParameters, outputParameters;

    printf("input %sinterleaved!\n", (config->isInputInterleaved ? " " : "NOT ") );
    printf("output %sinterleaved!\n", (config->isOutputInterleaved ? " " : "NOT ") );
    printf("input channels = %d\n", config->numInputChannels );
    printf("output channels = %d\n", config->numOutputChannels );
    printf("framesPerCallback = %d\n", config->framesPerCallback );

    inputParameters.device = INPUT_DEVICE;              /* default input device */
    if (inputParameters.device == paNoDevice) {
        fprintf(stderr,"Error: No default input device.\n");
        goto error;
    }
    inputParameters.channelCount = config->numInputChannels;
    inputParameters.sampleFormat = INPUT_FORMAT | (config->isInputInterleaved ? 0 : paNonInterleaved);
    inputParameters.suggestedLatency = Pa_GetDeviceInfo( inputParameters.device )->defaultLowInputLatency;
    printf ("Input Latency %f\n", inputParameters.suggestedLatency);
    inputParameters.hostApiSpecificStreamInfo = NULL;

    outputParameters.device = OUTPUT_DEVICE;            /* default output device */
    if (outputParameters.device == paNoDevice) {
        fprintf(stderr,"Error: No default output device.\n");
        goto error;
    }
    outputParameters.channelCount = config->numOutputChannels;
    outputParameters.sampleFormat = OUTPUT_FORMAT | (config->isOutputInterleaved ? 0 : paNonInterleaved);
    outputParameters.suggestedLatency = Pa_GetDeviceInfo( outputParameters.device )->defaultLowOutputLatency;
    printf ("Output Latency %f\n", outputParameters.suggestedLatency);
    outputParameters.hostApiSpecificStreamInfo = NULL;

    err = Pa_OpenStream(
              &stream,
              &inputParameters,
              &outputParameters,
              SAMPLE_RATE,
              config->framesPerCallback, /* frames per buffer */
              paClipOff, /* we won't output out of range samples so don't bother clipping them */
              wireCallback,
              config );
    if( err != paNoError ) goto error;

    printf("\nStarting audio stream...\n");

    printf("Hit ENTER to start processing\n\n"); fflush(stdout);
    c = getchar();

    err = Pa_StartStream( stream );
    if( err != paNoError ) goto error;

    gsl_vector *x = gsl_vector_alloc (LP_ORDER);
    gsl_permutation *p = gsl_permutation_alloc (LP_ORDER);

    gsl_matrix_view m;
    gsl_vector_view b;
    int s;
    LONGLONG Freq;
    LONGLONG Now;
    LONGLONG Last;

    while (1)
    {
        if (RDY == 0)
        {
            // PreWhitening
            runFFTConv (INPUT_ARY, PLP_INPUT_ARY, coeff);

            // PLP Filter
            RDY = runKernels (PLP_INPUT_ARY, AR_INPUT_ARY, AUTOCORR);

            if (SINGULAR != 1)
            {
                // set AR_INPUT_ARY to OUTPUT_ARY
                if (TIMING)
                {
                    QueryPerformanceFrequency ( reinterpret_cast<LARGE_INTEGER*>(&Freq) );
                    QueryPerformanceCounter( reinterpret_cast<LARGE_INTEGER*>(&Last) );
                }

                // AR FILTER IMPLEMENTATION USING GNU SCIENTIFIC LIBRARY--GSL
                // AUTOCORR vector computed on GPU in frequency domain
                for (int i=0; i < LP_ORDER; ++i)
                    for (int j=i; j < LP_ORDER; ++j)
                    {
                        matrix[i+LP_ORDER*j] = AUTOCORR[j-i];
                        matrix[j+i*LP_ORDER] = AUTOCORR[j-i];
                    }

                for (int i=0; i < LP_ORDER; ++i)
                {
                    b_vect[i] = -AUTOCORR[i+1];
                }

                m = gsl_matrix_view_array(matrix, LP_ORDER, LP_ORDER);
                b = gsl_vector_view_array(b_vect, LP_ORDER);

                // matrix inversion using LU decomp
                gsl_linalg_LU_decomp (&m.matrix, p, &s);
                gsl_linalg_LU_solve (&m.matrix, p, &b.vector, x);

                coeff[0] = 1.0;
                for (int i=0; i < LP_ORDER; ++i)
                    coeff[i+1] = gsl_vector_get(x, i);

                if (TIMING)
                {
                    QueryPerformanceCounter( reinterpret_cast<LARGE_INTEGER*>(&Now) );
                    LONGLONG ElapsedCount = Now - Last;
                    LONGLONG TimerResolution = 1000; // milliseconds
                    double Milliseconds = ElapsedCount * TimerResolution / (double)Freq;
                    printf("Matrix Inversion Run speed: %3.3f ms\n", Milliseconds);
                }

                runFFTConv (AR_INPUT_ARY, OUTPUT_ARY, coeff);
                writeToFileOutput ((float*)OUTPUT_ARY);
                getchar();
            }
            else
            {
                for (int i=0; i < FILTER_W_SIZE; ++i)
                    OUTPUT_ARY [i] = INPUT_ARY[i];
                SINGULAR = 0;
            }
        }
    }

    gsl_permutation_free (p);
    gsl_vector_free(x);

done:
    printf("Closing stream.\n");
    err = Pa_CloseStream( stream );
    if( err != paNoError ) goto error;
    return 1;

error:
    return err;
}

// helpers
void ImportFromFile (float* MyNumbers)
{
    std::fstream myfile;

    myfile.open("res_filters.dat");

    for (int i=0; i < K_MAX-K_MIN; ++i) {
        for (int j=0; j < 8; ++j) {
            for (int k=0; k < 512; ++k) {
                myfile >> (MyNumbers)[(i*8)+(j*512)+k];
            }
        }
    }

    myfile.close();

    return;
}

void writeToFileTaps (float* H_ResidualFilterTaps)
{
    std::ofstream myfile;

    myfile.open ("taps.dat");

    for (int i=0; i < (K_MAX-K_MIN); ++i)
    {
        for (int j=0; j < 8; ++j) {
            myfile << (H_ResidualFilterTaps)[(i*8)+j];
            myfile << " ";
        }
        myfile << std::endl;
    }
    myfile.close();

    return;
}

void writeToFileTapVectors (float* H_ResidualFilterVectors)
{
    std::ofstream myfile;

    myfile.open ("vectors.dat");

    for (int i=0; i < 8; ++i)
    {
        for (int j=0; j < (K_MAX-K_MIN); ++j)
        {
            for (int k=0; k < FILTER_W_SIZE; ++k)
            {
                myfile << (H_ResidualFilterVectors)[i*i_INC+j*j_INC+k];
                myfile << " ";
            }
            myfile << std::endl;
        }
    }
    myfile.close();

    return;
}

void writeToFileMatrix (float* H_ResidualMatrix)
{
    std::ofstream myfile;

    myfile.open ("residual.dat");

    for (int j=0; j < (K_MAX-K_MIN); ++j)
    {
        for (int i=0; i < 8; ++i)
        {
            myfile << (H_ResidualMatrix)[i*(K_MAX-K_MIN)+j];
            myfile << " ";
        }
        myfile << std::endl;
    }
    myfile.close();

    return;
}

void writeToFileOutput (float* H_Output)
{
    std::ofstream myfile;

    myfile.open ("t_output.dat");

    for (int i=0; i < FILTER_W_SIZE; ++i)
    {
        myfile << (H_Output)[i];
        myfile << " ";
    }
    myfile.close();

    return;
}

CUDA GPU Driver File

1. /*    2.  *    Cascade_PLP.cu    3.  *    Cascade_PLP    4.  *    5.  *    MAIN  GPU  DRIVER  FILE    6.  *    runKernels  executes  PLP  and  computes  AUTOCORR  vector  for  AR  filter    7.  *    8.  *    Omer  Osman    9.  *    July  2011    10.  *    11.  */      12.        13. #include  "Cascade.h"      14. #include  <CUDA.h>      15. #include  <cuda_runtime_api.h>      16. #include  <cufft.h>      17.      18. #include  "coeffs/IFw_taps.dat"      19. #include  "coeffs/note.dat"      20. #include  "coeffs/polyphase.dat"      21. //#include  "coeffs/three.txt"      22. #include  "coeffs/x1.txt"      23.      24. typedef  float2  Complex;      25.      26. __constant__  float      27.        D_IFw_Taps  [FRAC_DELAYS*(2*INTERP_HALF_ORDER)];      28.      29.      30. //  useful  function  copied  from  nvidia  developer  forums  //      31. //  notifies  at  runtime  of  any  errors  in  CUDA  function  executions  failures      32. static  void  HandleError(  cudaError_t  err,      33.                                                const  char  *file,      34.                                                int  line  )  {      35.        if  (err  !=  cudaSuccess)  {      36.                printf(  "%s  in  %s  at  line  %d\n",  cudaGetErrorString(  err  ),      37.                              file,  line  );      38.                exit(  EXIT_FAILURE  );      39.        }      40. }      41. #define  HANDLE_ERROR(  err  )  (HandleError(  err,  __FILE__,  __LINE__  ))      42. #define  HANDLE_NULL(  a  )  {if  (a  ==  NULL)  {  \      43. printf(  "Host  memory  failed  in  %s  at  line  %d\n",  \      44. __FILE__,  __LINE__  );  \      45. exit(  EXIT_FAILURE  );}}      

Page 95: Real-Time Graphics Processing Unit Implementation

86

46. //  end  copied  functions      47.      48.      49. //  Data  Constructors/Destructor      50. extern  volatile  int  SINGULAR;      51. extern  "C"  void  initPLP  ();      52. extern  "C"  void  initAR  ();      53. extern  "C"  void  delMEM  ();      54. void  init_IFw_taps();      55. void  init_X_8x  (Complex**,  Complex**,  volatile  float*);      56. void  del_X_8x  (Complex**);      57. int  init_PolyphaseFIR  (Complex**,  cufftHandle&);      58. void  initPolyphaseData  (Complex**,  Complex**,  Complex**);      59. void  initResidualFilterData  (Complex**,  Complex**,  Complex**);      60. void  del_PolyphaseFIR  (cufftHandle&,  Complex**);      61. void  initPLPIR  (float**,  cufftHandle  &,  Complex**,  Complex**,  Complex**,  Complex

**,  Complex**,  Complex**);      62. void  delPLPIR  (float**,  cufftHandle  &,  Complex**,  Complex**,  Complex**);      63. void  copyPolyphaseOutput  (Complex*);      64.      65. //  Residual  Filters      66. void  setupResidualFilterTap  (float**,  float**);      67. void  delResidualFilterTap  (float**,  float**);      68.      69. //  modifiers      70. int  fftPadData(const  Complex*,  Complex**,  int,  int,  int);      71. int  fftPadKernel(const  Complex*,  Complex**,  int,  int,  int);      72. int  fftPadDataCentered(const  Complex*,  Complex**,  int,  int,  int);      73. int  fftPadKernelCentered(const  Complex*,  Complex**,  int,  int,  int);      74.      75. //  execution        76. extern  "C"  int  runKernels  (volatile  float*,  volatile  float*,  float*);      77. extern  "C"  void  runFFTConv  (volatile  float*,  volatile  float*,  float*);      78. float  cuda_malloc_test(  int  size,  bool  up  );      79. float  cuda_host_alloc_test(  int  size,  bool  up  );      80. void  runPolyphaseFFTConv  (cufftHandle  &,  Complex*,  Complex*,  Complex*,  Complex*,

 int);      81. void  findBestMSE  (float*,  int  &,  int  &);      82. int  calc3TapCoeff  (Complex*,  Complex*,  Complex*,  int,  int);      83. void  runPLPConvandAutocorr  (float*,  float*,  Complex*,  Complex*,  Complex*,  Comple

x*,  cufftHandle  &,  int,  int);      84.      85. //  helper  functions  copied  from  elsewhere,  perhaps  CUDA  by  Examples  book        86. void  chkCudaReturn(cudaError_t  err,  unsigned  int  myErrLoc);      87. void  printMemUsage  ();      88.      89. //  HELPERS  FOR  DEBUGGING      90. extern  "C"  void  ImportFromFile  (float**);      91. extern  "C"  void  writeToFileTaps  (float**);      92. extern  "C"  void  writeToFileTapVectors  (float**);      93. extern  "C"  void  writeToFileMatrix  (float**);      94. extern  "C"  void  writeToFileOutput  (float**);      95.      96. //  kernels  for  GPU      97. static  __device__  __host__  inline  Complex  ComplexAdd(Complex,  Complex);      98. static  __device__  __host__  inline  Complex  ComplexScale(Complex,  float);      99. static  __device__  __host__  inline  Complex  ComplexMul(Complex,  Complex);      100. static  __device__  __host__  inline  Complex  ComplexConjMul(Complex,  Comple

x);      101. static  __global__  void  ComplexPointwiseMulAndScale(Complex*,  const  Compl

ex*,  int,  float);      

Page 96: Real-Time Graphics Processing Unit Implementation

87

102. static  __global__  void  ResidualComplexPointwiseMulAndScale(Complex*,  const  Complex*,  int,  float);      

103. static  __global__  void  findOne_Tap_Coeffs  (Complex*,  Complex*,  float*,  Complex*);      

104. static  __global__  void  ResidualCalc  (Complex*,  float*);      105. static  __global__  void  FFTAutocorr  (Complex*,  int);      106.      107.      108. typedef  struct  ARdata_s      109. {      110.        Complex*      111.                H_X;      112.        Complex*      113.                H_ARcoeff;      114.        Complex*      115.                D_X;      116.        Complex*      117.                D_ARcoeff;      118.      119.        cufftHandle  ARFilter;      120.      121. }  ARdata_t;      122.      123. static  ARdata_t  AR;      124.      125. typedef  struct  PLPdata_s      126. {      127.        int      128.                RUNNING;      129.      130.        float*      131.                H_OutputSig;      132.        Complex*      133.                H_PLPIR;      134.        Complex*      135.                D_PLPIR;      136.        Complex*      137.                D_PLPIR_O;      138.        Complex*      139.                D_X;      140.        Complex*      141.                D_X_O;      142.      143.        //  Interpolated  Input  Signal      144.        Complex*      145.                H_X_8x;      146.        Complex*      147.                H_X;      148.      149.        //  padded  signal  and  filter  data      150.        Complex*  D_Polyphase_O;      151.        Complex*  D_PaddedSignal;      152.        Complex*  D_PaddedSignal2;    //  for  residual  calculation      153.        Complex*  D_PaddedResidualFiltSignal;      154.        Complex*  D_ResidFilt_O;      155.        Complex*  D_FilterKernel;      156.        Complex*  H_ConvolvedSignal;      157.      158.        //  Polyphase  Filter      159.        cufftHandle  PolyphaseFIR;      160.        cufftHandle  ResidualFIR;      

Page 97: Real-Time Graphics Processing Unit Implementation

88

161.        cufftHandle  FracDelayResidual;      162.        cufftHandle  PLPResidual;      163.      164.        //  1-­‐tap  Predictor      165.        float*  D_ResidualFilterTap;      166.        float*  H_ResidualFilterTap;      167.        float*  D_ResidualMatrix;      168.        float*  H_ResidualMatrix;      169.      170.        //  1-­‐tap  Residual  Filter  FFT  Setup      171.        Complex*  D_ResidFiltVect_O;      172.        Complex*  D_ResidualFilterVectors;      173.        Complex*  H_ResidualFilterVectors;      174.      175.        //  autocorr      176.        float*  H_Autocorr;      177.      178.        int  LENGTH;      179.      180. }  PLPdata_t;      181.      182. static  PLPdata_t  PLP;      183.      184. void  initAR  ()      185. {      186.        int        187.                ERROR_TYPE;      188.      189.        if  (VERBOSE)      190.        {      191.                printf("Initializing  HOST  and  DEVICE  memory  spaces  for  AR  filter

...\n");      192.                printMemUsage();      193.        }      194.      195.        ERROR_TYPE  =  cufftPlan1d  (&AR.ARFilter,  FILTER_W_SIZE,  CUFFT_C2C,  1)

;      196.        if  (ERROR_TYPE  !=  CUFFT_SUCCESS)        197.                fprintf(stderr,  "ERROR  UNABLE  TO  SETUP  RESIDUALS  FFT:  %d\n",  ERR

OR_TYPE);      198.      199.        HANDLE_ERROR  (cudaHostAlloc((void**)&AR.H_X,  sizeof(Complex)*FILTER_

W_SIZE,  cudaHostAllocDefault));      200.        HANDLE_ERROR  (cudaHostAlloc((void**)&AR.H_ARcoeff,  sizeof(Complex)*F

ILTER_W_SIZE,  cudaHostAllocDefault));      201.        HANDLE_ERROR  (cudaMalloc((void**)&AR.D_X,  sizeof(Complex)*FILTER_W_S

IZE));      202.        HANDLE_ERROR  (cudaMalloc((void**)&AR.D_ARcoeff,  sizeof(Complex)*FILT

ER_W_SIZE));      203.      204.      205.        for  (int  i=0;  i  <  FILTER_W_SIZE;  ++i)      206.        {      207.                AR.H_X[i].x  =  0.0f;      208.                AR.H_X[i].y  =  0.0f;      209.                AR.H_ARcoeff[i].x  =  0.0f;      210.                AR.H_ARcoeff[i].y  =  0.0f;      211.        }      212.      213.        HANDLE_ERROR  (cudaMemcpy  (AR.D_X,  AR.H_X,  sizeof(Complex)*FILTER_W_S

IZE,  cudaMemcpyHostToDevice));      

Page 98: Real-Time Graphics Processing Unit Implementation

89

    HANDLE_ERROR (cudaMemcpy (AR.D_ARcoeff, AR.H_ARcoeff, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyHostToDevice));

    if (VERBOSE)
    {
        printMemUsage ();
        printf("Done allocating AR filter mem spaces\n");
    }

    return;
}

void runFFTConv (volatile float* INPUTDATA, volatile float* OUTPUTDATA, float* COEFFS)
{
    int ERROR_TYPE;

    for (int i=0; i < FILTER_W_SIZE; ++i)
    {
        AR.H_X[i].x = INPUTDATA[i];
        AR.H_X[i].y = 0.0f;
        AR.H_ARcoeff[i].x = 0.0f;
        AR.H_ARcoeff[i].y = 0.0f;
    }

    for (int i=0; i < LP_ORDER+1; ++i)
        AR.H_ARcoeff[i].x = COEFFS[i];

    HANDLE_ERROR (cudaMemcpy (AR.D_X, AR.H_X, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyHostToDevice));
    HANDLE_ERROR (cudaMemcpy (AR.D_ARcoeff, AR.H_ARcoeff, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyHostToDevice));

    ERROR_TYPE = cufftExecC2C(AR.ARFilter, (cufftComplex *)AR.D_X, (cufftComplex *)AR.D_X, CUFFT_FORWARD);
    if (ERROR_TYPE != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform forward FFT: %d\n", ERROR_TYPE);

    ERROR_TYPE = cufftExecC2C(AR.ARFilter, (cufftComplex *)AR.D_ARcoeff, (cufftComplex *)AR.D_ARcoeff, CUFFT_FORWARD);
    if (ERROR_TYPE != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform forward FFT: %d\n", ERROR_TYPE);

    // Multiply the coefficients together and normalize the result
    ResidualComplexPointwiseMulAndScale<<<32, 256>>>
        (AR.D_X, AR.D_ARcoeff, FILTER_W_SIZE, 1.0f / FILTER_W_SIZE);
    chkCudaReturn(cudaGetLastError(),3);

    // Transform signal back
    if (cufftExecC2C(AR.ARFilter, (cufftComplex *)AR.D_X, (cufftComplex *)AR.D_X, CUFFT_INVERSE) != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform inverse FFT of convolved spectrum\n");


    HANDLE_ERROR (cudaMemcpy (AR.H_X, AR.D_X, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyDeviceToHost));

    for (int i=0; i < FILTER_W_SIZE; ++i)
    {
        OUTPUTDATA[i] = AR.H_X[i].x;
    }

    return;
}

void initPLP ()
{
    // Initialization Values
    Complex Zero;
    Complex One;

    Zero.x = 0.0;
    Zero.y = 0.0;
    One.x = 1.0;
    One.y = 0.0;

    int ERROR_TYPE = 0;
    int FFTwidth = FILTER_W_SIZE;

    // INIT MEM //
    if (VERBOSE)
    {
        printf("Initializing HOST and DEVICE memory spaces for PLP filter...\n");
        printMemUsage();
    }

    HANDLE_ERROR (cudaHostAlloc((void**)&PLP.H_Autocorr, sizeof(float)*(LP_ORDER+1), cudaHostAllocDefault));

    init_IFw_taps();
    PLP.LENGTH = init_PolyphaseFIR(&PLP.D_FilterKernel, PLP.PolyphaseFIR);
    HANDLE_ERROR (cudaHostAlloc((void**)&PLP.H_ResidualMatrix, sizeof(float)*(K_MAX-K_MIN)*FRAC_DELAYS, cudaHostAllocDefault));

    init_X_8x(&PLP.H_X, &PLP.H_X_8x, NULL);
    initPolyphaseData (&PLP.H_X_8x, &PLP.D_PaddedSignal, &PLP.D_Polyphase_O);

    initResidualFilterData (&PLP.H_X, &PLP.D_PaddedResidualFiltSignal, &PLP.D_ResidFilt_O);
    cudaHostAlloc((void**)&PLP.H_ConvolvedSignal, sizeof(Complex)*PLP.LENGTH, cudaHostAllocDefault);
    HANDLE_ERROR (cudaMalloc ((void**)&PLP.D_ResidualMatrix, sizeof(float)*FRAC_DELAYS*(K_MAX-K_MIN)));

    setupResidualFilterTap (&PLP.D_ResidualFilterTap, &PLP.H_ResidualFilterTap);

    // allocating residual filter vectors //
    if (VERBOSE)


    {
        printf("Allocating Residual Filter Vectors\n");
        printMemUsage ();
    }
    initPLPIR (&PLP.H_OutputSig, PLP.PLPResidual, &PLP.H_X, &PLP.D_X, &PLP.D_X_O, &PLP.H_PLPIR, &PLP.D_PLPIR, &PLP.D_PLPIR_O);

    ERROR_TYPE = cufftPlan1d (&PLP.FracDelayResidual, FFTwidth, CUFFT_C2C, 8*(K_MAX-K_MIN));
    if (ERROR_TYPE != CUFFT_SUCCESS)
        fprintf(stderr, "ERROR UNABLE TO SETUP RESIDUALS FFT: %d\n", ERROR_TYPE);
    ERROR_TYPE = cufftPlan1d (&PLP.ResidualFIR, FFTwidth, CUFFT_C2C, 1);
    if (ERROR_TYPE != CUFFT_SUCCESS)
        fprintf(stderr, "ERROR UNABLE TO SETUP RESIDUALS FFT: %d\n", ERROR_TYPE);

    HANDLE_ERROR (cudaMalloc((void**)&PLP.D_ResidualFilterVectors, sizeof(Complex)*((K_MAX-K_MIN)*FILTER_W_SIZE)*8));
    HANDLE_ERROR (cudaMalloc((void**)&PLP.D_ResidFiltVect_O, sizeof(Complex)*((K_MAX-K_MIN)*FILTER_W_SIZE)*8));
    HANDLE_ERROR (cudaHostAlloc((void**)&PLP.H_ResidualFilterVectors, sizeof(Complex)*((K_MAX-K_MIN)*FILTER_W_SIZE)*8, cudaHostAllocDefault));

    // setting initial state
    for (int i=0; i < 8; ++i)
    {
        for (int j=0; j < (K_MAX-K_MIN); ++j)
        {
            for (int k=0; k < FILTER_W_SIZE; ++k)
                PLP.H_ResidualFilterVectors[i*i_INC+j*j_INC+k] = Zero;
            PLP.H_ResidualFilterVectors[i*i_INC+j*j_INC] = One;
        }
    }

    // copy initial state into device
    HANDLE_ERROR (cudaMemcpy (PLP.D_ResidualFilterVectors, PLP.H_ResidualFilterVectors, sizeof(Complex)*((K_MAX-K_MIN)*FILTER_W_SIZE)*8, cudaMemcpyHostToDevice));

    if (VERBOSE)
    {
        printf("Residual vectors allocated\n");
        printMemUsage ();
    }

    if (VERBOSE)
        printf("Initialization Complete.\n\n");

    PLP.RUNNING = 0;

    return;
}

void delMEM ()
{
    // Clear MEM //
    printf("\nDeallocating HOST and DEVICE memory spaces...\n");


    HANDLE_ERROR (cudaFreeHost (PLP.H_Autocorr));
    HANDLE_ERROR (cudaFreeHost (PLP.H_ResidualMatrix));
    delResidualFilterTap (&PLP.D_ResidualFilterTap, &PLP.H_ResidualFilterTap);
    HANDLE_ERROR (cudaFree (PLP.D_ResidualFilterVectors));
    HANDLE_ERROR (cudaFree (PLP.D_ResidFiltVect_O));
    HANDLE_ERROR (cudaFreeHost (PLP.H_ResidualFilterVectors));

    HANDLE_ERROR (cudaFree(PLP.D_Polyphase_O));
    HANDLE_ERROR (cudaFree(PLP.D_PaddedSignal));
    HANDLE_ERROR (cudaFree(PLP.D_PaddedResidualFiltSignal));
    HANDLE_ERROR (cudaFree(PLP.D_ResidFilt_O));
    HANDLE_ERROR (cudaFree (PLP.D_ResidualMatrix));
    del_PolyphaseFIR (PLP.PolyphaseFIR, &PLP.D_FilterKernel);
    HANDLE_ERROR (cudaFreeHost(PLP.H_ConvolvedSignal));
    del_X_8x(&PLP.H_X_8x);
    HANDLE_ERROR (cudaFreeHost(PLP.H_X));
    delPLPIR (&PLP.H_OutputSig, PLP.PLPResidual, &PLP.D_X, &PLP.H_PLPIR, &PLP.D_PLPIR);
    HANDLE_ERROR(cudaFree(PLP.D_X_O));
    // (PLP.D_X and PLP.D_PLPIR are already freed inside delPLPIR)
    HANDLE_ERROR(cudaFree(PLP.D_PLPIR_O));
    cufftDestroy(PLP.FracDelayResidual);  // ADD ERROR CHECKING
    chkCudaReturn(cudaGetLastError(),3);

    cudaThreadExit();

    return;
}

int runKernels (volatile float* INPUT, volatile float* OUTPUT, float* AUTOCORR)
{
    int ERROR_TYPE = 0;

    int lag, frac;

    // Event Timing
    cudaEvent_t start, stop;
    float       elapsedTime;

    if (TIMING)
    {
        // START EVENT TIMER //
        HANDLE_ERROR( cudaEventCreate( &start ) );
        HANDLE_ERROR( cudaEventCreate( &stop ) );
        HANDLE_ERROR( cudaEventRecord( start, 0 ) );
    }

    // update H_X and H_X_8x
    for (int i=0; i < M; ++i)
    {
        PLP.H_X[i].x = INPUT[i];
        PLP.H_X_8x[8*i].x = INPUT[i];
    }


    // copy to PaddedSig and D_X
    if (PLP.H_X == NULL)
        printf("h_X\n\n\n");
    if (PLP.D_X == NULL)
        printf("D_X\n\n\n");
    HANDLE_ERROR(cudaMemcpy(PLP.D_X, PLP.H_X, FILTER_W_SIZE*sizeof(Complex), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(PLP.D_PaddedSignal, PLP.H_X_8x, (8*M+POLYPHASE_PAD)*sizeof(Complex), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(PLP.D_PaddedResidualFiltSignal, PLP.H_X, FILTER_W_SIZE*sizeof(Complex), cudaMemcpyHostToDevice));

    ////////////////////////////////////////////////////////////////
    // POLYPHASE FILTER
    ////////////////////////////////////////////////////////////////
    runPolyphaseFFTConv (PLP.PolyphaseFIR, PLP.D_FilterKernel, PLP.D_PaddedSignal, PLP.D_Polyphase_O, PLP.H_ConvolvedSignal, PLP.LENGTH);
    chkCudaReturn(cudaGetLastError(),3);

    if (VERBOSE)
    {
        printf("\nFinding one-tap filter Coefficients\n");
    }

    ////////////////////////////////////////////////////////////////
    // ONE TAP FILTER COEFFICIENTS CALCULATION
    ////////////////////////////////////////////////////////////////
    findOne_Tap_Coeffs <<<8, (K_MAX-K_MIN)>>> (PLP.D_X, PLP.D_Polyphase_O, PLP.D_ResidualFilterTap, PLP.D_ResidualFilterVectors);
    chkCudaReturn(cudaGetLastError(),3);

    // RESIDUAL FILTERING CALCULATION for all frac/bulk delays //
    ERROR_TYPE = cufftExecC2C(PLP.FracDelayResidual, (cufftComplex *)PLP.D_ResidualFilterVectors, (cufftComplex *)PLP.D_ResidFiltVect_O, CUFFT_FORWARD);
    if (ERROR_TYPE != CUFFT_SUCCESS)
        fprintf(stderr, "ERROR UNABLE TO RUN RESIDUALS FFT: %d\n", ERROR_TYPE);

    if (cufftExecC2C(PLP.ResidualFIR, (cufftComplex *)PLP.D_PaddedResidualFiltSignal, (cufftComplex *)PLP.D_ResidFilt_O, CUFFT_FORWARD) != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform forward FFT of Padded Signal\n");

    // Multiply the coefficients together and normalize the result
    ResidualComplexPointwiseMulAndScale<<<32, 256>>>
        (PLP.D_ResidFiltVect_O, PLP.D_ResidFilt_O, FRAC_DELAYS*(K_MAX-K_MIN)*FILTER_W_SIZE, 1.0f / FILTER_W_SIZE);
    chkCudaReturn(cudaGetLastError(),3);

    // Transform signal back
    if (cufftExecC2C(PLP.FracDelayResidual, (cufftComplex *)PLP.D_ResidFiltVect_O, (cufftComplex *)PLP.D_ResidFiltVect_O, CUFFT_INVERSE) != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform inverse FFT of convolved spectrum\n");


    ResidualCalc <<<8, (K_MAX-K_MIN)>>> (PLP.D_ResidFiltVect_O, PLP.D_ResidualMatrix);

    ////////////////////////////////////////////////////////////////
    // 3T PLP FILTER CALCULATION
    ////////////////////////////////////////////////////////////////
    HANDLE_ERROR (cudaMemcpy (PLP.H_ResidualMatrix, PLP.D_ResidualMatrix, sizeof(float)*(K_MAX-K_MIN)*FRAC_DELAYS, cudaMemcpyDeviceToHost));

    findBestMSE (PLP.H_ResidualMatrix, lag, frac);
    SINGULAR = calc3TapCoeff (PLP.H_PLPIR, PLP.H_ConvolvedSignal, PLP.H_X, lag, frac);  // ConvolvedSignal is polyphase FIR output

    if (SINGULAR == 1)
        return 2;

    float* H_ResidualFilterTap;
    HANDLE_ERROR (cudaMemcpy (PLP.D_PLPIR, PLP.H_PLPIR, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyHostToDevice));

    ////////////////////////////////////////////////////////////////
    // OUTPUT VECTOR CALCULATION AND AUTOCORR VECT CALC
    ////////////////////////////////////////////////////////////////
    runPLPConvandAutocorr (AUTOCORR, PLP.H_OutputSig, PLP.D_PLPIR, PLP.D_PLPIR_O, PLP.D_X, PLP.D_X_O, PLP.PLPResidual, lag, frac);

    for (int i=0; i < M-512; ++i)
        OUTPUT[i] = PLP.H_OutputSig[i];

    // STOP EVENT TIMER ///
    if (TIMING)
    {
        HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
        HANDLE_ERROR( cudaEventSynchronize( stop ) );
        HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime, start, stop ) );
        HANDLE_ERROR( cudaEventDestroy( start ) );
        HANDLE_ERROR( cudaEventDestroy( stop ) );
        printf("TOTAL PROCESSING TIME: %2.3f ms\n", elapsedTime);
    }

    PLP.RUNNING = 1;

    return 1;
}

// Residual Filters
void setupResidualFilterTap (float** D_ResidualFilterTap, float** H_ResidualFilterTap)
{
    if (DEBUG_ON)
        printf("Allocating Space for Residual Filters Tap Coefficient\n");

    HANDLE_ERROR (cudaMalloc((void**)&(*D_ResidualFilterTap), sizeof(float)*8*(K_MAX-K_MIN)));


    HANDLE_ERROR (cudaHostAlloc((void**)&(*H_ResidualFilterTap), sizeof(float)*8*(K_MAX-K_MIN), cudaHostAllocDefault));

    // SET TO ZERO FOR SAFETY

    return;
}

void delResidualFilterTap (float** D_ResidualFilterTap, float** H_ResidualFilterTap)
{
    if (DEBUG_ON)
        printf("Deallocating Space for Residual Filters Tap Coefficients\n");

    HANDLE_ERROR (cudaFree (*D_ResidualFilterTap));
    HANDLE_ERROR (cudaFreeHost(*H_ResidualFilterTap));

    return;
}

// DATA CONSTRUCTORS/DESTRUCTORS //////////////////////
void init_IFw_taps ()
{
    if (DEBUG_ON)
        printf("Initializing IFw_Taps in DEVICE CONST MEMORY\n");

    float* H_IFw_Taps = NULL;

    // allocate pinned memory on HOST
    HANDLE_ERROR (cudaHostAlloc((void**)&H_IFw_Taps, FRAC_DELAYS*(2*INTERP_HALF_ORDER)*sizeof(*H_IFw_Taps), cudaHostAllocDefault));

    // Fill Data on HOST
    for (int i=0; i < FRAC_DELAYS; ++i) {
        for (int j=0; j < 2*INTERP_HALF_ORDER; ++j) {
            H_IFw_Taps[(i*2*INTERP_HALF_ORDER)+j] = in_IFw_taps[i][j];
        }
    }

    // copy data from HOST to DEVICE CONST MEM
    HANDLE_ERROR (cudaMemcpyToSymbol(D_IFw_Taps, H_IFw_Taps, FRAC_DELAYS*(2*INTERP_HALF_ORDER)*sizeof(*H_IFw_Taps), 0, cudaMemcpyHostToDevice));

    // Clear Host MEM
    HANDLE_ERROR (cudaFreeHost(H_IFw_Taps));

    if (DEBUG_ON)
        printf("Completed initialization of IFw_Taps in DEVICE CONST MEMORY\n");

    return;
}

void init_X_8x (Complex** H_X, Complex** H_X_8x, volatile float* INPUT)
{
    float*


        Temp = NULL;

    if (DEBUG_ON)
        printf("Initializing Host side Input Signal\n");

    // allocate pinned memory on HOST
    if ((*H_X) == NULL)
    {
        HANDLE_ERROR (cudaHostAlloc((void**)&(*H_X), FILTER_W_SIZE*sizeof(Complex), cudaHostAllocDefault));
        HANDLE_ERROR (cudaHostAlloc((void**)&(*H_X_8x), (FRAC_DELAYS*M+POLYPHASE_PAD)*sizeof(Complex), cudaHostAllocDefault));
    }

    // Fill Data on HOST
    for (int i=0; i < FRAC_DELAYS*M; ++i) {
        (*H_X_8x)[i].x = 0.0;
        (*H_X_8x)[i].y = 0.0;
    }

    for (int i=0; i < M; ++i) {
        (*H_X)[i].x = 0.0f;
        (*H_X)[i].y = 0.0f;
        (*H_X_8x)[8*i].x = 0.0f;
    }

    for (int i=8*M; i < 8*M+POLYPHASE_PAD; ++i)
    {
        (*H_X_8x)[i].x = 0.0;
        (*H_X_8x)[i].y = 0.0;
    }

    if (DEBUG_ON)
        printf("Completed initialization of Input Signal\n");

    return;
}

void del_X_8x (Complex** H_X_8x)
{
    if (DEBUG_ON)
        printf("Clearing Host side Input Signal\n");
    // Clear Host MEM
    HANDLE_ERROR (cudaFreeHost(*H_X_8x));

    return;
}

int init_PolyphaseFIR (Complex** D_FilterKernel, cufftHandle & Polyphase_FIR)
{
    int ERROR_TYPE = 0;

    if (DEBUG_ON)
        printf("Initializing Polyphase FIR...\n");

    Complex* H_PaddedFilterKernel = NULL;
    Complex* H_PolyphaseCoeffs = NULL;

    // allocate pinned memory on HOST


    HANDLE_ERROR (cudaHostAlloc((void**)&H_PolyphaseCoeffs, POLYPHASE_COEFFS*sizeof(*H_PolyphaseCoeffs), cudaHostAllocDefault));

    // Initialize
    for (int i=0; i < POLYPHASE_COEFFS; ++i) {
        H_PolyphaseCoeffs[i].x = 0.0;
        H_PolyphaseCoeffs[i].y = 0.0;
    }

    // Fill Coeffs on HOST
    for (int i=0; i < POLYPHASE_COEFFS; ++i) {
        H_PolyphaseCoeffs[i].x = in_polyphase_coeff[i];
        H_PolyphaseCoeffs[i].y = 0.0;
    }

    // Pad filter kernel
    int new_size = fftPadKernel(H_PolyphaseCoeffs, &H_PaddedFilterKernel, POLYPHASE_COEFFS, SIGNAL_SIZE, POLYPHASE_PAD);
    int mem_size = sizeof(Complex) * new_size;

    // Allocate device memory for filter kernel
    HANDLE_ERROR(cudaMalloc((void**)&(*D_FilterKernel), mem_size));

    // Copy filter kernel to device
    HANDLE_ERROR(cudaMemcpy((*D_FilterKernel), H_PaddedFilterKernel, mem_size, cudaMemcpyHostToDevice));

    // CUFFT plan
    ERROR_TYPE = cufftPlan1d(&Polyphase_FIR, new_size, CUFFT_C2C, 1);
    if (ERROR_TYPE != CUFFT_SUCCESS)
        printf("\nERROR!! CANNOT INIT Polyphase FIR\n\n");
    chkCudaReturn(cudaGetLastError(),3);

    // Clear Host Memory
    HANDLE_ERROR (cudaFreeHost(H_PaddedFilterKernel));
    HANDLE_ERROR (cudaFreeHost(H_PolyphaseCoeffs));

    if (DEBUG_ON)
        printf("Completed Polyphase FIR Filter Initialization\n");

    return new_size;
}

void del_PolyphaseFIR (cufftHandle & PolyphaseFIR, Complex** D_FilterKernel)
{
    if (DEBUG_ON)
        printf("Clearing Polyphase FIR from Device\n");

    // Clear DEVICE MEM
    HANDLE_ERROR (cudaFree(*D_FilterKernel));

    cufftDestroy(PolyphaseFIR);  // ADD ERROR CHECKING
    chkCudaReturn(cudaGetLastError(),3);

    return;
}
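The AR, residual, and polyphase filters above all share one FFT-convolution pattern: zero-pad, forward-transform signal and kernel with cufftExecC2C, multiply pointwise with a 1/N scale (ResidualComplexPointwiseMulAndScale), then inverse-transform. A minimal host-side sketch of that flow, in NumPy rather than CUDA (illustrative only; fft_fir is a hypothetical name, not part of the thesis code):

```python
import numpy as np

def fft_fir(x, h):
    """FIR filtering via the FFT: pad, forward-transform both operands,
    multiply bin-by-bin, inverse-transform -- the same flow the GPU code
    runs with cufftExecC2C and the pointwise-multiply kernel."""
    n = len(x) + len(h) - 1             # pad length that avoids circular wrap
    X = np.fft.fft(x, n)
    H = np.fft.fft(h, n)
    return np.real(np.fft.ifft(X * H))  # ifft applies the 1/n scale

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, -1.0, 0.5])
print(np.allclose(fft_fir(x, h), np.convolve(x, h)))  # True
```

The padding to len(x)+len(h)-1 is what fftPadData/fftPadKernel provide on the GPU side: without it the circular convolution of the DFT would wrap the filter tail back onto the start of the block.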


void initResidualFilterData (Complex** Signal, Complex** D_PaddedSignal, Complex** D_ResidFilt_O)
{
    if (DEBUG_ON)
        printf("Initializing Residual FIR Data\n");
    Complex* H_PaddedSignal;

    // pad data //
    int new_size = fftPadData(*Signal, &H_PaddedSignal, M, FILTER_W_SIZE, FILTER_W_SIZE-M);
    int mem_size = sizeof(Complex) * new_size;

    // Allocate DEVICE memory for Padded Signal
    if (*D_PaddedSignal == NULL)
    {
        HANDLE_ERROR(cudaMalloc((void**)&(*D_PaddedSignal), mem_size));
        HANDLE_ERROR(cudaMalloc((void**)&(*D_ResidFilt_O), mem_size));
    }

    // Copy host memory to device
    HANDLE_ERROR(cudaMemcpy((*D_PaddedSignal), H_PaddedSignal, mem_size, cudaMemcpyHostToDevice));

    if (DEBUG_ON)
        printf("Residual Filter Padded Signal Size %d\n", new_size);

    HANDLE_ERROR (cudaFreeHost(H_PaddedSignal));

    if (DEBUG_ON)
        printf("Completed Residual FIR Data initialization\n");
    return;
}

void initPolyphaseData (Complex** H_X_8x, Complex** D_PaddedSignal, Complex** D_Polyphase_O)
{
    if (DEBUG_ON)
        printf("Initializing Polyphase FIR Data\n");
    Complex* H_PaddedSignal;

    // pad data //
    int new_size = fftPadData(*H_X_8x, &H_PaddedSignal, SIGNAL_SIZE, POLYPHASE_COEFFS, POLYPHASE_PAD);
    int mem_size = sizeof(Complex) * new_size;

    // Allocate DEVICE memory for Padded Signal
    if (*D_PaddedSignal == NULL)
    {
        HANDLE_ERROR(cudaMalloc((void**)&(*D_PaddedSignal), mem_size));
        HANDLE_ERROR(cudaMalloc((void**)&(*D_Polyphase_O), mem_size));
    }

    // Copy host memory to device
    HANDLE_ERROR(cudaMemcpy((*D_PaddedSignal), H_PaddedSignal, mem_size, cudaMemcpyHostToDevice));


    if (DEBUG_ON)
        printf("Padded Signal Size %d\n", new_size);

    HANDLE_ERROR (cudaFreeHost(H_PaddedSignal));

    if (DEBUG_ON)
        printf("Completed Polyphase FIR Data initialization\n");
    return;
}

void initPLPIR (float** H_OutputSig, cufftHandle & PLPResidual, Complex** H_X, Complex** D_X, Complex** D_X_O, Complex** H_PLPIR, Complex** D_PLPIR, Complex** D_PLPIR_O)
{
    if (DEBUG_ON)
        printf("Initializing PLP IR FIR Data\n");

    Complex* H_PaddedSignal;

    HANDLE_ERROR (cudaHostAlloc ((void**)H_OutputSig, sizeof(float)*M, cudaHostAllocDefault));

    if (cufftPlan1d (&PLPResidual, FILTER_W_SIZE, CUFFT_C2C, 1) != CUFFT_SUCCESS)
        fprintf(stderr, "ERROR UNABLE TO SETUP RESIDUALS FFT\n");

    HANDLE_ERROR (cudaHostAlloc((void**)H_PLPIR, sizeof(Complex)*FILTER_W_SIZE, cudaHostAllocDefault));
    HANDLE_ERROR (cudaMalloc((void**)D_PLPIR, sizeof(Complex)*FILTER_W_SIZE));
    HANDLE_ERROR (cudaMalloc((void**)D_PLPIR_O, sizeof(Complex)*FILTER_W_SIZE));

    // pad data //
    int new_size = fftPadData(*H_X, &H_PaddedSignal, M, 1+(K_MAX), FILTER_W_SIZE-M);
    int mem_size = sizeof(Complex) * new_size;

    // Allocate DEVICE memory for Padded Signal
    HANDLE_ERROR(cudaMalloc((void**)&(*D_X), mem_size));
    HANDLE_ERROR(cudaMalloc((void**)&(*D_X_O), mem_size));

    if (DEBUG_ON)
        printf("Padded Signal Size %d\n", new_size);

    HANDLE_ERROR (cudaFreeHost(H_PaddedSignal));

    if (DEBUG_ON)
        printf("Completed PLP IR FIR Data initialization\n");
    return;
}

void delPLPIR (float** H_OutputSig, cufftHandle & PLPResidual, Complex** D_X, Complex** H_PLPIR, Complex** D_PLPIR)
{
    HANDLE_ERROR (cudaFreeHost (*H_PLPIR));
    HANDLE_ERROR (cudaFree (*D_X));
    HANDLE_ERROR (cudaFree (*D_PLPIR));
    HANDLE_ERROR (cudaFreeHost (*H_OutputSig));
    cufftDestroy(PLPResidual);  // ADD ERROR CHECKING


    return;
}

// FUNCTIONS NO LONGER USED //
// Pad data
int fftPadData(const Complex* signal, Complex** padded_signal, int signal_size, int kernel_size, int PAD)
{
    if (DEBUG_ON)
        printf("Padding fft data vector\n");

    int new_size = signal_size + PAD;

    // Pad signal
    Complex* new_data;
    HANDLE_ERROR ( cudaHostAlloc((void**)&new_data, sizeof(Complex)*new_size, cudaHostAllocDefault));

    memcpy(new_data + 0, signal, (signal_size) * sizeof(Complex));
    memset(new_data + signal_size, 0, (new_size - signal_size) * sizeof(Complex));

    *padded_signal = new_data;

    if (DEBUG_ON)
        printf("Completed padding fft data vector\n");

    return new_size;
}

// Pad Kernel
int fftPadKernel(const Complex* filter_kernel, Complex** padded_filter_kernel, int filter_kernel_size, int signal_size, int PAD)
{
    if (DEBUG_ON)
        printf("Padding fft kernel vector\n");

    int new_size = signal_size + PAD;

    Complex* new_data;

    // Pad filter
    //new_data = (Complex*)malloc(sizeof(Complex) * new_size);
    HANDLE_ERROR ( cudaHostAlloc((void**)&new_data, sizeof(Complex)*new_size, cudaHostAllocDefault));

    memcpy(new_data + 0, filter_kernel, (filter_kernel_size) * sizeof(Complex));
    memset(new_data + filter_kernel_size, 0, (new_size - filter_kernel_size) * sizeof(Complex));

    *padded_filter_kernel = new_data;

    if (DEBUG_ON)
        printf("Completed padding fft kernel vector\n");

    return new_size;
}
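fftPadData and fftPadKernel copy the samples and zero-fill the tail so the transform length covers the full linear convolution. The same memcpy/memset pair can be sketched on the host like this (illustrative only; fft_pad is a hypothetical name, not part of the thesis code):

```python
import numpy as np

def fft_pad(signal, pad):
    """Mirror of fftPadData's memcpy + memset pair: keep the samples,
    append `pad` zeros so the FFT sees a block long enough to hold the
    linear-convolution result without circular wraparound."""
    out = np.zeros(len(signal) + pad, dtype=complex)
    out[:len(signal)] = signal   # memcpy part
    return out                   # trailing zeros are the memset part

padded = fft_pad(np.array([1.0, 2.0, 3.0]), pad=5)
print(len(padded), padded[3:].any())  # 8 False
```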


// Pad Data Centered
int fftPadDataCentered(const Complex* signal, Complex** padded_signal, int signal_size, int kernel_size, int PAD)
{
    if (DEBUG_ON)
        printf("Padding fft data vector\n");

    int minRadius = kernel_size / 2;
    int maxRadius = kernel_size - minRadius;
    int new_size = signal_size + maxRadius + PAD;
    int edge_pad = PAD/2;

    // Pad signal
    //Complex* new_data = (Complex*)malloc(sizeof(Complex) * new_size);
    Complex* new_data;
    HANDLE_ERROR ( cudaHostAlloc((void**)&new_data, sizeof(Complex)*new_size, cudaHostAllocDefault));
    memset(new_data + 0, 0, edge_pad * sizeof(Complex));
    memcpy(new_data + edge_pad, signal, (edge_pad + signal_size) * sizeof(Complex));
    memset(new_data + signal_size, 0, (new_size - signal_size) * sizeof(Complex));
    *padded_signal = new_data;

    if (DEBUG_ON)
        printf("Completed padding fft data vector\n");
    return new_size;
}

// Pad Kernel Centered
int fftPadKernelCentered(const Complex* filter_kernel, Complex** padded_filter_kernel, int filter_kernel_size, int signal_size, int PAD)
{
    if (DEBUG_ON)
        printf("Padding fft kernel vector\n");

    int minRadius = filter_kernel_size / 2;
    int maxRadius = filter_kernel_size - minRadius;
    int new_size = signal_size + maxRadius + PAD;

    Complex* new_data;

    // Pad filter
    //new_data = (Complex*)malloc(sizeof(Complex) * new_size);
    HANDLE_ERROR ( cudaHostAlloc((void**)&new_data, sizeof(Complex)*new_size, cudaHostAllocDefault));
    memcpy(new_data + 0, filter_kernel + minRadius, maxRadius * sizeof(Complex));
    memset(new_data + maxRadius, 0, (new_size - filter_kernel_size) * sizeof(Complex));
    memcpy(new_data + new_size - minRadius, filter_kernel, minRadius * sizeof(Complex));
    *padded_filter_kernel = new_data;

    if (DEBUG_ON)
        printf("Completed padding fft kernel vector\n");
    return new_size;


}


/// RUN OPERATIONS
void runPolyphaseFFTConv (cufftHandle & Polyphase_FIR, Complex* D_FilterKernel, Complex* D_PaddedSignal, Complex* D_Polyphase_O, Complex* H_ConvolvedSignal, int LENGTH)
{
    int ERROR_TYPE = 0;

    if (DEBUG_ON)
        printf("Running Polyphase FIR Convolution\n");

    // Transform signal and kernel
    ERROR_TYPE = cufftExecC2C(Polyphase_FIR, (cufftComplex *)D_PaddedSignal, (cufftComplex *)D_Polyphase_O, CUFFT_FORWARD);
    if (ERROR_TYPE != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform forward FFT of Padded Signal: %d\n", ERROR_TYPE);

    if (PLP.RUNNING == 0)
        if (cufftExecC2C(Polyphase_FIR, (cufftComplex *)D_FilterKernel, (cufftComplex *)D_FilterKernel, CUFFT_FORWARD) != CUFFT_SUCCESS)
            fprintf(stderr, "FAILED to perform forward FFT of padded Filter Kernel\n");


    // Multiply the coefficients together and normalize the result
    ComplexPointwiseMulAndScale<<<32, 256>>>(D_Polyphase_O, D_FilterKernel, LENGTH, 1.0f / LENGTH);

    // Transform signal back
    if (cufftExecC2C(Polyphase_FIR, (cufftComplex *)D_Polyphase_O, (cufftComplex *)D_Polyphase_O, CUFFT_INVERSE) != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform inverse FFT of convolved spectrum\n");

    // Copy output from device memory to host
    HANDLE_ERROR(cudaMemcpy(H_ConvolvedSignal, D_Polyphase_O, sizeof(Complex)*LENGTH, cudaMemcpyDeviceToHost));


    if (DEBUG_ON)
        printf("Completed Polyphase FIR convolution\n");

    return;
}


void findBestMSE (float* H_ResidualMatrix, int & lag, int & frac)
{
    float currentMSE = 9999;

    lag = 44;
    frac = 0;
    int i, j;


    for (i=0; i < 8; ++i)


    {
        for (j=0; j < (K_MAX-K_MIN); ++j)
        {
            if (H_ResidualMatrix[i*(K_MAX-K_MIN)+j] < currentMSE)
            {
                currentMSE = H_ResidualMatrix[i*(K_MAX-K_MIN)+j];
                lag = j;
                frac = i;
            }
        }
    }

    if (VERBOSE)
        printf ("Best MSE: %f,  bulk delay: %d, frac delay: %d\n", currentMSE, (lag+K_MIN), frac);

    return;
}


int calc3TapCoeff (Complex* PLP_IR, Complex* H_Convolved, Complex* H_X, int lag, int frac)
{
    int size = M;
    float taps [3] = {0,0,0};
    float a_M [3] = {0,0,0};
    float b_0 [3] = {0,0,0};
    float invMat [3][3] = { {0,0,0},{0,0,0},{0,0,0} };
    float det_A = 0;
    float x_0 [M];
    float x_Mminus [M];
    float x_M [M];
    float x_Mplus [M];


    for (int i=0; i < size; ++i)
    {
        x_0[i] = H_X[i+lag].x;
        x_Mminus[i] = H_Convolved[POLYPHASE_BULK_DELAY+(lag+i-1)*8+frac].x*8;
        x_M[i] = H_Convolved[POLYPHASE_BULK_DELAY+(lag+i)*8+frac].x*8;
        x_Mplus[i] = H_Convolved[POLYPHASE_BULK_DELAY+(lag+i+1)*8+frac].x*8;
    }

    // dot products
    for (int i=0; i < size; ++i)
    {
        b_0[0] += x_0[i]*x_Mminus[i];
        b_0[1] += x_0[i]*x_M[i];
        b_0[2] += x_0[i]*x_Mplus[i];


        a_M[0] += x_Mminus[i]*x_Mminus[i];
        a_M[1] += x_Mminus[i]*x_M[i];
        a_M[2] += x_Mminus[i]*x_Mplus[i];
    }

    // manual matrix inverse
    det_A = a_M[0]*(a_M[0]*a_M[0] - a_M[1]*a_M[1]) + a_M[1]*(a_M[1]*a_M[2] - a_M[0]*a_M[1]) + a_M[2]*(a_M[1]*a_M[1] - a_M[0]*a_M[2]);

    if (det_A == 0)
    {
        printf ("ERROR! SINGULAR MATRIX INVERSION!\n\n\n");
        return 1;
    }

    if (DEBUG_ON)
    {
        printf ("b_0 \n%3.3f\n%3.3f\n%3.3f\n", b_0[0], b_0[1], b_0[2]);
        printf ("a_M \n%3.3f\n%3.3f\n%3.3f\n", a_M[0], a_M[1], a_M[2]);
    }

    invMat [0][0] = (1/det_A) *(a_M[0]*a_M[0] - a_M[1]*a_M[1]);
    invMat [0][1] = (1/det_A) *(a_M[2]*a_M[1] - a_M[1]*a_M[0]);
    invMat [0][2] = (1/det_A) *(a_M[1]*a_M[1] - a_M[2]*a_M[0]);
    invMat [1][0] = (1/det_A) *(a_M[1]*a_M[2] - a_M[1]*a_M[0]);
    invMat [1][1] = (1/det_A) *(a_M[0]*a_M[0] - a_M[2]*a_M[2]);
    invMat [1][2] = (1/det_A) *(a_M[2]*a_M[1] - a_M[0]*a_M[1]);
    invMat [2][0] = (1/det_A) *(a_M[1]*a_M[1] - a_M[0]*a_M[2]);
    invMat [2][1] = (1/det_A) *(a_M[1]*a_M[2] - a_M[0]*a_M[1]);
    invMat [2][2] = (1/det_A) *(a_M[0]*a_M[0] - a_M[1]*a_M[1]);

    if (DEBUG_ON)
    {
        printf ("output\n");
        printf ("%3.3f %3.3f %3.3f\n", invMat[0][0],invMat[0][1],invMat[0][2]);
        printf ("%3.3f %3.3f %3.3f\n", invMat[1][0],invMat[1][1],invMat[1][2]);
        printf ("%3.3f %3.3f %3.3f\n\n", invMat[2][0],invMat[2][1],invMat[2][2]);
    }


    for (int i=0; i < 3; ++i)
        for (int j=0; j < 3; ++j)
        {
            taps [i] += invMat[i][j]*b_0[j];
        }

    if (DEBUG_ON)
        printf ("taps %3.3f %3.3f %3.3f\n", taps[0], taps[1], taps[2]);


    // best fit impulse response
    for (int i=0; i < FILTER_W_SIZE; ++i)
    {
        PLP_IR [i].x = 0.0f;


        PLP_IR [i].y = 0.0f;
    }

    PLP_IR[0].x = 1.0;
    for (int i = (K_MIN+lag)-INTERP_HALF_ORDER-1; i < (K_MIN+lag)+INTERP_HALF_ORDER-1; ++i)
        PLP_IR [i].x = -taps[0]*in_IFw_taps[frac][i-(-1+(K_MIN+lag)-INTERP_HALF_ORDER)];
    for (int i = (K_MIN+lag)-INTERP_HALF_ORDER; i < (K_MIN+lag)+INTERP_HALF_ORDER; ++i)
        PLP_IR [i].x -= taps[1]*in_IFw_taps[frac][i-((K_MIN+lag)-INTERP_HALF_ORDER)];
    for (int i = (K_MIN+lag)-INTERP_HALF_ORDER+1; i < (K_MIN+lag)+INTERP_HALF_ORDER+1; ++i)
        PLP_IR [i].x -= taps[2]*in_IFw_taps[frac][i-(1+(K_MIN+lag)-INTERP_HALF_ORDER)];


    return 0;
}

int mynum = 0;
void runPLPConvandAutocorr (float* H_Autocorr, float* H_OutputSig, Complex* D_PLPIR, Complex* D_PLPIR_O, Complex* D_X, Complex* D_X_O, cufftHandle & PLPResidual, int lag, int frac)
{
    Complex* Output = NULL;
    Complex* D_Autocorr = NULL;
    int offset = K_MIN+lag+round((double)frac/8.0);

    HANDLE_ERROR (cudaMalloc (&D_Autocorr, sizeof(Complex)*FILTER_W_SIZE));
    HANDLE_ERROR (cudaHostAlloc (&Output, sizeof(Complex)*FILTER_W_SIZE, cudaHostAllocDefault));

    float* foo;
    Complex* foo2;
    if (mynum == 100)
    {
        printf("\n\nInput D_X\n");
        HANDLE_ERROR (cudaHostAlloc((void**)&foo, sizeof(float)*FILTER_W_SIZE, cudaHostAllocDefault));
        HANDLE_ERROR (cudaHostAlloc((void**)&foo2, sizeof(Complex)*FILTER_W_SIZE, cudaHostAllocDefault));
        HANDLE_ERROR (cudaMemcpy (foo2, D_X, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyDeviceToHost));
        for (int i=0; i < FILTER_W_SIZE; ++i)
            foo[i] = foo2[i].x;
        writeToFileOutput(&foo);
        getchar();

        printf("\n\nInput D_PLPIR\n");
        HANDLE_ERROR (cudaMemcpy (foo2, D_PLPIR, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyDeviceToHost));
        for (int i=0; i < FILTER_W_SIZE; ++i)
            foo[i] = foo2[i].x;
        writeToFileOutput(&foo);
        getchar();
    }



    // residual convolution //
    if (cufftExecC2C(PLPResidual, (cufftComplex *)D_X, (cufftComplex *)D_X_O, CUFFT_FORWARD) != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform forward FFT of padded signal\n");
    if (cufftExecC2C(PLPResidual, (cufftComplex *)D_PLPIR, (cufftComplex *)D_PLPIR_O, CUFFT_FORWARD) != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform forward FFT of padded Filter Kernel\n");

    // Multiply the coefficients together and normalize the result
    ComplexPointwiseMulAndScale<<<32, 256>>>(D_X_O, D_PLPIR_O, FILTER_W_SIZE, 1.0f / FILTER_W_SIZE);

    HANDLE_ERROR (cudaMemcpy (D_Autocorr, D_X_O, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyDeviceToDevice));

    FFTAutocorr <<<32, 256>>> (D_Autocorr, FILTER_W_SIZE);

    // Transform signal back
    if (cufftExecC2C(PLPResidual, (cufftComplex *)D_X_O, (cufftComplex *)D_X_O, CUFFT_INVERSE) != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform inverse FFT of convolved spectrum\n");

    if (cufftExecC2C(PLPResidual, (cufftComplex *)D_Autocorr, (cufftComplex *)D_Autocorr, CUFFT_INVERSE) != CUFFT_SUCCESS)
        fprintf(stderr, "FAILED to perform inverse FFT of autocorr signal\n");

    // Copy output from device memory to host
    HANDLE_ERROR(cudaMemcpy(Output, D_X_O, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyDeviceToHost));

    if (mynum == 100)
    {
        printf("\n\nOutput\n");
        HANDLE_ERROR (cudaMemcpy (foo2, D_X_O, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyDeviceToHost));
        for (int i=0; i < FILTER_W_SIZE-512; ++i)
            foo[i] = foo2[i+offset].x;
        writeToFileOutput(&foo);
        getchar();
    }

    for (int i=0; i < FILTER_W_SIZE-512; ++i)
    {
        H_OutputSig[i] = Output[i+offset].x;
    }

    // Copy output from device memory to host
    HANDLE_ERROR(cudaMemcpy(Output, D_Autocorr, sizeof(Complex)*FILTER_W_SIZE, cudaMemcpyDeviceToHost));

    if (cudaThreadSynchronize () != cudaSuccess)
        printf ("SOMETHING WENT WRONG!\n");

    for (int i=0; i < (LP_ORDER+1); ++i)
    {
        H_Autocorr[i] = Output[i].x/M;


    }

    HANDLE_ERROR (cudaFreeHost(Output));
    HANDLE_ERROR (cudaFree (D_Autocorr));

    return;
}

// function copied from elsewhere
void chkCudaReturn(cudaError_t err, unsigned int myErrLoc)
{
    if (err != cudaSuccess)
    {
        printf("\a\a\n***ERROR CUDA ERROR %u\n", myErrLoc);
        printf("Error Val %u\n", err);
        printf("%s\n", cudaGetErrorString(err));
    }
}

// function copied from elsewhere
void printMemUsage ()
{
    size_t free_byte;
    size_t total_byte;
    cudaError_t cuda_status = cudaMemGetInfo( &free_byte, &total_byte );
    if ( cudaSuccess != cuda_status ){
        printf("Error: cudaMemGetInfo fails, %s \n", cudaGetErrorString(cuda_status) );
        exit(1);
    }

    double free_db = (double)free_byte;
    double total_db = (double)total_byte;
    double used_db = total_db - free_db;
    printf("GPU memory usage:\t used = %3.2f MB, free = %3.2f MB, total = %3.2f MB\n",
        used_db/1024.0/1024.0, free_db/1024.0/1024.0, total_db/1024.0/1024.0);

    return;
}


////////////////////////////////////////////////////////////////////////////////
// Complex operations Kernels
// FUNCTIONS BELOW ALL EXECUTE ON THE GPU
////////////////////////////////////////////////////////////////////////////////

// Complex addition
static __device__ __host__ inline Complex ComplexAdd(Complex a, Complex b)
{
    Complex c;
    c.x = a.x + b.x;
    c.y = a.y + b.y;
    return c;


}

// Complex scale
static __device__ __host__ inline Complex ComplexScale(Complex a, float s)
{
    Complex c;
    c.x = s * a.x;
    c.y = s * a.y;
    return c;
}

// Complex multiplication
static __device__ __host__ inline Complex ComplexMul(Complex a, Complex b)
{
    Complex c;
    c.x = a.x * b.x - a.y * b.y;
    c.y = a.x * b.y + a.y * b.x;
    return c;
}

// Complex conjugate multiplication: conj(a) * b
static __device__ __host__ inline Complex ComplexConjMul(Complex a, Complex b)
{
    Complex c;
    c.x = a.x * b.x + a.y * b.y;
    c.y = -a.x * b.y + a.y * b.x;
    return c;
}


static __global__ void FFTAutocorr (Complex* a, int length)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = threadID; i < length; i += numThreads)
        a[i] = ComplexScale(ComplexConjMul (a[i],a[i]), FILTER_W_SIZE);  // when using already convolved in Fourier domain
}


// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex* a, const Complex* b, int size, float scale)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < size; i += numThreads)
        a[i] = ComplexScale(ComplexMul(a[i], b[i]), scale);
}

static __global__ void ResidualComplexPointwiseMulAndScale(Complex* a, const Complex* b, int size, float scale)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < size; i += numThreads)
        a[i] = ComplexScale(ComplexMul(a[i], b[i%FILTER_W_SIZE]), scale);


}

// 1 tap predictor residual
static __global__ void ResidualCalc (Complex* D_Residuals, float* D_ResidualMatrix)
{
    const int bulk_delay = threadIdx.x;
    const int frac_delay = blockIdx.x;
    int offset = K_MIN+bulk_delay;

    float dotp = 0.0f;

//#pragma unroll 16      // useful!
    for (int k=offset; k < (M-512+offset); ++k)
        dotp += (D_Residuals[frac_delay*i_INC+bulk_delay*j_INC+k].x*D_Residuals[frac_delay*i_INC+bulk_delay*j_INC+k].x);

    __syncthreads ();

    D_ResidualMatrix[frac_delay*(K_MAX-K_MIN)+bulk_delay] = dotp;
}


static __global__ void findOne_Tap_Coeffs (Complex* D_X, Complex* D_PolyOut, float* D_ResidualFilterTap, Complex* D_ResidualFilterVectors)
{
    const int bulk_delay = threadIdx.x;
    const int frac_delay = blockIdx.x;

    const int data_size = M - (bulk_delay+K_MIN);

    float autocorr_0M = 0.0f;
    float autocorr_MM = 0.0f;
    float tap = 0.0;
    int vectors_offset;
    int x_M_offset;
    int x_0_offset;


    x_0_offset=(bulk_delay+K_MIN);
    x_M_offset=POLYPHASE_BULK_DELAY+frac_delay;

    // autocorr dot prod
    for (int i=0; i < data_size; ++i)
        autocorr_0M += (D_X[x_0_offset+i].x*(D_PolyOut[x_M_offset+(i*8)].x)*8);
    for (int j=0; j< data_size; ++j)
        autocorr_MM += (D_PolyOut[x_M_offset+(j*8)].x*D_PolyOut[x_M_offset+(j*8)].x)*8*8;

    tap = autocorr_0M / autocorr_MM;
    D_ResidualFilterTap[(bulk_delay*8)+frac_delay] = tap;

    __syncthreads();

    vectors_offset = 1+(bulk_delay+K_MIN)-INTERP_HALF_ORDER;

    for (int e=vectors_offset; e < (vectors_offset+2*INTERP_HALF_ORDER); ++e) {
        D_ResidualFilterVectors[frac_delay*i_INC+bulk_delay*j_INC+e].x =
            -tap*D_IFw_Taps[frac_delay*(2*INTERP_HALF_ORDER)+e-(vectors_offset)];
    }


    return;
}

