2013 ‐‐ ENCM515 Assignment and Laboratory Details

Assignments are INDIVIDUAL and are intended to put you into a position to be able to answer

midterm exam, final exam and job interview questions.

Laboratory work is TEAM and is intended to allow you to work with somebody else to solve the issues of

designing, writing and testing high speed, highly optimized, programs for the SHARC in C++ and assembly code.

Working as a team also cuts down the total amount of writing and marking.

Smith 20 minute rule as applied to assignments. It is perfectly acceptable to discuss all the details of the

assignments with your laboratory partner and other members of the class. However no code (paper or

electronic) may be taken away from the conversation. You need to commit the idea to long term memory and

then do the assignment on your own.

I don’t even mind if you apply the special –WP option. The –WP, or will power, option means that you and your laboratory partner are permitted to work through all the assignment ideas together, provided you have the will power to delete all the jointly developed code, copy nothing from the laboratory code, and start your own design from scratch. Your partner is not going to be able to help you during the midterms and final.

Note that Midterm 1 (just after reading week) assumes that you have completed and

understand Assignment 1, Lab. 1 and Lab. 2

Note that Midterm 2 (end of March) assumes that you have completed and

understand Assignment 2 and Lab. 3

Reference material in midterm and final

You will be able to bring a SHARC reference sheet into the midterms and final.

We need to discuss whether it makes sense to allow you to bring copies of your own assignment code into the midterms and final. This would change the sort of questions that can be asked during the exam.

Hunting the Ranchlands Hum

In terms of course work – Look at implementing very sharp (narrow bandwidth), real time, digital filters in

the time domain and frequency domain

Time domain Averaging filter in C++, assembly, optimized assembly, optimized C++ will be discussed

in detail in class

Time domain FIR filter in C++, assembly, optimized assembly, optimized C++ ‐‐ basically your

responsibility to implement in the labs and assignments; discussed in class

No code writing needed. ‐‐ Frequency domain sharp filter using discrete Fourier transform

implemented using fast Fourier transform (FFT) algorithm.

o I will provide you with a (slow) version in C++ which you can allow the C++ compiler to

optimize – half the teams.

o Use the Analog Devices assembly language versions provided in the VDSP 21469 directory – half the teams.

o Share timing results across teams

No code writing needed. ‐‐ Half the lab teams will replace their FIR code with the FIR coprocessor

code available from VDSP 21469 directory. The other group of lab teams will replace their FFT code

with the FFT code available in VDSP 21469 directory.

o The idea is: Get your FIR code working on one channel and the coprocessor working in

parallel.

o We may do this during tutorial time rather than lab time.

No code writing needed. NO MARKS ‐‐ Pure personal satisfaction so that we can say – we have a

DSP cell‐phone application running. Discussed in class at the end of January

o Move the C++ FIR code (or FFT code if faster) over onto a pre‐existing cell phone application,

and make it run.

I have a windows mobile 5 application running

If you have Apple or Android experience, try it out there

Possibility of turning this into a class magazine article and making some money –

looks good on grad scholarship and industrial CVs

o Maximum of two hours to make it run, can work in teams of 4.

Alternately, somebody might want to take this on as a course project and let

everybody else have the code. Possibility of turning this into a class magazine

article and making some money – looks good on grad scholarship and industrial CVs

Project requirements

I have hum recordings available. However if you are interested, I have some pre‐amps and microphones

you can attach to cell‐phones and go and get recordings from your own house

Houses are noisy – hence recording will have noise as well as hum

o In particular you will pick up sounds at 60 Hz (light fitting etc) and 60 Hz harmonics (120 Hz, 180

Hz and the like)

We know that sometimes the Hum is at 42 Hz; but other times at 44 Hz (in different houses)

High‐resolution frequency analysis (10 or more seconds of data) of the recordings shows a 19.2 Hz signal and a 3rd harmonic (19.2 * 3 = 57.6 Hz).

o Sometimes both are there together; sometimes the 19.2 Hz is there and then the 57.6 Hz is there, oscillating in intensity.

o We need a sharp filter to distinguish between the 57.6 Hz and 60 Hz signals

NICE TO HAVE ‐‐ During the day the traffic noise makes the hum difficult to hear. So we can play back

the hum recording (or current hum signals) using the sharp filters

NICE TO HAVE – If other strong signals are present, provide an indication and perhaps store them for later analysis.

Basic analysis:

Sometimes easier to analyze algorithms in frequency domain, and apply the results in time domain.

We are sampling at 96000 Hz on 21469 processor with provided audio code

To get a frequency resolution of ½ Hz with FFT (distinguish between 57.6 Hz and 60 Hz), we need 2 seconds of data, or roughly 200,000 points (2 x 96000 = 192,000).

o I am guessing that to get a time domain FIR filter working we need a FIFO buffer of 200,000

coefficients (commonly called taps)
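The point count follows from the fact that an N‐point FFT of data sampled at a given rate has bins spaced (sample rate / N) apart, which is the reciprocal of the record length in seconds. A minimal sketch of that arithmetic is given here; the helper names are hypothetical, and only the 96000 Hz rate and the ½ Hz target come from the text above.

```cpp
#include <cassert>

// Hypothetical helper names; only the 96000 Hz sampling rate and the
// 1/2 Hz target resolution come from the handout.

// Samples needed so that neighbouring FFT bins are binSpacingHz apart
// (bin spacing = sample rate / number of points).
long PointsForResolution(double sampleRateHz, double binSpacingHz) {
    assert(sampleRateHz > 0.0 && binSpacingHz > 0.0);
    return static_cast<long>(sampleRateHz / binSpacingHz + 0.5);
}

// Duration of a record of that many samples, in seconds.
double RecordSeconds(long points, double sampleRateHz) {
    assert(sampleRateHz > 0.0);
    return static_cast<double>(points) / sampleRateHz;
}
```

Calling PointsForResolution(96000.0, 0.5) gives 192000 points, i.e. a 2 second record, which is the "roughly 200,000" figure used in the rest of this analysis.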

IN YOUR INDIVIDUAL ASSIGNMENT ONE REPORT, associated with Lab 0

o Do a 21469 resource chart analysis (6 lines) of a (time domain) FIR filter to see how many cycles

(un‐optimized) it takes to do one tap

You don’t need to have working code to do this analysis; the code will be discussed in

class and is also in that IEEE Micro article on the web‐page and discussed in detail on the

ECE‐ADI project web pages

o Work out how many cycles it would need to perform a 200,000‐tap FIR filter if you can find all possible optimization improvements

That’s add, multiply, two memory fetches on both 21469 ALUs (dual mode) in one

instruction cycle if using hardware loop – tackled next week and also done by compiler

o Assume using floating point operations – still single cycle on this processor

o Prove that this algorithm will not give real time performance at 96000 Hz sampling rate on a

500 MHz 21469 processor. (What is the actual processor clock rate of the 21469?)
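One possible way to set up the real‐time check is sketched below. It assumes the best case described above (one tap per ALU per cycle, so two taps per cycle in dual mode, with no loop overhead) and uses the 500 MHz figure quoted in the question, not a verified clock rate. The function names are hypothetical.

```cpp
#include <cassert>

// Cycles available between two consecutive audio samples.
double CyclesPerSample(double clockHz, double sampleRateHz) {
    assert(clockHz > 0.0 && sampleRateHz > 0.0);
    return clockHz / sampleRateHz;
}

// Best-case cycles for one FIR output, ASSUMING tapsPerCycle taps
// complete every cycle and ignoring loop set-up and FIFO update costs.
double BestCaseFirCycles(long taps, int tapsPerCycle) {
    assert(taps > 0 && tapsPerCycle > 0);
    return static_cast<double>(taps) / tapsPerCycle;
}

// True if one FIR output fits inside one sample period.
bool MeetsRealTime(long taps, int tapsPerCycle,
                   double clockHz, double sampleRateHz) {
    return BestCaseFirCycles(taps, tapsPerCycle)
           <= CyclesPerSample(clockHz, sampleRateHz);
}
```

With these assumptions there are about 5208 cycles per sample at 96000 Hz, while 200,000 taps need about 100,000 cycles, so MeetsRealTime(200000, 2, 500e6, 96000.0) is false. A 1024‐tap filter needs about 512 cycles and fits.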

SO HOW CAN WE DO IT?

TIME DOMAIN APPROACH: Something like this

Develop the FIR code and test using EmbeddedUnit

Adjust number of taps to 1024 (FIFO buffer size)

Inside the audio processing routine – TTCOS_AddPremptiveTask( ) ‐‐ TTCOS interrupt service routine call‐

back function

o Grab one new value, put into ISR FIFO input array

o Perform FIR and put output into ISR FIFO input array

o Every 32 input samples, calculate the average of the last 32 filtered samples and put it in a new external FIFO array of size 256 (numbers may need adjusting)

Every time there are 64 new values in the external array, the ISR call back function launches a run‐once

TTCOS_AddTask (runs outside ISR – so can run slowly) which filters the output array using a second copy of

the FIR code.

o The output of this second FIR filter is a filtered version of a down sampled filtered version of the

original signal using just FIR code and averaging code routines
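The first stage of the time domain approach (FIR every sample, then one averaged value every 32 samples) might be sketched as follows. This is an off‐target illustration only: the class name and interface are hypothetical, std::vector is used for brevity rather than fixed ISR arrays, and no TTCOS calls are shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical first-stage processor: FIR filter every sample, then emit
// the mean of each block of filtered samples (down-sampling by blockSize).
class DecimatingFir {
public:
    explicit DecimatingFir(const std::vector<float>& coeffs,
                           std::size_t blockSize = 32)
        : coeffs_(coeffs), fifo_(coeffs.size(), 0.0f), blockSize_(blockSize) {
        assert(!coeffs.empty() && blockSize > 0);
    }

    // Feed one input sample. Returns true when a new down-sampled value
    // has been written to 'out'.
    bool Push(float sample, float& out) {
        // FIFO update: shift by one place (simple but slow; a circular
        // buffer would avoid the copy).
        for (std::size_t i = fifo_.size() - 1; i > 0; --i) fifo_[i] = fifo_[i - 1];
        fifo_[0] = sample;

        // FIR: sum of coefficient * delayed sample.
        float filtered = 0.0f;
        for (std::size_t i = 0; i < coeffs_.size(); ++i) filtered += coeffs_[i] * fifo_[i];

        // Accumulate; emit the block average once blockSize values are in.
        sum_ += filtered;
        if (++count_ < blockSize_) return false;
        out = sum_ / static_cast<float>(blockSize_);
        sum_ = 0.0f;
        count_ = 0;
        return true;
    }

private:
    std::vector<float> coeffs_;
    std::vector<float> fifo_;
    std::size_t blockSize_;
    float sum_ = 0.0f;
    std::size_t count_ = 0;
};
```

In the scheme above, the call‐back would store each emitted value in the external array and launch the second filtering task once 64 new values have arrived.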

It is this filtered output we analyse as follows

o The original filtering and down sampling give us a bandwidth of around 300 Hz

o On the down‐sampled filtered signal – we use a series of sharp FIR filters to detect signals in 18 to

20Hz range, 40 to 44 Hz range, 56 to 59 Hz range.

Matlab has programs to give us the filter coefficients automatically.

o Personally I would try the code out in Matlab (about 10 lines of code) to make sure that the

algorithm works before I move over to writing C++. We will discuss some possible code in class

FREQUENCY DOMAIN APPROACH: So much easier to understand – also FFT is faster (discussed in class)

Inside TTCOS_AddPremptiveTask( ) capture 2048 points into buffer1

o Launch a run‐once TTCOS_AddTask (not pre‐emptive) TASKA when buffer is full and start filling

buffer 2 inside ISR so don’t miss any audio values

TASKA must complete within 2048 x 1/ 96000 second (before next buffer is full)

o Do 2048 point FFT. There are now 2048 frequency points

o Keep frequency points associated with low frequencies (set everything else to zero) – that’s how simple filtering is in the frequency domain

o Do inverse FFT (IFFT) back to time domain

o Take every 128th point from the filtered signal and put it in a new array of size 256 (that’s the down‐sampling). Every time that new buffer is full, launch another run‐once TTCOS_AddTask, TASKB

TASKB is identical to TASKA (actually can be the same task, because TTCOS is a co‐operative scheduler) except that it runs on a different buffer

o Except that this code only runs the FFT part and not the IFFT part. We can just pick out the 19 Hz, 42

Hz, 57 Hz signals directly from the frequency information :‐)

o Again try it out in Matlab ‐‐ We will discuss some possible code in class
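The "keep the low bins, zero the rest, then take every 128th point" steps can be sketched without committing to any particular FFT routine. The helpers below are hypothetical and operate on a spectrum that some FFT has already produced. The only subtlety is that the mirrored (negative frequency) bins at the top of the array must be kept as well, otherwise the inverse FFT no longer returns a real signal.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Index of the FFT bin closest to freqHz for an n-point transform.
std::size_t BinForFrequency(double freqHz, double sampleRateHz, std::size_t n) {
    assert(sampleRateHz > 0.0 && freqHz >= 0.0);
    return static_cast<std::size_t>(freqHz * static_cast<double>(n) / sampleRateHz + 0.5);
}

// Low-pass filter in the frequency domain: keep bins 0..keepBins and the
// mirrored bins n-keepBins..n-1, and zero everything in between.
void KeepLowBins(std::vector<std::complex<float>>& spectrum, std::size_t keepBins) {
    const std::size_t n = spectrum.size();
    assert(2 * keepBins < n);
    for (std::size_t k = keepBins + 1; k < n - keepBins; ++k) spectrum[k] = 0.0f;
}

// Down-sampling: keep every 'step'-th sample of the filtered signal.
std::vector<float> TakeEveryNth(const std::vector<float>& signal, std::size_t step) {
    assert(step > 0);
    std::vector<float> out;
    for (std::size_t i = 0; i < signal.size(); i += step) out.push_back(signal[i]);
    return out;
}
```

For a 2048‐point transform at 96000 Hz the bins are 46.875 Hz apart, and taking every 128th point gives 16 values per buffer at an effective rate of 750 Hz. A 256‐point transform at 750 Hz has bins about 2.9 Hz apart, which is one reason the buffer sizes quoted above may need adjusting.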

NOTE: ALL THIS IS DONE INSIDE TTCOS ‐‐ Operating system with just 3 pieces of code – Average, FIR

and FFT
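The buffer hand‐off used in the frequency domain approach (fill buffer 1 in the ISR, hand it to a task, carry on filling buffer 2) is a double or "ping‐pong" buffer. A minimal sketch, with hypothetical names and no TTCOS calls, might look like this:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical double ("ping-pong") buffer: the ISR fills one half while
// the background task processes the other.
template <std::size_t N>
class PingPongBuffer {
public:
    // Called once per sample from the ISR. Returns a pointer to the buffer
    // that has just filled (ready for the FFT task), or nullptr otherwise.
    const float* Push(float sample) {
        float* active = buffers_[activeIndex_];
        active[fill_++] = sample;
        if (fill_ < N) return nullptr;
        fill_ = 0;
        activeIndex_ ^= 1;  // Next samples go into the other buffer.
        return active;
    }

private:
    float buffers_[2][N] = {};
    std::size_t fill_ = 0;
    int activeIndex_ = 0;
};

// Time available to process one full buffer before the next one is full.
double BufferDeadlineSeconds(std::size_t points, double sampleRateHz) {
    assert(sampleRateHz > 0.0);
    return static_cast<double>(points) / sampleRateHz;
}
```

For 2048 points at 96000 Hz the deadline is about 21.3 ms. The returned buffer is only safe to read until the other buffer fills, which is exactly the timing constraint placed on TASKA.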

Assignment 1 due: Includes audio demos and testing and a discussion on your timing report. Due 6th Feb.

Combine your team timing results with your individual timing results. Include documented code and a

comparison between the expected performance (theoretical) and actual performance.

There is a single Lab. 1 / Lab. 2 report – common to both people in the team. Includes audio demos and testing.

This is due 2 weeks after Lab. 2 is held (excluding reading week). Include documented code and a comparison

between the expected performance (theoretical) and actual performance – include resource charts and the like.

Details to come.

Assignment 2 (Individual): Use the optimization techniques learnt during Lab. 1, Lab. 2 and Lab. 3 to develop

DSP code that works efficiently on “blocks” of data (double buffering and circular buffers) and executes

“outside” of ISR. Includes audio demos and testing and a discussion of your timing report. Include documented

code and a comparison between the expected performance (theoretical) and actual performance.

There is a single Lab. 3 / Lab. 4 report – common to both people in the team. Includes audio demos and testing

and timing report. This is due 1 week after Lab. 4 is held. Include documented code and a comparison between

the expected performance (theoretical) and actual performance.

OVERVIEW – More details are given later.

Introductory Laboratory (Team) / Assignment 1 (Individual): Basic practice with SHARC C++, assembly code

combined with real‐time validation for a simple routine (Averaging). Includes audio demos and testing and a

discussion on your timing report. Due 6th Feb. Combine your team timing results with your individual timing

results. Include documented code

Lab. 1 (Team): – Introduction to a structured process for developing DSP algorithms and comparing

performance of C (debug and release mode) and assembly code development (software loops and hardware

loop optimization only).

Lab. 2 (Team): – Assembly code optimization including dual data access and comparing performance when

‘instruction cache thrash’ does not occur and does occur. C++ optimization to include ‘function in‐lining’ and

demonstrating the use of the SHARC profiler to identify ‘where the code spends most of its time’. This approach

identifies the code that you will spend most of your time optimizing.

Lab. 3 (Team): Design and then implement further optimizations of your FIR algorithm. Compare using C++ DSP language extensions (minimum of 4 extensions; use mixed mode (screen capture) to demonstrate activation of SIMD mode, ‘software pipelining’, parallel memory accesses and ‘COMPUTE’ instructions (simultaneous add and multiply operations)) when in debug and release mode. Design and implement an optimized assembly language version demonstrating ‘software pipelining’, parallel memory accesses, ‘COMPUTE’ instructions and SIMD mode. Compare and contrast performance.

Lab. 4 (Team): Using the example code available with VDSP, compare the performance of the hardware ‘FIR’

DSP accelerator on the SHARC with your ‘best’ FIR implementation using the C++ compiler and your ‘best’ FIR

implementation in assembly code.

DETAILS ‐‐ We are trying to demonstrate that we understand why the way we write our C++ code and

assembly code for an implementation impacts the speed of the code.

IMPORTANT: If we really wanted to use these techniques ‘properly’ we would need complex algorithms to

implement. Each lab would take weeks. So instead we will demonstrate the ideas on simple algorithms where

the techniques are probably over‐kill, but at least do‐able in the time available.

IMPORTANT – Most of the changes involve cut‐and‐paste of existing code. However it is very important that

each C++ version of the function Average( ) or FIR( ) is in a different file, with different data buffer names and loop

variable names. If this is not done, then the C++ compiler will use the wrong information and the results will be

meaningless. The details in Assignment 1 will show how to solve this effect.

IMPORTANT: Remember – timing tests must run with long averaging filters to match theory

For example

Step 1: ‐‐ we write a simple version of the averaging filter discussed in class – AveragingFilter_UnoptimisedCPP( )

and place in a file AveragingFilter_UnoptimisedCPP.cpp. We test, and time and use inside TTCOS with the

compiler running in debug mode

Step 2A: ‐‐ Make a copy of the file AveragingFilter_UnoptimisedCPP.cpp as AveragingFilter_OptimisedCPP.cpp

and change the name of the function to AveragingFilter_OptimisedCPP( ) – add to the project

Step 2B: ‐‐ Make copies of all the existing tests for AveragingFilter_UnoptimisedCPP( ) (same test file) and

rewrite the tests to call AveragingFilter_OptimisedCPP( ). Should give the same results as Step2A – simply a

test to show you have done the code renaming correctly

Step 2C: ‐‐ Important: Right click on AveragingFilter_OptimisedCPP.cpp – select file options – select ‘custom’ (or it may be labelled ‘file specific’), then select ‘optimize mode’. If you run the timing tests now, you should get much faster performance that matches your theoretical results. – I will show you how to do this step during the lab.

Step 2D: Use mixed mode to see the optimized assembly code the compiler generated – see if there are any

special instructions that we need to learn to use

Step 3A: ‐‐ Make copies of all the existing tests for AveragingFilter_UnoptimisedCPP( ) (same test file) and rewrite the tests to call AveragingFilter_UnoptimisedASM( ). Build a stub version of the function AveragingFilter_UnoptimisedASM( ) and place it in a file called AveragingFilter_UnoptimisedASM.asm. If you don’t follow this instruction exactly, you will find everything breaks. Explain why this is a problem as part of your Assignment 1 report.

Step 3B: Compile and link the new tests and stub. This shows you have the right names.

Step 3C: Run the assembly code tests – they should fail, but at least this shows that you have set the code up

correctly.

Step 3D: Write the unoptimized assembly code and run tests and timing. WAIT on the optimized assembly code.

Step 4A: Put algorithm into resource chart – find out what the theoretical unoptimized and optimized code

speeds should be.

Step 4B: Short report on timing (compare theory and actual; your assembly code and compiler code). What techniques does the C compiler show you need to learn?

Lab. 0: (Team): Demonstrate you can use VDSP, TTCOS and EmbeddedUnit on C++ and assembly code environments. Set up things for the hum analysis. Design and implement a simple function in C++ and assembly code. Using the testing framework, show that all versions of the algorithm work correctly. Compare and contrast the speed performance of the algorithm in C++ (debug and release) and assembly code (un‐optimized and using software loop) with the ‘expected’ speed for the way you wrote your C code and assembly code, assuming that each SHARC instruction executes in 1 cycle. Show that the averaging function processing audio data works with data lengths of 256 to 1024. Definition of working: the algorithm completes execution in 10% of the time for 1 audio sample period (1/41000).

Requirements:

Demonstrate EmbeddedUnit Knowledge – Get code to work and generate timing (compare to theory) for

unoptimized and optimized C++ code)

A) Set up a C++ task void Mock_42Hz_HumAnalysisAlgorithmCPP(void) with the following properties.

1) Each time we call (enter) the code there is a probability PEXISTS that the 42 Hz signal exists; if so, turn on the 42 Hz warning flag (external global variable semaphore present42Hz – use typedef unsigned int semaphore). Use VDSP help to find out how to use rand( ), which is a standard part of the C++ library.

2) Each time we enter the code – we will turn off the warning signal if

X seconds of Hum has passed and there is a probability PGONE that the signal no longer exists

3) Should be no more than about 10 lines of code; and some significant part of the code should be in a

subroutine. (Having stuff in a subroutine gives us the opportunity to see how the C++ compiler can /

can’t optimize subroutine calls)

4) Call the routine TTCOS2011_WriteLED(int value) with a value 1 to turn on LED1 if hum present. You

can’t see this happening until we burn the flash memory on an external board. However you can

check that it occurs by calling unsigned int TTCOS2011_ReadLED( ).

5) Once the code is working, duplicate the code so that you have mock analysis tasks for the 19 Hz, 44 Hz and 57 Hz hums (same code; different names and probabilities)
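A possible shape for the mock task in part A is sketched below so that it can be tried off‐target. The values of PEXISTS and PGONE, the use of a call count to stand in for "X seconds of hum", the LED stubs, and the use of 0 to turn the LED off are all assumptions made for illustration; on the board the real TTCOS2011 routines would be linked instead of the stubs.

```cpp
#include <cassert>
#include <cstdlib>

typedef unsigned int semaphore;
semaphore present42Hz = 0;

// ASSUMED values for illustration; the handout does not fix them.
const float PEXISTS = 0.01f;
const float PGONE = 0.05f;
const unsigned int MIN_CALLS_ON = 100;  // stands in for "X seconds of hum"

// Off-target STUBS for the TTCOS LED routines named in the handout.
static unsigned int ledState = 0;
void TTCOS2011_WriteLED(int value) { ledState = static_cast<unsigned int>(value); }
unsigned int TTCOS2011_ReadLED() { return ledState; }

// Subroutine: returns true with the given probability
// (0.0 never happens, 1.0 always happens).
bool HappensWithProbability(float probability) {
    assert(probability >= 0.0f && probability <= 1.0f);
    // rand() / (RAND_MAX + 1.0) lies in the range [0, 1).
    return (std::rand() / (RAND_MAX + 1.0)) < probability;
}

void Mock_42Hz_HumAnalysisAlgorithmCPP(void) {
    static unsigned int callsSinceOn = 0;
    if (present42Hz == 0) {
        if (HappensWithProbability(PEXISTS)) {
            present42Hz = 1;        // hum has appeared
            callsSinceOn = 0;
            TTCOS2011_WriteLED(1);  // LED1 on
        }
    } else if (++callsSinceOn >= MIN_CALLS_ON && HappensWithProbability(PGONE)) {
        present42Hz = 0;            // hum has gone
        TTCOS2011_WriteLED(0);      // ASSUMED: 0 turns the LED off
    }
}
```

Keeping the random decision in its own subroutine matches requirement 3 and makes it easy to see in mixed mode whether the compiler in‐lines the call in release mode.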

B) Set up a C++ task void SetUpSineArrayCPP(float *aSineWave, int arrayLength) which fills the array aSineWave with exactly one sine wave cycle in arrayLength points. (This allows us to actually set up arrays in different parts of memory so we can practice some of the fancy go‐fast techniques.)
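A minimal sketch of part B, using the signature given above. The only design decision is that point i holds sin(2π i / arrayLength), so the table covers one cycle without repeating the first point at the end.

```cpp
#include <cassert>
#include <cmath>

// Fill aSineWave with exactly one sine cycle spread over arrayLength points.
void SetUpSineArrayCPP(float *aSineWave, int arrayLength) {
    assert(aSineWave != nullptr && arrayLength > 0);
    const double twoPi = 2.0 * 3.14159265358979323846;
    for (int i = 0; i < arrayLength; ++i) {
        // i / arrayLength runs from 0 up to (but not including) one cycle.
        aSineWave[i] = static_cast<float>(std::sin(twoPi * i / arrayLength));
    }
}
```

Because the last point stops one step short of a full cycle, stepping through the table repeatedly gives a continuous waveform with no doubled sample at the wrap‐around.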

C) Set up a C++ task void Mock_GenerateSonic_AlarmsCPP(void)

Each time we enter –

a. copy input audio channel 1 left to output channel 1 left

b. Use the code from class to generate an average, ClassAverageFilterCPP( ), of channel 1 left input and output it to channel 1 right

c. If semaphore present42Hz is equal to 1, grab values from SineWaveArray1 and output to channel 2 left output. Move through the array so that if this routine was called at 96000 Hz, it would generate a 200 Hz tone. Watch out for volume issues, as when we use the code for real we want to be able to hear at the end of the course

d. NICE TO HAVE ‐‐ More pleasant if we have overtones – meaning generate 200 Hz with a little (1/4) of the first overtone (400 Hz) plus 1/16 of the second overtone (third harmonic)

e. Once parts c and d are working – duplicate for semaphore present57Hz, which grabs values from SineWaveArray2 and outputs a 275 Hz tone to the right output channel
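Stepping through a one‐cycle table to produce a given tone comes down to choosing the step per call: step = tone frequency x table length / calling rate. The sketch below uses a hypothetical class name and nearest‐lower‐index lookup with no interpolation; it also adds the overtones from part d by reading the same table at 2 and 3 times the phase.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical table-lookup tone generator (not the code from class).
class TableTone {
public:
    TableTone(const float* table, std::size_t length,
              double toneHz, double sampleRateHz)
        : table_(table), length_(length),
          step_(toneHz * static_cast<double>(length) / sampleRateHz) {
        assert(table != nullptr && length > 0);
        assert(sampleRateHz > 0.0 && toneHz > 0.0 && toneHz < sampleRateHz);
    }

    // Next output sample: fundamental + 1/4 of the first overtone
    // + 1/16 of the second overtone, scaled by 'volume'.
    float Next(float volume) {
        const float value = Lookup(phase_)
                          + 0.25f * Lookup(2.0 * phase_)
                          + 0.0625f * Lookup(3.0 * phase_);
        phase_ += step_;
        if (phase_ >= static_cast<double>(length_)) phase_ -= static_cast<double>(length_);
        return volume * value;
    }

private:
    // Truncate the position to a table index, wrapping past the end.
    float Lookup(double position) const {
        return table_[static_cast<std::size_t>(position) % length_];
    }

    const float* table_;
    std::size_t length_;
    double step_;
    double phase_ = 0.0;
};
```

With a 480‐point table called at 96000 Hz, a 200 Hz tone needs a step of exactly 1 point per call. The summed waveform can peak above the table amplitude (up to 1 + 1/4 + 1/16), so the volume factor should leave some headroom.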

D) Lab 0: Team – using the assembly code given in class, generate an assembly language version of the average

routine ClassAverageFilterASM( );

Assignment 1: Individual – Generate the (un‐optimized ) assembly code for the

Mock_GenerateSonic_AlarmsASM( )

Assignment 1: Individual – Using the code from class, generate a C++ routine float FIRfilter(float *FIFO, float *FIRcoffs, int FIRlength) which uses pointer arithmetic to access external arrays and returns the filtered result. NOTE: passing the external arrays as subroutine parameters allows us to use one set of filter code for many different channels (if careful)

Assignment 1: Individual – Using the code from class, generate a C++ routine float FIRfilter(float FIFO[ ], float FIRcoffs[ ], int FIRlength) which uses array indexing to access external arrays and returns the filtered result.
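The two access styles asked for in these FIR routines can be compared side by side. The sketch below is not the code from class: it only computes the sum of products and leaves the FIFO update to the caller, and the _Pointer and _Index suffixes are hypothetical names added so that both versions can live in one program.

```cpp
#include <cassert>

// Pointer arithmetic version: walk both arrays with post-increment.
float FIRfilter_Pointer(float *FIFO, float *FIRcoffs, int FIRlength) {
    assert(FIFO != nullptr && FIRcoffs != nullptr && FIRlength > 0);
    float sum = 0.0f;
    for (int i = 0; i < FIRlength; ++i) {
        sum += (*FIFO++) * (*FIRcoffs++);
    }
    return sum;
}

// Array index version: same arithmetic, addressed as FIFO[i].
float FIRfilter_Index(float FIFO[], float FIRcoffs[], int FIRlength) {
    assert(FIFO != nullptr && FIRcoffs != nullptr && FIRlength > 0);
    float sum = 0.0f;
    for (int i = 0; i < FIRlength; ++i) {
        sum += FIFO[i] * FIRcoffs[i];
    }
    return sum;
}
```

Both routines return the same value for the same data, so one set of tests can be copied from one version to the other; any difference should show up only in the timing and in the mixed mode listing.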

TTCOS demonstration

1) As you get the code working, then port to the TTCOS environment and try it for real

2) Lab 0 – just need to demo that tests work and audio works

3) Because we are pushed for time in getting ready for Lab 1

Assignment 1 – hand in documented code showing that the tests work, and demo in the next lab – due by 6th Feb

Timing analysis (theory versus practical) we will make due the following week (13th Feb)

Make the timing analysis for the C++ FIR filter part of Lab. 1

OTHER USEFUL INFORMATION:

Pseudo code for C++ and assembly code

Define the length of the filter as a globally defined constant in both C++ and assembly code.

e.g.

#define PREDEFINED_N 256

float FIFO_Simple_Average[ ]

void Simple_Average(void) {
    FIFO_Simple_Average filter update
    Add new value
    for (I = 0; I < PREDEFINED_N; I++)
        do average of FIFO_Simple_Average
}

Express timing in terms of processing cycles / point and not total execution time. Do the timing both with N a nice number (e.g. a power of 2) and with N not a nice number (e.g. 241). Expected difference – C++ release slightly better for N a nice number.

Place tests in 1 file, place averaging filter in a different file. Discuss (compare and contrast) the speed impact of passing parameters in C++ and assembly code, first with code written with no optimization in C++ (debug mode) and then allowing the C++ compiler to optimize the code (release mode).

Separate file ‐‐ tests

#define PREDEFINED_N_V1 256 and 243
float FIFO_AveragePass_Parameters_V1[1024]
float AveragePass_Parameters_V1(float newValue, float FIFO[ ], int FIFO_length);

TEST( ) {
    float inValue = XXX;
    Time a 1000 calls of
        AveragePass_Parameters_V1(inValue, FIFO_AveragePass_Parameters_V1, PREDEFINED_N_V1)
    Test (CHECK) code works
}

//******************

#define PREDEFINED_N_V2 256 and 243
float FIFO_AveragePass_Parameters_V2[1024]
float AveragePass_Parameters_V2(float newValue, float *FIFO, int FIFO_length);

TEST( ) {
    float inValue = XXX;
    Time a 1000 calls of
        AveragePass_Parameters_V2(inValue, FIFO_AveragePass_Parameters_V2, PREDEFINED_N_V2)
    Test (CHECK) code works
}

Separate file ‐‐ index notation for accessing array

float AveragePass_Parameters_V1(float newValue, float FIFO[ ], int FIFO_length) {
    FIFO filter update
    Add new value
    for (I = 0; I < FIFO_length; I++)
        do average of FIFO[ ]
    // Use index notation for accessing array
}

Separate file ‐‐ post‐increment pointer notation for accessing array

float AveragePass_Parameters_V2(float newValue, float *FIFO, int FIFO_length) {
    FIFO filter update
    Add new value
    for (I = 0; I < FIFO_length; I++)
        do average of FIFO[ ]
    // Use post‐increment pointer notation for accessing array
}
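One way the two pseudo code versions above could be fleshed out is sketched below. The FIFO update strategy (shift everything by one place) and the V2 function name are assumptions for illustration; the averaging loops show the index notation and the post‐increment pointer notation being compared.

```cpp
#include <cassert>

// V1: index notation for accessing the array.
float AveragePass_Parameters_V1(float newValue, float FIFO[], int FIFO_length) {
    assert(FIFO != nullptr && FIFO_length > 0);
    // FIFO update: shift by one place; the oldest value falls off the end.
    for (int i = FIFO_length - 1; i > 0; --i) FIFO[i] = FIFO[i - 1];
    FIFO[0] = newValue;  // add new value
    float sum = 0.0f;
    for (int i = 0; i < FIFO_length; ++i) sum += FIFO[i];
    return sum / FIFO_length;
}

// V2: pointer notation for accessing the array.
float AveragePass_Parameters_V2(float newValue, float *FIFO, int FIFO_length) {
    assert(FIFO != nullptr && FIFO_length > 0);
    // FIFO update, walking a pointer down from the end of the array.
    float *dst = FIFO + FIFO_length;
    for (int i = FIFO_length - 1; i > 0; --i) {
        --dst;
        *dst = *(dst - 1);
    }
    *FIFO = newValue;  // add new value
    // Average using post-increment pointer access.
    float sum = 0.0f;
    float *p = FIFO;
    for (int i = 0; i < FIFO_length; ++i) sum += *p++;
    return sum / FIFO_length;
}
```

Both versions should pass identical tests. Each keeps its own FIFO buffer, as in the test file layout above, so that timing one version does not disturb the data used by the other.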

Expectations for timing: This is discussed in the articles ‘The byte of the SHARC’ which can be downloaded from the

ENCM515 web site

C debug mode – index access to array will be slower than post‐increment access

C release mode – if the compiler ‘is good’, it will switch automatically to the faster way of accessing the array (meaning it might do indexing for the FIFO update and increment mode for the averaging). Examine the code in MIXED mode to see how well the compiler is optimizing

Assembly code – I would expect that your code will be slower by around 4 * N cycles (4 cycles per point) if you code using index mode compared with using post‐increment mode. NOTE: most of your code will be cut‐and‐paste, so check that your code works

Lab. 1 (Team): – Introduction to a structured process for developing DSP algorithms and comparing

performance of C (debug and release mode) and assembly code development (software loops and hardware

loop optimization only).

Basically a repeat of the ideas for assignment 1 except that you are using an FIR filter rather than an averaging

filter

Expectation: code slower by 2 * N cycles (2 cycles / point) than averaging, as you need an additional memory access and a multiply

Lab. 2 (Team): – Assembly code optimization including dual data access and comparing performance when

‘instruction cache thrash’ does not occur and does occur. C++ optimization to include ‘function in‐lining’ and

demonstrating the use of the SHARC profiler to identify ‘where the code spends most of its time’. This approach

identifies the code that you will spend most of your time optimizing.

We will demonstrate code in‐lining in class. Expectation: reduce time for 1000 calls to be around 1000 * time for subroutine (1000 * 20 cycles) plus time to set up and use parameters (1000 * 5 cycles). This is very important if subroutines that are called from inside a loop are short (FIR length small).

Doing dual access (dm and pm in C++, and I4, I12 in assembly code) will improve speed by around N cycles (1

cycle / point) but may be faster than that as the C compiler might start putting instructions in delay slots (out of

order execution)

Lab. 3 (Team): Design and then implement further optimizations of your FIR algorithm. Compare using C++ DSP language extensions (minimum of 4 extensions; use mixed mode (screen capture) to demonstrate activation of SIMD mode, ‘software pipelining’, parallel memory accesses and ‘COMPUTE’ instructions (simultaneous add and multiply operations)) when in debug and release mode. Design and implement an optimized assembly language version demonstrating ‘software pipelining’, parallel memory accesses, ‘COMPUTE’ instructions and SIMD mode. Compare and contrast performance.

Lab. 4 (Team): Using the example code available with VDSP, compare the performance of the hardware ‘FIR’

DSP accelerator on the SHARC with your ‘best’ FIR implementation using the C++ compiler and your ‘best’ FIR

implementation in assembly code.
