2013 ‐‐ ENCM515 Assignment and Laboratory Details
Assignments are INDIVIDUAL and are intended to put you into a position to be able to answer
midterm exam, final exam and job interview questions.
Laboratory work is TEAM and is intended to allow you to work with somebody else to solve the issues of
designing, writing and testing high speed, highly optimized, programs for the SHARC in C++ and assembly code.
Team work also cuts down the total amount of writing and marking.
Smith 20 minute rule as applied to assignments. It is perfectly acceptable to discuss all the details of the
assignments with your laboratory partner and other members of the class. However no code (paper or
electronic) may be taken away from the conversation. You need to commit the idea to long term memory and
then do the assignment on your own.
I don’t even mind if you apply the special –WP option. The –WP, or will power, option means
that you and your laboratory partner are permitted to work through all the assignment ideas together, provided
you have the will power to delete all the jointly developed code, copying nothing from the laboratory code, and
start your own design from scratch. Your partner is not going to be able to help you during midterm and finals.
Note that Midterm 1 (just after reading week) assumes that you have completed and
understand Assignment 1, Lab. 1 and Lab. 2
Note that Midterm 2 (end of March) assumes that you have completed and
understand Assignment 2 and Lab. 3
Reference material in midterm and final
You will be able to bring a SHARC reference sheet into the midterms and final.
We need to discuss whether it makes sense to allow you to bring copies
of your own assignment code into the midterm and final. This would change the sort of
questions that can be asked during the exam.
Hunting the Ranchlands Hum
In terms of course work – Look at implementing very sharp (narrow bandwidth), real time, digital filters in
the time domain and frequency domain
Time domain Averaging filter in C++, assembly, optimized assembly, optimized C++ will be discussed
in detail in class
Time domain FIR filter in C++, assembly, optimized assembly, optimized C++ ‐‐ basically your
responsibility to implement in the labs and assignments; discussed in class
No code writing needed. ‐‐ Frequency domain sharp filter using discrete Fourier transform
implemented using fast Fourier transform (FFT) algorithm.
o I will provide you with a (slow) version in C++ which you can allow the C++ compiler to
optimize – half the teams.
o Use the Analog Devices assembly language versions provided in the VDSP 21469 directory – half
the teams.
o Share timing results across teams
No code writing needed. ‐‐ Half the lab teams will replace their FIR code with the FIR coprocessor
code available from VDSP 21469 directory. The other group of lab teams will replace their FFT code
with the FFT code available in VDSP 21469 directory.
o The idea is: Get your FIR code working on one channel and the coprocessor working in
parallel.
o We may do this during tutorial time rather than lab time.
No code writing needed. NO MARKS ‐‐ Pure personal satisfaction so that we can say – we have a
DSP cell‐phone application running. Discussed in class at the end of January
o Move the C++ FIR code (or FFT code if faster) over onto a pre‐existing cell phone application,
and make it run.
I have a Windows Mobile 5 application running
If you have Apple or Android experience, try it out there
Possibility of turning this into a class magazine article and making some money –
looks good on grad scholarship and industrial CVs
o Maximum of two hours to make it run, can work in teams of 4.
Alternatively, somebody might want to take this on as a course project and let
everybody else have the code. Possibility of turning this into a class magazine
article and making some money – looks good on grad scholarship and industrial CVs
Project requirements
I have hum recordings available. However if you are interested, I have some pre‐amps and microphones
you can attach to cell‐phones and go and get recordings from your own house
Houses are noisy – hence recording will have noise as well as hum
o In particular you will pick up sounds at 60 Hz (light fitting etc) and 60 Hz harmonics (120 Hz, 180
Hz and the like)
We know that sometimes the Hum is at 42 Hz; but other times at 44 Hz (in different houses)
High resolution frequency analysis (10 or more seconds of data) of the recordings shows a 19.2 Hz signal and a 3rd
harmonic (19.2 * 3 = 57.6 Hz).
o Sometimes both are there together; sometimes the 19.2 Hz signal is there, then the 57.6 Hz signal, oscillating in intensity.
o We need a sharp filter to distinguish between 57.6 Hz and 60 Hz signals
NICE TO HAVE ‐‐ During the day the traffic noise makes the hum difficult to hear. So we can play back
the hum recording (or current hum signals) using the sharp filters
NICE TO HAVE – If other strong signals there – provide an indication, perhaps store for later analysis.
Basic analysis:
Sometimes easier to analyze algorithms in frequency domain, and apply the results in time domain.
We are sampling at 96000 Hz on 21469 processor with provided audio code
To get a frequency resolution of ½ Hz with an FFT (to distinguish between 57.6 Hz and 60 Hz), we need 2
seconds of data: 2 x 96000 = 192,000 points, or roughly 200,000 points.
o I am guessing that to get a time domain FIR filter working we need a FIFO buffer of 200,000
coefficients (commonly called taps)
IN YOUR INDIVIDUAL ASSIGNMENT ONE REPORT, associated with Lab 0
o Do a 21469 resource chart analysis (6 lines) of a (time domain) FIR filter to see how many cycles
(un‐optimized) it takes to do one tap
You don’t need to have working code to do this analysis; the code will be discussed in
class and is also in that IEEE Micro article on the web‐page and discussed in detail on the
ECE‐ADI project web pages
o Work out how many cycles it would need to perform a 200,000 tap FIR filter if you can find all
possible optimization improvements
That’s add, multiply, two memory fetches on both 21469 ALUs (dual mode) in one
instruction cycle if using hardware loop – tackled next week and also done by compiler
o Assume using floating point operations – still single cycle on this processor
o Prove that this algorithm will not give real time performance at 96000 Hz sampling rate on a
500 MHz 21469 processor. (What is the actual processor clock rate of the 21469?)
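The arithmetic behind this real-time question can be sketched in a few lines of C++. This is only a sketch: the function names are made up for illustration, the 500 MHz clock is the figure quoted above (not a checked data-sheet value), and loop set-up and subroutine overhead are ignored.

```cpp
#include <cassert>

// FFT points needed for a given frequency resolution:
// resolution = sampleRate / points, so points = sampleRate / resolution.
double PointsForResolution(double sampleRateHz, double resolutionHz) {
    assert(sampleRateHz > 0.0 && resolutionHz > 0.0);
    return sampleRateHz / resolutionHz;
}

// Processor cycles available between two consecutive audio samples.
double CyclesAvailablePerSample(double clockHz, double sampleRateHz) {
    assert(clockHz > 0.0 && sampleRateHz > 0.0);
    return clockHz / sampleRateHz;
}

// Cycles for one FIR output when the inner loop completes tapsPerCycle
// taps every cycle (assumption: all other overhead is ignored).
double CyclesNeededPerSample(double numberTaps, double tapsPerCycle) {
    assert(numberTaps > 0.0 && tapsPerCycle > 0.0);
    return numberTaps / tapsPerCycle;
}

// Real time means the filter finishes before the next sample arrives.
bool RunsInRealTime(double clockHz, double sampleRateHz,
                    double numberTaps, double tapsPerCycle) {
    return CyclesNeededPerSample(numberTaps, tapsPerCycle)
           <= CyclesAvailablePerSample(clockHz, sampleRateHz);
}
```

With a 500 MHz clock and 96000 Hz sampling there are about 5208 cycles per sample. Even at the best case of two taps per cycle (one per ALU in dual mode), 200,000 taps need 100,000 cycles, so RunsInRealTime(500e6, 96000.0, 200000.0, 2.0) is false.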
SO HOW CAN WE DO IT?
TIME DOMAIN APPROACH: Something like this
Develop the FIR code and test using EmbeddedUnit
Adjust number of taps to 1024 (FIFO buffer size)
Inside the audio processing routine – TTCOS_AddPremptiveTask( ) ‐‐ TTCOS interrupt service routine call‐
back function
o Grab one new value, put into ISR FIFO input array
o Perform FIR and put output into ISR FIFO output array
o Every 32 input samples, calculate the average of the last 32 filtered samples and put in a new external
FIFO array of size 256 (Numbers may need adjusting)
Every time there are 64 new values in the external array, the ISR call back function launches a run‐once
TTCOS_AddTask (runs outside ISR – so can run slowly) which filters the output array using a second copy of
the FIR code.
o The output of this second FIR filter is a filtered version of a down sampled filtered version of the
original signal using just FIR code and averaging code routines
It is this filtered output we analyse as follows
o The original filtering and down sampling give us a bandwidth of around 300 Hz
o On the down‐sampled filtered signal – we use a series of sharp FIR filters to detect signals in 18 to
20Hz range, 40 to 44 Hz range, 56 to 59 Hz range.
Matlab has programs to give us the filter coefficients automatically.
o Personally I would try the code out in Matlab (about 10 lines of code) to make sure that the
algorithm works before I move over to writing C++. We will discuss some possible code in class
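The first stage of the time domain approach (FIR each new sample, then average every 32 filtered values to down-sample) might look something like this in C++. This is a sketch to try on a PC, not the class code: the class name DecimatingFIR is invented here, the FIFO update is a simple (slow) shift rather than a circular buffer, and the TTCOS call-back wiring is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class DecimatingFIR {
public:
    DecimatingFIR(const std::vector<float> &coeffs, int decimation)
        : coeffs_(coeffs), fifo_(coeffs.size(), 0.0f),
          decimation_(decimation), sum_(0.0f), count_(0) {
        assert(!coeffs.empty() && decimation > 0);
    }

    // Call once per input sample. Returns true when a new down-sampled
    // value (average of the last 'decimation' filtered values) is ready.
    bool ProcessSample(float newValue, float *decimatedOut) {
        // FIFO update: shift down one place, newest value at index 0.
        for (std::size_t i = fifo_.size() - 1; i > 0; i--) {
            fifo_[i] = fifo_[i - 1];
        }
        fifo_[0] = newValue;

        // FIR: multiply and accumulate over all taps.
        float filtered = 0.0f;
        for (std::size_t i = 0; i < fifo_.size(); i++) {
            filtered += fifo_[i] * coeffs_[i];
        }

        // Averaging stage: emit one value per 'decimation' inputs.
        sum_ += filtered;
        if (++count_ < decimation_) return false;
        *decimatedOut = sum_ / decimation_;
        sum_ = 0.0f;
        count_ = 0;
        return true;
    }

private:
    std::vector<float> coeffs_;
    std::vector<float> fifo_;
    int decimation_;
    float sum_;
    int count_;
};
```

A second FIR pass over the down-sampled values would then run outside the ISR, as described above. On the 21469 the shift would normally be replaced by a circular buffer.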
FREQUENCY DOMAIN APPROACH: So much easier to understand – also FFT is faster (discussed in class)
Inside TTCOS_AddPremptiveTask( ) capture 2048 points into buffer1
o Launch a run‐once TTCOS_AddTask (not pre‐emptive) TASKA when the buffer is full and start filling
buffer 2 inside the ISR so we don’t miss any audio values
TASKA must complete within 2048 x 1/ 96000 second (before next buffer is full)
o Do 2048 point FFT. There are now 2048 frequency points
o Keep frequency points associated with low frequencies (set everything else to zero) – that’s how
simple filtering is in the frequency domain
o Do inverse FFT (IFFT) back to time domain
o Take every 128th point from the filtered signal and put it in a new array of size 256 (that’s the
down‐sampling). Every time that new buffer is full, launch another run‐once TTCOS_AddTask, TASKB
TASKB is identical to TASKA (actually can be the same task, because TTCOS is a co‐operative scheduler)
except runs on a different buffer
o Except that this code only runs the FFT part and not the IFFT part. We can just pick out the 19 Hz, 42
Hz, 57 Hz signals directly from the frequency information :‐)
o Again try it out in Matlab ‐‐ We will discuss some possible code in class
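Before moving to the VDSP FFT code, the TASKA steps (FFT, zero the high frequency points, IFFT, keep every 128th point) can be tried out on a PC. The sketch below uses a small recursive radix-2 FFT written only for illustration; it is not the Analog Devices routine, and the names FFT and LowPassAndDownSample are invented here. Note that a 2048 point block at 96000 Hz gives frequency points 96000 / 2048 = 46.875 Hz apart, and the task has 2048 / 96000 second (about 21.3 ms) to finish.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> Complex;

// In-place recursive radix-2 FFT. inverse = true gives the unscaled
// inverse transform (caller divides by the length).
void FFT(std::vector<Complex> &x, bool inverse) {
    const std::size_t n = x.size();
    if (n <= 1) return;
    assert((n & (n - 1)) == 0);  // length must be a power of 2
    std::vector<Complex> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; i++) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }
    FFT(even, inverse);
    FFT(odd, inverse);
    const double pi = 3.14159265358979323846;
    const double sign = inverse ? 1.0 : -1.0;
    for (std::size_t k = 0; k < n / 2; k++) {
        const Complex w = std::polar(1.0, sign * 2.0 * pi * k / n) * odd[k];
        x[k] = even[k] + w;
        x[k + n / 2] = even[k] - w;
    }
}

// Keep frequency points 0..cutoffBin (and their mirror images), zero the
// rest, transform back, then keep every 'step'-th time point.
std::vector<double> LowPassAndDownSample(const std::vector<double> &block,
                                         std::size_t cutoffBin,
                                         std::size_t step) {
    const std::size_t n = block.size();
    assert(step > 0 && cutoffBin <= n / 2);
    std::vector<Complex> spectrum(block.begin(), block.end());
    FFT(spectrum, false);
    for (std::size_t k = 0; k < n; k++) {
        const bool isLow = (k <= cutoffBin) || (k >= n - cutoffBin);
        if (!isLow) spectrum[k] = 0.0;
    }
    FFT(spectrum, true);
    std::vector<double> out;
    for (std::size_t i = 0; i < n; i += step) {
        out.push_back(spectrum[i].real() / n);
    }
    return out;
}
```

For example, LowPassAndDownSample(block, 6, 128) keeps roughly the lowest 280 Hz of a 2048 point block and returns 16 down-sampled values.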
NOTE: ALL THIS IS DONE INSIDE TTCOS ‐‐ Operating system with just 3 pieces of code – Average, FIR
and FFT
Assignment 1 due: Includes audio demos and testing and a discussion on your timing report. Due 6th Feb.
Combine your team timing results with your individual timing results. Include documented code and a
comparison between the expected performance (theoretical) and actual performance.
There is a single Lab. 1 / Lab. 2 report – common to both people in the team. Includes audio demos and testing.
This is due 2 weeks after Lab. 2 is held (excluding reading week). Include documented code and a comparison
between the expected performance (theoretical) and actual performance – include resource charts and the like.
Details to come.
Assignment 2 (Individual): Use the optimization techniques learnt during Lab. 1, Lab. 2 and Lab. 3 to develop
DSP code that works efficiently on “blocks” of data (double buffering and circular buffers) and executes
“outside” of ISR. Includes audio demos and testing and a discussion of your timing report. Include documented
code and a comparison between the expected performance (theoretical) and actual performance.
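The double buffering idea can be sketched as follows: the ISR fills one block while code outside the ISR works on the other. The class name DoubleBuffer and the block size are assumptions made for this illustration, and no TTCOS calls are shown.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t BLOCK_SIZE = 2048;  // assumed block length

class DoubleBuffer {
public:
    DoubleBuffer() : fillIndex_(0), count_(0), ready_(nullptr) {}

    // Called for every new sample (inside the ISR). Returns true when a
    // block has just been completed and the buffers have been swapped.
    bool AddSample(float value) {
        buffers_[fillIndex_][count_++] = value;
        if (count_ < BLOCK_SIZE) return false;
        ready_ = buffers_[fillIndex_];  // completed block for the slow task
        fillIndex_ = 1 - fillIndex_;    // ISR carries on in the other buffer
        count_ = 0;
        return true;
    }

    // Block to process outside the ISR; valid until the next swap.
    const float *ReadyBlock() const {
        assert(ready_ != nullptr);
        return ready_;
    }

private:
    float buffers_[2][BLOCK_SIZE];
    int fillIndex_;
    std::size_t count_;
    const float *ready_;
};
```

The block processing must finish before the other buffer fills, otherwise the data being processed gets overwritten.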
There is a single Lab. 3 / Lab. 4 report – common to both people in the team. Includes audio demos and testing
and timing report. This is due 1 week after Lab. 4 is held. Include documented code and a comparison between
the expected performance (theoretical) and actual performance.
OVERVIEW – More details are given later.
Introductory Laboratory (Team) / Assignment 1 (Individual): Basic practice with SHARC C++, assembly code
combined with real‐time validation for a simple routine (Averaging). Includes audio demos and testing and a
discussion on your timing report. Due 6th Feb. Combine your team timing results with your individual timing
results. Include documented code
Lab. 1 (Team): – Introduction to a structured process for developing DSP algorithms and comparing
performance of C (debug and release mode) and assembly code development (software loops and hardware
loop optimization only).
Lab. 2 (Team): – Assembly code optimization including dual data access and comparing performance when
‘instruction cache thrash’ does not occur and does occur. C++ optimization to include ‘function in‐lining’ and
demonstrating the use of the SHARC profiler to identify ‘where the code spends most of its time’. This approach
identifies the code that you will spend most of your time optimizing.
Lab. 3 (Team): Design and then implement further optimizations of your FIR algorithm. Compare using C++ DSP
language extensions (minimum of 4 extensions – use mixed mode (screen capture) to demonstrate activation of
SIMD mode, ‘software pipelining’, parallel memory accesses and ‘COMPUTE’ instructions (simultaneous add and
multiply operations)) when in debug and release mode. Design and implement an optimized assembly language
version demonstrating ‘software pipelining’, parallel memory accesses and ‘COMPUTE’ instructions and SIMD
mode. Compare and contrast performance.
Lab. 4 (Team): Using the example code available with VDSP, compare the performance of the hardware ‘FIR’
DSP accelerator on the SHARC with your ‘best’ FIR implementation using the C++ compiler and your ‘best’ FIR
implementation in assembly code.
DETAILS ‐‐ We are trying to demonstrate that we understand why the way we write our C++ code and
assembly code for an implementation impacts the speed of the code.
IMPORTANT: If we really wanted to use these techniques ‘properly’ we would need complex algorithms to
implement. Each lab would take weeks. So instead we will demonstrate the ideas on simple algorithms where
the techniques are probably over‐kill, but at least do‐able in the time available.
IMPORTANT – Most of the changes involve cut‐and‐paste of existing code. However it is very important that
each C++ version of the function Average( ) or FIR( ) is in a different file, with different data buffer names and loop
variable names. If this is not done, then the C++ compiler will use the wrong information and the results will be
meaningless. The details in Assignment 1 will show how to avoid this problem.
IMPORTANT: Remember – timing tests must run with long averaging filters to match theory
For example
Step 1: ‐‐ we write a simple version of the averaging filter discussed in class – AveragingFilter_UnoptimisedCPP( )
and place in a file AveragingFilter_UnoptimisedCPP.cpp. We test, and time and use inside TTCOS with the
compiler running in debug mode
Step 2A: ‐‐ Make a copy of the file AveragingFilter_UnoptimisedCPP.cpp as AveragingFilter_OptimisedCPP.cpp
and change the name of the function to AveragingFilter_OptimisedCPP( ) – add to the project
Step 2B: ‐‐ Make copies of all the existing tests for AveragingFilter_UnoptimisedCPP( ) (same test file) and
rewrite the tests to call AveragingFilter_OptimisedCPP( ). Should give the same results as Step2A – simply a
test to show you have done the code renaming correctly
Step 2C: ‐‐ Important: Right click on AveragingFilter_OptimisedCPP.cpp – Select file options – select custom (or
is it file specific?), then select ‘optimize mode’. If you run the timing tests now, you should get much faster
performance that matches your theoretical results. – I will show you how to do this step during the lab.
Step 2D: Use mixed mode to see the optimized assembly code the compiler generated – see if there are any
special instructions that we need to learn to use
Step 3A: ‐‐ Make copies of all the existing tests for AveragingFilter_UnoptimisedCPP( ) (same test file) and
rewrite the tests to call AveragingFilter_UnoptimisedASM( ). Build a stub version of the function
AveragingFilter_UnoptimisedASM( ) and place in a file called AveragingFilter_UnoptimisedASM.asm. If you
don’t follow this instruction exactly, you will find everything breaks. Explain why this is a problem as part of
your assignment 1 report.
Step 3B: Compile and link the new tests and stub. This shows you have the right names.
Step 3C: Run the assembly code tests – they should fail, but at least this shows that you have set the code up
correctly.
Step 3D: Write the unoptimized assembly code and run tests and timing. WAIL on optimized assembly code.
Step 4A: Put algorithm into resource chart – find out what the theoretical unoptimized and optimized code
speeds should be.
Step 4B: Short report on timing (compare theory and actual; your assembly code and compiler code). What
techniques does the C compiler show you need to learn?
Lab. 0: (Team): Demonstrate you can use VDSP, TTCOS and EmbeddedUnit on C++ and assembly code
environments. Set up things for the hum analysis. Design and implement a simple function in C++ and assembly
code. Using the testing framework show that all versions of the algorithm work correctly. Compare and contrast
the speed performance of the algorithm in C++ (debug and release) and assembly code (un‐optimized and using
software loop) with the ‘expected’ speed for the way you wrote your C code and assembly code assuming that
each SHARC instruction executes in 1 cycle. Show that the averaging function processing audio data works with
data lengths of 256 to 1024. Definition of working: ‐‐ Algorithm completes execution in 10% of the time for 1
audio sample period (1/41000).
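The two numbers used in the timing comparison (cycles per point, and the 10% budget in the definition of working) can be worked out as below. This is only a sketch: the function names are invented, and the clock rate you pass in is your own assumption to check against the processor documentation.

```cpp
#include <cassert>

// Cycles per point from a timed test: total cycles measured for
// numberCalls calls of a filter of length filterLength.
double CyclesPerPoint(double totalCycles, int numberCalls, int filterLength) {
    assert(numberCalls > 0 && filterLength > 0);
    return totalCycles / (static_cast<double>(numberCalls) * filterLength);
}

// Definition of working: one call must complete within budgetFraction
// (e.g. 0.10) of one audio sample period.
bool MeetsWorkingDefinition(double cyclesPerCall, double clockHz,
                            double samplePeriodSeconds, double budgetFraction) {
    assert(clockHz > 0.0 && samplePeriodSeconds > 0.0);
    assert(budgetFraction > 0.0 && budgetFraction <= 1.0);
    const double budgetCycles = budgetFraction * clockHz * samplePeriodSeconds;
    return cyclesPerCall <= budgetCycles;
}
```

For example, assuming a 500 MHz clock and the 1/41000 second period quoted above, the budget is about 1219 cycles per call, so a 1024 point filter fits at 1 cycle per point but not at 2 cycles per point.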
Requirements:
Demonstrate EmbeddedUnit Knowledge – Get code to work and generate timing (compare to theory) for
unoptimized and optimized C++ code
A) Set up a C++ task void Mock_42Hz_HumAnalysisAlgorithmCPP(void) with the following properties.
1) Each time we call (enter) the code there is a probability PEXISTS that the 42 Hz signal exists; if so turn
on the 42 Hz warning flag (external global variable semaphore present42Hz – use typedef
unsigned int semaphore ). Use VDSP help to find out how to use rand( ) which is a standard part of
the C++ library.
2) Each time we enter the code – we will turn off the warning signal if
X seconds of Hum has passed and there is a probability PGONE that the signal no longer exists
3) Should be no more than about 10 lines of code; and some significant part of the code should be in a
subroutine. (Having stuff in a subroutine gives us the opportunity to see how the C++ compiler can /
can’t optimize subroutine calls)
4) Call the routine TTCOS2011_WriteLED(int value) with a value 1 to turn on LED1 if hum present. You
can’t see this happening until we burn the flash memory on an external board. However you can
check that it occurs by calling unsigned int TTCOS2011_ReadLED( ).
5) Once the code is working, duplicate the code so that you have mock analysis tasks for the 19 Hz, 44 Hz
and 57 Hz hums (same code; different names and probabilities)
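One possible shape for the mock 42 Hz task is sketched below so it can be tried on a PC. Everything not named in the requirements is an assumption: the probability values, MIN_CALLS_OF_HUM (standing in for "X seconds of Hum", counted in calls), and the StandIn_WriteLED / StandIn_ReadLED routines, which are local stand-ins for the TTCOS LED routines and not the real ones.

```cpp
#include <cassert>
#include <cstdlib>

typedef unsigned int semaphore;
semaphore present42Hz = 0;           // warning flag named in the requirements

// Hypothetical tuning values; the requirements leave these open.
double PEXISTS = 0.01;               // chance per call that the hum appears
double PGONE = 0.05;                 // chance per call that the hum stops
unsigned int MIN_CALLS_OF_HUM = 5;   // stands in for "X seconds of Hum"

// Local stand-ins for the LED routines (assumption, PC testing only).
static unsigned int standInLED = 0;
void StandIn_WriteLED(int value) { standInLED = static_cast<unsigned int>(value); }
unsigned int StandIn_ReadLED(void) { return standInLED; }

// Subroutine: true with probability p. rand() returns 0 .. RAND_MAX.
bool HappensWithProbability(double p) {
    assert(p >= 0.0 && p <= 1.0);
    return std::rand() / (RAND_MAX + 1.0) < p;
}

void Mock_42Hz_HumAnalysisAlgorithmCPP(void) {
    static unsigned int callsWithHum = 0;
    if (present42Hz == 0) {
        if (HappensWithProbability(PEXISTS)) {
            present42Hz = 1;         // hum has started
            callsWithHum = 0;
        }
    } else if (++callsWithHum >= MIN_CALLS_OF_HUM
               && HappensWithProbability(PGONE)) {
        present42Hz = 0;             // hum lasted long enough and has gone
    }
    StandIn_WriteLED(present42Hz == 1 ? 1 : 0);
}
```

Setting PEXISTS to 1.0 or PGONE to 0.0 in a test makes the behaviour predictable, which is handy when writing the EmbeddedUnit tests.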
B) Set up a C++ task void SetUpSineArrayCPP(float *aSineWave, int arrayLength) which fills the array
aSineWave with exactly one sine wave cycle in arrayLength points. (This allows us to actually set up arrays in
different parts of memory so we can practice some of the fancy go‐fast techniques)
C) Set up a C++ task void Mock_GenerateSonic_AlarmsCPP(void)
Each time we enter –
a. copy input audio channel 1 left to output channel 1 left
b. Use the code from class to generate an average, ClassAverageFilterCPP( ), of channel 1 left input and
output to channel 1 right
c. If semaphore present42Hz is equal to 1, grab values from SineWaveArray1 and output to channel 2
left output. Move through the array so that if this routine was called at 96000 Hz, it would generate
a 200 Hz tone. Watch out for volume issues, as when we use the code for real we want to still be able
to hear at the end of the course
d. NICE TO HAVE ‐‐ More pleasant if we have overtones – meaning generate 200 Hz with a little
(1/4) of the first overtone (400 Hz) plus 1/16 of the second overtone (third harmonic, 600 Hz)
e. Once parts c and d are working – duplicate for semaphore present57Hz which grabs values from
SineWaveArray2 and outputs a 275 Hz tone to the right channel output
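Parts B and C.c can be tried out together on a PC: fill a table with one sine cycle, then step through it at a rate that gives the wanted tone. The step size is arrayLength * tone / call rate table entries per call. This is a sketch only; the ToneGenerator class is invented for illustration and the overtones of part d are left out.

```cpp
#include <cassert>
#include <cmath>

// Fill aSineWave with exactly one sine wave cycle in arrayLength points.
void SetUpSineArrayCPP(float *aSineWave, int arrayLength) {
    assert(aSineWave != nullptr && arrayLength > 0);
    const double twoPi = 6.283185307179586;
    for (int i = 0; i < arrayLength; i++) {
        aSineWave[i] = static_cast<float>(std::sin(twoPi * i / arrayLength));
    }
}

// Steps through a one-cycle sine table so that calling NextSample()
// at callRateHz produces a tone at toneHz.
class ToneGenerator {
public:
    ToneGenerator(const float *table, int length, float toneHz, float callRateHz)
        : table_(table), length_(length), phase_(0.0f),
          step_(length * toneHz / callRateHz) {
        assert(table != nullptr && length > 0);
        assert(toneHz > 0.0f && toneHz < callRateHz);
    }

    // volume scales the output; keep it small to protect your hearing.
    float NextSample(float volume) {
        const float value = volume * table_[static_cast<int>(phase_)];
        phase_ += step_;
        if (phase_ >= length_) phase_ -= length_;  // wrap around the table
        return value;
    }

private:
    const float *table_;
    int length_;
    float phase_;  // current (fractional) position in the table
    float step_;   // table entries to move per call
};
```

With a 480 point table, a 200 Hz tone at a 96000 Hz call rate needs a step of exactly 1 entry per call; the 275 Hz tone needs a fractional step, which the phase variable handles by truncating to the nearest lower table entry.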
D) Lab 0: Team – use the assembly code given in class, generate an assembly language version of the average
routine ClassAverageFilterASM( );
Assignment 1: Individual – Generate the (un‐optimized ) assembly code for the
Mock_GenerateSonic_AlarmsASM( )
Assignment 1: Individual – Using the code from class generate a C++ routine float FIRfilter(float *FIFO, float
*FIRcoffs, int FIRlength) which uses pointer arithmetic to access external arrays and returns the filtered
result. NOTE: passing external arrays as subroutine parameters allows us to use one set of filter code for
many different channels (if careful)
Assignment 1: Individual – Using the code from class generate a C++ routine float FIRfilter(float FIFO[ ],
float FIRcoffs[ ], int FIRlength) which uses array indexing to access external arrays and returns the filtered
result.
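The difference between the two FIR routines is only in how the arrays are accessed. A minimal sketch of both styles follows; it is not the code from class, and the two versions are given different names here (FIRfilter_Pointer and FIRfilter_Index, both made up) so that they can live in one project.

```cpp
#include <cassert>

// Pointer arithmetic version: post-increment both pointers each tap.
float FIRfilter_Pointer(const float *FIFO, const float *FIRcoffs, int FIRlength) {
    assert(FIFO != nullptr && FIRcoffs != nullptr && FIRlength > 0);
    float sum = 0.0f;
    for (int i = 0; i < FIRlength; i++) {
        sum += *FIFO++ * *FIRcoffs++;   // multiply and accumulate
    }
    return sum;
}

// Array index version: same arithmetic, index notation.
float FIRfilter_Index(const float FIFO[], const float FIRcoffs[], int FIRlength) {
    assert(FIFO != nullptr && FIRcoffs != nullptr && FIRlength > 0);
    float sum = 0.0f;
    for (int i = 0; i < FIRlength; i++) {
        sum += FIFO[i] * FIRcoffs[i];   // multiply and accumulate
    }
    return sum;
}
```

Because the FIFO and coefficient arrays are passed in as parameters, the same routine can be called once per channel with a different FIFO each time.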
TTCOS demonstration
1) As you get the code working, then port to the TTCOS environment and try it for real
2) Lab 0 – just need to demo that tests work and audio works
3) Because we are pushed for time in getting ready for Lab 1
Assignment 1 – hand in documented code showing that the tests work and demo next lab – due by 6th Feb
Timing analysis (theory versus practical) we will make due the following week (13th Feb)
Make the timing analysis for the C++ FIR filter part of Lab. 1
OTHER USEFUL INFORMATION:
Pseudo code for C++ and assembly code
Define the length of the filter as a globally defined constant in both C++ and assembly code.
e.g.
#define PREDEFINED_N 256
float FIFO_Simple_Average[PREDEFINED_N];
void Simple_Average(void) {
    FIFO_Simple_Average filter update
    Add new value
    for (I = 0; I < PREDEFINED_N; I++)
        do average of FIFO_Simple_Average
}
Express timing in terms of processing cycles / point and not total execution time. Do timing with N a nice
number (e.g. power of 2) and N not a nice number (e.g. 241). Expected difference – C++ release slightly better for N
a nice number.
Place tests in 1 file, place averaging filter in a different file. Discuss (compare and contrast) the speed impact of
passing parameters in C++ and assembly code, for code written with no optimization in C++ (debug mode) and then
allowing the C++ compiler to optimize the code (release mode).
Separate file ‐‐ tests
#define PREDEFINED_N_V1 256 and 243
float FIFO_AveragePass_Parameters_V1[1024]
float AveragePass_Parameters_V1(float newValue,
                        float FIFO[ ], int FIFO_length);
TEST( ) {
    float inValue = XXX;
    Time 1000 calls of
        AveragePass_Parameters_V1(inValue,
                        FIFO_AveragePass_Parameters_V1,
                        PREDEFINED_N_V1)
    Test (CHECK) code works
}
//******************
#define PREDEFINED_N_V2 256 and 243
float FIFO_AveragePass_Parameters_V2[1024]
float AveragePass_Parameters_V2(float newValue,
                        float *FIFO, int FIFO_length);
TEST( ) {
    float inValue = XXX;
    Time 1000 calls of
        AveragePass_Parameters_V2(inValue,
                        FIFO_AveragePass_Parameters_V2,
                        PREDEFINED_N_V2)
    Test (CHECK) code works
}
Separate file ‐‐ index notation for accessing array
float AveragePass_Parameters_V1(float newValue,
                        float FIFO[ ], int FIFO_length) {
    FIFO filter update
    Add new value
    for (I = 0; I < FIFO_length; I++)
        do average of FIFO[ ]
    // Use index notation for accessing array
}
Separate file ‐‐ post‐increment pointer notation for accessing array
float AveragePass_Parameters_V2(float newValue,
                        float *FIFO, int FIFO_length) {
    FIFO filter update
    Add new value
    for (I = 0; I < FIFO_length; I++)
        do average of FIFO[ ]
    // Use post‐increment pointer notation for accessing array
}
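As a starting point for turning the pseudo code into something that compiles, here is one way the two averaging routines might be filled in. It is a sketch, not the class code: the FIFO update is a simple shift in both versions, and only the averaging loop differs (index notation in V1, post-increment pointer notation in V2).

```cpp
#include <cassert>

// V1: index notation for accessing the array.
float AveragePass_Parameters_V1(float newValue, float FIFO[], int FIFO_length) {
    assert(FIFO != nullptr && FIFO_length > 0);
    // FIFO update: shift down one place, then add the new value.
    for (int i = FIFO_length - 1; i > 0; i--) {
        FIFO[i] = FIFO[i - 1];
    }
    FIFO[0] = newValue;
    // Average using index notation.
    float sum = 0.0f;
    for (int i = 0; i < FIFO_length; i++) {
        sum += FIFO[i];
    }
    return sum / FIFO_length;
}

// V2: post-increment pointer notation for accessing the array.
float AveragePass_Parameters_V2(float newValue, float *FIFO, int FIFO_length) {
    assert(FIFO != nullptr && FIFO_length > 0);
    // FIFO update: same shift as V1.
    for (int i = FIFO_length - 1; i > 0; i--) {
        FIFO[i] = FIFO[i - 1];
    }
    *FIFO = newValue;
    // Average using a post-incremented pointer.
    float sum = 0.0f;
    const float *next = FIFO;
    for (int i = 0; i < FIFO_length; i++) {
        sum += *next++;
    }
    return sum / FIFO_length;
}
```

Both versions must give identical results on the same input; the interesting part is comparing their cycles / point in debug and release mode.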
Expectations for timing: This is discussed in the articles ‘The byte of the SHARC’ which can be downloaded from the
ENCM515 web site
C debug mode – index access to array will be slower than post‐increment access
C release mode – if the compiler ‘is good’, it will switch automatically to the faster way of accessing the array (meaning it might
do indexing for the FIFO update and increment mode for the averaging). Examine the code in MIXED mode to see how well
the compiler is optimizing
Assembly code – I would expect that your code will be slower by around 4 * N cycles (4 cycles per point) if you code using
index mode compared with using post‐increment mode. NOTE most of your code will be cut‐and‐paste – so check that your code works
Lab. 1 (Team): – Introduction to a structured process for developing DSP algorithms and comparing
performance of C (debug and release mode) and assembly code development (software loops and hardware
loop optimization only).
Basically a repeat of the ideas for assignment 1 except that you are using an FIR filter rather than an averaging
filter
Expectation: code slower by 2 * N cycles (2 cycles / point) than averaging as you need an additional memory
access and a multiply
Lab. 2 (Team): – Assembly code optimization including dual data access and comparing performance when
‘instruction cache thrash’ does not occur and does occur. C++ optimization to include ‘function in‐lining’ and
demonstrating the use of the SHARC profiler to identify ‘where the code spends most of its time’. This approach
identifies the code that you will spend most of your time optimizing.
Will demonstrate code in‐lining in class – expectation – reduce time for 1000 calls to be around 1000 * time for
subroutine (1000 * 20 cycles) plus time to set up and use parameters (1000 * 5 cycles). Very important if
subroutines that are called from inside a loop are short (FIR length small) .
Doing dual access (dm and pm in C++, and I4, I12 in assembly code) will improve speed by around N cycles (1
cycle / point) but may be faster than that as the C compiler might start putting instructions in delay slots (out of
order execution)
Lab. 3 (Team): Design and then implement further optimizations of your FIR algorithm. Compare using C++ DSP
language extensions (minimum of 4 extensions – use mixed mode (screen capture) to demonstrate activation of
SIMD mode, ‘software pipelining’, parallel memory accesses and ‘COMPUTE’ instructions (simultaneous add and
multiply operations)) when in debug and release mode. Design and implement an optimized assembly language
version demonstrating ‘software pipelining’, parallel memory accesses and ‘COMPUTE’ instructions and SIMD
mode. Compare and contrast performance.
Lab. 4 (Team): Using the example code available with VDSP, compare the performance of the hardware ‘FIR’
DSP accelerator on the SHARC with your ‘best’ FIR implementation using the C++ compiler and your ‘best’ FIR