CHAPTER 1
1.1 Introduction
Noise is a random fluctuation in an electrical signal, a characteristic of all electronic
circuits. Noise generated by electronic devices varies greatly, as it can be produced by
several different effects. In communication systems, noise is an error or undesired
random disturbance of a useful information signal. Denoising is the extraction of a signal
from a mixture of signal and noise. This is the first step in many applications.
In this project, the DWT is used for de-noising a one-dimensional signal. Linear methods
of de-noising (like filtering) have the drawback of either removing sharp features
(sudden changes) or not completely removing noise. The DWT is a non-linear method
that separates the signal from noise by comparing their amplitudes rather than their
spectra.
1.2 Aim of the project
The aim of the project is to de-noise a real-time signal and to design a suitable
architecture for high-speed implementation.
1.3 Methodology
The test signal is initially analyzed in MATLAB using a suitable mother wavelet. It is
then decomposed into the required number of levels and denoised using a suitable
threshold rule.
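As a rough illustration of these steps (the report's actual mother wavelet, level count, and threshold rule are chosen during the MATLAB analysis and are not reproduced here), a single-level Haar decomposition with hard thresholding can be sketched in Python:

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar decomposition: approximation and detail coefficients."""
    x = np.asarray(x, dtype=float)
    cA = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # low-pass filter + downsample
    cD = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # high-pass filter + downsample
    return cA, cD

def haar_idwt(cA, cD):
    """Invert haar_dwt exactly."""
    x = np.empty(2 * len(cA))
    x[0::2] = (cA + cD) / np.sqrt(2.0)
    x[1::2] = (cA - cD) / np.sqrt(2.0)
    return x

def denoise(x, threshold):
    """Decompose, hard-threshold the detail coefficients, reconstruct."""
    cA, cD = haar_dwt(x)
    cD = np.where(np.abs(cD) > threshold, cD, 0.0)
    return haar_idwt(cA, cD)
```

With threshold = 0 the signal is reconstructed exactly; a positive threshold suppresses small detail coefficients, which is where most of the noise energy concentrates.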
De-noising using the DWT is realized with the concept of Parallel Distributed Arithmetic.
The filtering part of the reconstruction process is important, because it is the choice of
filters that determines whether perfect reconstruction of the original signal can be achieved.
The down sampling of the signal components performed during the decomposition phase
introduces a distortion called aliasing. It turns out that by carefully choosing filters for the
decomposition and reconstruction phases that are closely related (but not identical),
the effects of aliasing can be cancelled out.
The low- and high-pass decomposition filters (L and H), together with their associated
reconstruction filters (L' and H'), form a system of what is called quadrature mirror
filters:
Fig 2.23 (a) Decomposition and (b) Reconstruction
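The Haar pair gives the smallest concrete example of such a filter system (the report does not specify its filters at this point; Haar is assumed here purely for illustration). A Python sketch, with the reconstruction filters taken as the time-reversed decomposition filters:

```python
import numpy as np

L = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass decomposition filter
H = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass decomposition filter
Lr, Hr = L[::-1], H[::-1]                 # reconstruction filters L', H'

def analyze(x):
    """Filter then downsample by 2 (this is where aliasing is introduced)."""
    return np.convolve(x, L)[1::2], np.convolve(x, H)[1::2]

def synthesize(a, d):
    """Upsample by 2 (insert zeros), filter, and sum the two branches."""
    ua = np.zeros(2 * len(a)); ua[0::2] = a
    ud = np.zeros(2 * len(d)); ud[0::2] = d
    y = np.convolve(ua, Lr) + np.convolve(ud, Hr)
    return y[:2 * len(a)]                 # the aliasing terms cancel exactly
```

Because the two branches are fed through this closely related (but not identical) filter pair, their aliasing components are equal and opposite, and the summed output recovers the input.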
2.1.14.1 Reconstructing Approximations and Details
It is possible to reconstruct our original signal from the coefficients of the approximations
and details.
Fig 2.24 Reconstructing approximations and details
It is also possible to reconstruct the approximations and details themselves from their
coefficient vectors. As an example, consider how the first-level approximation A1 can be
reconstructed from the coefficient vector cA1.
Coefficient vector cA1 is passed through the same process used to reconstruct the
original signal. However, instead of combining it with the level-one detail cD1, we feed
in a vector of zeros in place of the detail coefficients vector:
Fig 2.25 Reconstructing the signal from approximations
The process yields a reconstructed approximation A1, which has the same length as the
original signal S and which is a real approximation of it.
Similarly, the first-level detail D1 can be reconstructed, using the analogous process:
Fig 2.25 Reconstructing the signal from details
The reconstructed details and approximations are true constituents of the original signal.
In fact, when we combine them we find that A1 + D1 = S.
Note, however, that the coefficient vectors cA1 and cD1 cannot directly be combined to
reproduce the signal, because they were produced by down sampling and are only half
the length of the original signal. It is necessary to reconstruct the approximations and
details before combining them.
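This additive relationship is easy to check numerically. The sketch below (Haar filters assumed, since the report does not fix a wavelet at this point) reconstructs A1 from (cA1, zeros) and D1 from (zeros, cD1) and sums them:

```python
import numpy as np

def haar_analyze(x):
    """Single-level Haar decomposition into half-length cA and cD."""
    x = np.asarray(x, dtype=float)
    return (x[0::2] + x[1::2]) / np.sqrt(2.0), (x[0::2] - x[1::2]) / np.sqrt(2.0)

def haar_synthesize(cA, cD):
    """Full-length reconstruction from a pair of coefficient vectors."""
    y = np.empty(2 * len(cA))
    y[0::2] = (cA + cD) / np.sqrt(2.0)
    y[1::2] = (cA - cD) / np.sqrt(2.0)
    return y

S = np.array([2.0, 4.0, -1.0, 3.0, 0.0, 5.0])
cA1, cD1 = haar_analyze(S)                     # half-length coefficient vectors
A1 = haar_synthesize(cA1, np.zeros_like(cD1))  # zeros fed in place of cD1
D1 = haar_synthesize(np.zeros_like(cA1), cD1)  # zeros fed in place of cA1
```

A1 and D1 each have the full length of S, and A1 + D1 reproduces S exactly, whereas the half-length vectors cA1 and cD1 cannot be combined directly.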
Extending this technique to the components of a multilevel analysis, we find that similar
relationships hold for all the reconstructed signal constituents. That is, there are several
ways to reassemble the original signal:
Fig 2.26 Reconstructed signal components
2.2 Distributed Arithmetic (DA)
2.2.1 Distributed Arithmetic at a Glance
The arithmetic sum of products that defines the response of linear, time-invariant
networks can be expressed as:

y(n) = A_1·x_1(n) + A_2·x_2(n) + … + A_K·x_K(n)    Equ 2.3

where
y(n) is the response of the network at time n,
x_k(n) is the kth input at time n, and
A_k is the weighting factor of the kth input variable, which is constant for all n,
and so it remains time-invariant.
In filtering applications the constants, Ak , are the filter coefficients and the variables,
xk , are the prior samples of a single data source (for example, an analog to digital
converter). In frequency transforming - whether the discrete Fourier or the fast Fourier
transform - the constants are the sine/cosine basis functions and the variables are a block
of samples from a single data source. Examples of multiple data sources may be found in
image processing.
The multiply-intensive nature of equ. 2.3 can be appreciated by observing that a
single output response requires the accumulation of K product terms. In DA the task of
summing product terms is replaced by table look-up procedures that are easily
implemented in the Xilinx configurable logic block (CLB) look-up table architecture.
We start by defining the number format of the variable to be 2’s complement, fractional -
a standard practice for fixed-point microprocessors in order to bound number growth
under multiplication. The constant factors, Ak, need not be so restricted, nor are they
required to match the data word length, as is the case for the microprocessor. The
constants may have a mixed integer and fractional format; they need not be defined at
this time. The variable, x_k, may be written in the fractional format as shown in equ. 2.4:

x_k = -x_k0 + x_k1·2^-1 + x_k2·2^-2 + … + x_k(B-1)·2^-(B-1)    Equ 2.4

where x_kb is a binary variable and can assume only values of 0 and 1. A sign bit of value
-1 is indicated by x_k0. The time index, n, has been dropped since it is not needed to
continue the derivation. The final result is obtained by first substituting equ. 2.4 into
equ. 2.3:

y = Σ (k = 1 to K) A_k·[-x_k0 + x_k1·2^-1 + … + x_k(B-1)·2^-(B-1)]    Equ 2.5

and then explicitly expressing all the product terms under the summation symbols:

y = -[A_1·x_10 + A_2·x_20 + … + A_K·x_K0]
  + [A_1·x_11 + A_2·x_21 + … + A_K·x_K1]·2^-1
  + [A_1·x_12 + A_2·x_22 + … + A_K·x_K2]·2^-2
  + …
  + [A_1·x_1(B-1) + A_2·x_2(B-1) + … + A_K·x_K(B-1)]·2^-(B-1)    Equ 2.6
Each term within the brackets denotes a binary AND operation involving a bit of the
input variable and all the bits of the constant. The plus signs denote arithmetic sum
operations. The exponential factors denote the scaled contributions of the bracketed pairs
to the total sum. We can construct a look-up table that is addressed by the same-weighted bit
of all the input variables and that holds the sum of the terms within each pair of brackets.
Such a table is shown in fig. 2.27 and will henceforth be referred to as a Distributed
Arithmetic look-up table, or DALUT. The same DALUT can be time-shared in a serially
organized computation or can be replicated B times for a parallel computation scheme.
Fig 2.27 The Distributed Arithmetic Look-up Table (DALUT)
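A DALUT of this kind is straightforward to construct in software. The sketch below (hypothetical K = 3 coefficients, chosen only for illustration) stores, at address m, the sum of those constants A_k whose corresponding address bit is set:

```python
# Hypothetical coefficients; in a filter these would be the tap values A_k.
A = [0.5, -0.25, 0.125]
K = len(A)

# DALUT[m] = sum of A[k] over the set bits of the K-bit address m: the
# bracketed terms of equ. 2.6, precomputed for every possible bit pattern
# of the K input variables.
DALUT = [sum(A[k] for k in range(K) if (m >> k) & 1) for m in range(2 ** K)]
```

Address 0b101, for example, selects A[0] + A[2]; one table lookup replaces K multiplies and adds for that bit position.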
The arithmetic operations have now been reduced to addition, subtraction, and binary
scaling. With scaling by negative powers of 2, the actual implementation entails the
shifting of binary-coded data words toward the least significant bit and the use of sign
extension bits to maintain the sign at its normal bit position. The hardware
implementation of a binary full adder (as is done in the CLBs) entails two operands, the
addend and the augend, to produce sum and carry output bits. The multiple bit-parallel
additions of the DALUT outputs expressed in equ.2.6 can only be performed with a
single parallel adder if this adder is time-shared. Alternatively, if simultaneous addition
of all DALUT outputs is required, an array of parallel adders is required. These opposite
goals represent the classic speed-cost tradeoff.
2.2.2 The Speed Tradeoff
Any new device that can be software-configured to perform DSP functions must contend
with the well-entrenched standard DSP chips, i.e. the programmable fixed-point
microprocessors that feature concurrently operating hardware multipliers and address
generators, and on-chip memories. The first challenge is speed. If the FPGA doesn't offer
higher speed, why bother? For a single filter channel the bother is worth it, particularly as
the filter order increases. And the FPGA advantage grows for multiple filter channels.
Alas, a simple advantage may not be persuasive in all cases; an overwhelming speed
advantage may be needed for FPGA acceptance. Reaching data sample rates of
50 megasamples/sec requires a high cost in gate resources. The first two examples will
show the end points of the serial/parallel tradeoff continuum.
2.2.3 The Ultimate in Speed
Conceivably, with a fully parallel design the sample speed could match the system clock
rate. This is the case where all the add operations of the bracketed values (the DALUT
outputs) of equ. 2.6 are performed in parallel. Implementation guidance can be gained
by rephrasing equ. 2.6; to facilitate this process, abbreviate the contents within each
bracket pair as [sum b], indexed by the data bit position b. Thus, for B=16, equ. 2.6
becomes:
y = -[sum0] + [sum1]·2^-1 + [sum2]·2^-2 + … + [sum15]·2^-15    Equ 2.7
The decomposition of equ. 2.7 into an array of two-input adders is given below:

y = {(-[sum0] + [sum1]·2^-1) + ([sum2] + [sum3]·2^-1)·2^-2}
  + {([sum4] + [sum5]·2^-1) + ([sum6] + [sum7]·2^-1)·2^-2}·2^-4
  + [{([sum8] + [sum9]·2^-1) + ([sum10] + [sum11]·2^-1)·2^-2}
  + {([sum12] + [sum13]·2^-1) + ([sum14] + [sum15]·2^-1)·2^-2}·2^-4]·2^-8    Equ 2.8
Equations 2.7 and 2.8 are computationally equivalent, but equ. 2.8 can be mapped in a
straightforward way into a binary tree-like array of summing nodes, with scaling effected
by signal routing as shown in fig. 2.28. Each of the 15 nodes represents a parallel adder,
and while the computation can yield responses that include both the double precision
(B+C bits) of the implicit multiplication and the attendant processing gain, these adders
can be truncated to produce single-precision (B bits) responses.
Fig. 2.28 Example of Fully Parallel DA Model (K=16, B=16)
All B bits of all K data sources must be present to address the B DALUTs. A B×K array
of flip-flops is required. Each of the B identical DALUTs contains 2^K words with C bits
per word, where C is the "cumulative" coefficient accuracy. The data flow from the
flip-flop array can be all combinatorial; the critical delay path for B=16 is not inordinately
long: signal routing through 5 CLB stages and a carry chain embracing 2C adder stages.
A system clock in the 10 MHz range may work. Certainly, with internode pipelining, a
system clock of 50 MHz appears feasible. The latency would, in many cases, be
acceptable; however, it would be problematic in feedback networks (e.g., IIR filters).
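The fully parallel scheme can be modeled in a few lines of Python (hypothetical K = 4 coefficients and B = 8 bits rather than the 16 of the figure, to keep the sketch small): every bit position addresses its own copy of the DALUT, and the outputs are combined in a binary adder tree with routing-only scaling, the sign-bit DALUT output entering negatively.

```python
K, B = 4, 8
A = [0.25, -0.5, 0.75, 0.125]        # hypothetical coefficients A_k

# DALUT[m] = sum of A[k] over the set bits of address m (equ. 2.6 brackets).
DALUT = [sum(A[k] for k in range(K) if (m >> k) & 1) for m in range(2 ** K)]

def to_bits(x):
    """B-bit two's-complement fractional code of x in [-1, 1), MSB (sign) first."""
    code = round(x * 2 ** (B - 1)) % (2 ** B)
    return [(code >> (B - 1 - b)) & 1 for b in range(B)]

def parallel_da(xs):
    bits = [to_bits(x) for x in xs]                    # the B x K flip-flop array
    # One DALUT lookup per bit position, all addressed simultaneously.
    outs = [DALUT[sum(bits[k][b] << k for k in range(K))] for b in range(B)]
    level = [-outs[0]] + outs[1:]                      # sign-bit term is negated
    shift = 1
    while len(level) > 1:                              # binary adder tree, B-1 nodes
        level = [level[2 * i] + 2.0 ** -shift * level[2 * i + 1]
                 for i in range(len(level) // 2)]
        shift *= 2
    return level[0]
```

With exactly representable inputs the tree output matches the direct inner product Σ A_k·x_k.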
2.2.4 The Ultimate in Gate Efficiency
The ultimate in gate efficiency would be a single DALUT, a single parallel adder, and, of
course, fewer flip-flops for the input data source. Again with our B=16 example, a
rephrasing of equ. 2.6 yields the desired result:

y = 2^-1·(2^-1·(2^-1·( … (2^-1·[sum15] + [sum14]) … ) + [sum2]) + [sum1]) - [sum0]    Equ 2.9

Starting from the least significant end, i.e. addressing the DALUT with the least
significant bit of all K input variables, the DALUT contents, [sum15], are stored, scaled
by 2^-1, and then added to the DALUT contents, [sum14], when the address changes to the
next-to-least-significant bits. The process repeats until the most significant bit
addresses the DALUT, yielding [sum0]; since this is the sign bit, a subtraction occurs.
Now a vision of the hardware emerges. A serial shift register, B bits long, for each of the
K variables addresses the DALUT least significant bit first. At each shift the output is
applied to a parallel adder whose output is stored in an accumulator register. The
accumulator output, scaled by 2^-1, is the second input to the adder. Henceforth, the
adder, register, and scaler shall be referred to as a scaling accumulator. The functional
blocks are shown in fig. 2.29. All can be readily mapped into the Xilinx 4000 CLBs.
There is a performance price to be paid for this gate efficiency: the computation takes at
least B clocks.
Fig 2.29 Serially Organized DA processor
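The serially organized processor can be modeled bit by bit (same hypothetical K = 4, B = 8 setup as before; the report's design uses B = 16): one DALUT address is formed per clock, LSB first, and the scaling accumulator subtracts on the final sign-bit clock.

```python
K, B = 4, 8
A = [0.25, -0.5, 0.75, 0.125]        # hypothetical coefficients A_k

DALUT = [sum(A[k] for k in range(K) if (m >> k) & 1) for m in range(2 ** K)]

def to_bits(x):
    """B-bit two's-complement fractional code of x in [-1, 1), MSB (sign) first."""
    code = round(x * 2 ** (B - 1)) % (2 ** B)
    return [(code >> (B - 1 - b)) & 1 for b in range(B)]

def serial_da(xs):
    bits = [to_bits(x) for x in xs]   # one B-bit shift register per input
    acc = 0.0
    for b in range(B - 1, -1, -1):    # B clocks, least significant bit first
        out = DALUT[sum(bits[k][b] << k for k in range(K))]
        if b == 0:                    # sign-bit clock: subtract the DALUT output
            acc = -out + 0.5 * acc
        else:                         # otherwise add; accumulator scaled by 2^-1
            acc = out + 0.5 * acc
    return acc
```

The computation takes B clocks, trading the parallel version's adder tree for one adder, one register, and one hard-wired shift.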
2.2.5 Between the Extremes
While there are a number of speed-gate count tradeoffs that range from one bit per clock
(the ultimate in gate efficiency) to B bits per clock (the ultimate in speed), the question of
their effectiveness under architectural constraints remains. We can start this study with
the case of 2-bit-at-a-time processing; the computation lasts B/2 clocks and the DALUT
now grows to cover two contiguous bits, i.e. [sum b + [sum(b+1)]·2^-1]. Again consider
the case of B = 16 and rephrase equ. 2.7:
y = -[sum0] + [sum1 + [sum2]·2^-1]·2^-1 + [sum3 + [sum4]·2^-1]·2^-3 + …
  + [sum13 + [sum14]·2^-1]·2^-13 + [sum15]·2^-15    Equ 2.10
The terms within the rectangular brackets are stored in a common DALUT, which can
also serve [sum0] and [sum15]. Note that the computation takes B/2 + 1, or 9, clock
periods. The functional blocks of the data path are shown in fig. 2.30(a). The odd-valued
scale factors outside the rectangular brackets do introduce some complexity to the circuit,
but it can be managed.
Fig. 2.30(a) Two-bit-at-a-time Distributed Arithmetic Data Path (B=16, K=16)
The scaling simplifies with magnitude-only input data, and the two-bit processing would
then last for 8 clock periods. Thus:

y = [sum0 + [sum1]·2^-1] + [sum2 + [sum3]·2^-1]·2^-2 + …
  + [sum14 + [sum15]·2^-1]·2^-14    Equ 2.11
There is another way of rephrasing or partitioning equ. 2.7 that maintains the B/2 clock
computation time:
y = [-[sum0] + [sum1]·2^-1 + … + [sum7]·2^-7]
  + [[sum8] + [sum9]·2^-1 + … + [sum15]·2^-7]·2^-8    Equ 2.12
Here two identical DALUTs, two scaling accumulators, and a post-accumulator adder
(fig.2.30(b)) are required. While the adder in the scaling accumulator may be single
precision, the second adder stage may be double precision to meet performance
requirements.
Fig 2.30(b) Two-bit-at-a-time Distributed Arithmetic Data Path (B=16, K=16)
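A software sketch of this partitioned form (hypothetical K = 4, B = 8, as in the earlier sketches): each scaling accumulator runs over its own half of the bits, and the post-accumulator adder combines the halves with a 2^-B/2 scaling.

```python
K, B = 4, 8
A = [0.25, -0.5, 0.75, 0.125]        # hypothetical coefficients A_k

DALUT = [sum(A[k] for k in range(K) if (m >> k) & 1) for m in range(2 ** K)]

def to_bits(x):
    """B-bit two's-complement fractional code of x in [-1, 1), MSB (sign) first."""
    code = round(x * 2 ** (B - 1)) % (2 ** B)
    return [(code >> (B - 1 - b)) & 1 for b in range(B)]

def split_da(xs):
    bits = [to_bits(x) for x in xs]
    def scaling_acc(lo, hi, signed):
        """One scaling accumulator over bit positions lo..hi, LSB first."""
        acc = 0.0
        for b in range(hi, lo - 1, -1):
            out = DALUT[sum(bits[k][b] << k for k in range(K))]
            acc = (-out if signed and b == lo else out) + 0.5 * acc
        return acc
    ms = scaling_acc(0, B // 2 - 1, True)    # bits 0..B/2-1 (includes the sign bit)
    ls = scaling_acc(B // 2, B - 1, False)   # bits B/2..B-1
    return ms + 2.0 ** -(B // 2) * ls        # post-accumulator adder
```

Both accumulators run concurrently in hardware, so only B/2 clocks are needed for the same result as the bit-serial version.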
There are other two-bit-at-a-time possibilities. Each possibility implies a different circuit
arrangement. Consider a third rephrasing of equ. 2.7.
y = (-[sum0] + [sum1]·2^-1) + ([sum2] + [sum3]·2^-1)·2^-2 + …
  + ([sum14] + [sum15]·2^-1)·2^-14    Equ 2.13
Here the inner brackets denote a DALUT output, while the larger, outer brackets denote
the scaling of the scaling accumulator. Two parallel odd-even bit data paths are indicated
(fig. 2.30(c)) with two identical DALUTs. The DALUT addressed by the even bits has its
output scaled by 2^-1 and then applied to the parallel adder. The adder sum is then
applied to the scaling accumulator, which yields the desired response, y(n). Here a single-
precision pre-accumulator adder replaces the double-precision post-accumulator adder.
Fig.2.30(c) Two-bit-at-a-time Distributed Arithmetic Data Path (B=16, K=16)
Each of these approaches implies a different gate implementation. Certainly one of the
most important concerns is DALUT size, which is constrained by the look-up table
capacity of the CLB. The first approach, defined by equ. 2.10, describes a DALUT of
2^2K words that feeds a single scaling accumulator, while the second, defined by
equ. 2.12, describes 2 DALUTs, each of 2^K words, that feed separate scaling
accumulators. An additional parallel adder is required to sum (with the 2^-B/2 scaling
indicated) the two output halves. The difference in memory sizes between 2^2K and
2×2^K is very significant, particularly when we confront reality, namely the CLB
memory sizes of 32x1 or 2x(16x1) bits.
2.2.6 Parallel Realization
In its most obvious and direct form, distributed arithmetic computations are bit-serial in
nature, i.e., each bit of the input samples must be indexed in turn before a new output
sample becomes available. When the input samples are represented with B bits of
precision, B clock cycles are required to complete an inner-product calculation. A parallel
realization of distributed arithmetic corresponds to allowing multiple bits to be processed
in one clock cycle by duplicating the LUT and adder tree. In a 2-bit-at-a-time parallel
implementation, the odd bits are fed to one LUT and adder tree, while the even bits are
simultaneously fed to an identical tree. The odd-bit partials are left-shifted to properly
weight the result and added to the even partials before accumulating the aggregate. In the
extreme case, all input bits can be computed in parallel and then combined in a shifting
adder tree.
Fig 2.31 Mallat's quadrature mirror filter tree used to compute the coefficients of the (a)
forward and (b) inverse wavelet transforms.
CHAPTER 3
3.1 Introduction
This chapter describes the detailed procedure adopted to denoise the signal. It also
explains in detail the MATLAB functions involved in the process.
3.2 Testing in MATLAB
Steps involved in denoising the signal using MATLAB are:
1. Load a signal.
2. Perform a single-level wavelet decomposition of the signal.
3. Construct approximations and details from the coefficients.
4. Display the approximation and detail.
5. Perform a multilevel wavelet decomposition of the signal.
6. Extract approximation and detail coefficients.
7. Apply thresholding to the detail coefficients.
8. Reconstruct the level-3 approximation.
9. Display the results of the multilevel decomposition.
10. Reconstruct the original signal from the level-3 decomposition.
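The same pipeline can be sketched outside MATLAB (Haar wavelet and a hard-threshold rule assumed purely for illustration; the report's wavelet, level count, and threshold rule are selected during the MATLAB analysis):

```python
import numpy as np

s2 = np.sqrt(2.0)

def dwt1(x):
    """Single-level Haar decomposition."""
    return (x[0::2] + x[1::2]) / s2, (x[0::2] - x[1::2]) / s2

def idwt1(cA, cD):
    """Single-level Haar reconstruction."""
    y = np.empty(2 * len(cA))
    y[0::2] = (cA + cD) / s2
    y[1::2] = (cA - cD) / s2
    return y

def wavedec(x, n):
    """Multilevel decomposition: returns cA_n and [cD1, ..., cDn]."""
    cA, details = np.asarray(x, dtype=float), []
    for _ in range(n):
        cA, cD = dwt1(cA)
        details.append(cD)
    return cA, details

def waverec(cA, details, threshold=0.0):
    """Hard-threshold each detail level, then reconstruct level by level."""
    for cD in reversed(details):
        cD = np.where(np.abs(cD) > threshold, cD, 0.0)
        cA = idwt1(cA, cD)
    return cA
```

With threshold = 0, waverec(*wavedec(x, 3)) reproduces x exactly; in this simplified sketch the signal length must be divisible by 2^n.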
3.3 Functions involved in denoising a signal
3.3.1 Analysis-Decomposition Functions
3.3.1.1 dwt
Purpose
Single-level discrete 1-D wavelet transform
Syntax
[cA,cD] = dwt(X,'wname')
[cA,cD] = dwt(X,'wname','mode',MODE)
[cA,cD] = dwt(X,Lo_D,Hi_D)
[cA,cD] = dwt(X,Lo_D,Hi_D,'mode',MODE)
Description
The dwt command performs a single-level one-dimensional wavelet decomposition with
respect to either a particular wavelet or particular wavelet decomposition filters (Lo_D
and Hi_D) that we specify.
[cA,cD] = dwt(X,'wname') computes the approximation coefficients vector cA and detail
coefficients vector cD, obtained by a wavelet decomposition of the vector X. The string
'wname' contains the wavelet name.
[cA,cD] = dwt(X,Lo_D,Hi_D) computes the wavelet decomposition as above, given
these filters as input:
Lo_D is the decomposition low-pass filter.
Hi_D is the decomposition high-pass filter.
Lo_D and Hi_D must be the same length.
Let lx = the length of X and lf = the length of the filters Lo_D and Hi_D; then
length(cA) = length(cD) = la, where la = ceil(lx/2) if the DWT extension mode is set to
periodization. For the other extension modes, la = floor((lx+lf-1)/2).
[cA,cD] = dwt(...,'mode',MODE) computes the wavelet decomposition with the
extension mode MODE that you specify. MODE is a string containing the desired
extension mode.
Example:
[cA,cD] = dwt(x,'db1','mode','sym');
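The length relation for the non-periodized modes can be checked with a direct convolution model (Haar filters, i.e. 'db1'; full convolution stands in for the extension handling, so this is a rough model rather than MATLAB's exact boundary treatment):

```python
import numpy as np

Lo_D = np.array([1.0, 1.0]) / np.sqrt(2.0)   # decomposition low-pass filter
Hi_D = np.array([1.0, -1.0]) / np.sqrt(2.0)  # decomposition high-pass filter

def dwt_conv(X):
    """Filter with Lo_D/Hi_D, then keep every second sample."""
    cA = np.convolve(X, Lo_D)[1::2]
    cD = np.convolve(X, Hi_D)[1::2]
    return cA, cD
```

For lx = 7 and lf = 2 this yields length(cA) = length(cD) = floor((7+2-1)/2) = 4, matching the formula above.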
3.3.1.2 wavedec
Purpose
Multilevel 1-D wavelet decomposition
Syntax
[C,L] = wavedec(X,N,'wname')
[C,L] = wavedec(X,N,Lo_D,Hi_D)
Description
wavedec performs a multilevel one-dimensional wavelet analysis using either a specific
wavelet ('wname') or specific wavelet decomposition filters. [C,L] =
wavedec(X,N,'wname') returns the wavelet decomposition of the signal X at level N,
using 'wname'. N must be a strictly positive integer. The output decomposition structure
contains the wavelet decomposition vector C and the bookkeeping vector L. The structure
is organized as in this level-3 decomposition example.
Fig 3.1 Decomposition structure
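The layout of this level-3 structure can be illustrated with a small sketch (Haar wavelet assumed): C concatenates the coefficient segments coarsest first, and the bookkeeping vector records the segment lengths followed by the original signal length.

```python
import numpy as np

s2 = np.sqrt(2.0)

def dwt1(x):
    """Single-level Haar decomposition."""
    return (x[0::2] + x[1::2]) / s2, (x[0::2] - x[1::2]) / s2

X = np.arange(16.0)
cA, details = X, []
for _ in range(3):                     # three decomposition levels
    cA, cD = dwt1(cA)
    details.append(cD)                 # collected as [cD1, cD2, cD3]

C = np.concatenate([cA] + details[::-1])                     # [cA3, cD3, cD2, cD1]
Lbk = [len(cA)] + [len(d) for d in details[::-1]] + [len(X)]  # bookkeeping vector
```

Here C holds the 16 coefficients in the order cA3, cD3, cD2, cD1, and the bookkeeping vector Lbk is [2, 2, 4, 8, 16].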
[C,L] = wavedec(X,N,Lo_D,Hi_D) returns the decomposition structure as above, given
these filters as input.
component interpolator_hpf_rec is
  port (clock : in STD_LOGIC;
        datain : in std_logic_vector(15 downto 0);
        samplecnt, levelno : in integer;
        countout : out integer;
        doutput : out std_logic_vector(15 downto 0));
end component;

signal din, fir1 : std_logic_vector(15 downto 0);
signal ca_cnt, cd_cnt : integer;
signal ca_clock : std_logic;
signal ca_dec, cd_dec, ca_rec, cd_rec, cd1_dec : std_logic_vector(15 downto 0);

begin

c0: fileread port map(clockin, din);
c1: decimator_lpf_dec port map(clockin, din, samplecount, 1, ca_cnt, ca_clock, fir1, ca_dec);
c2: decimator_hpf_dec port map(clockin, din, samplecount, 1, cd_dec);
c3: interpolator_lpf_rec port map(clockin, ca_dec, ca_cnt, 2, ca_rec);
c4: interpolator_hpf_rec port map(clockin, cd_dec, ca_cnt, 2, cd_cnt, cd_rec);

dout <= ca_rec + cd_rec;

end dwt_single_level;
4.3 Results
Fig 4.6 Samples taken from MATLAB
4.4 Comparison of the results
Fig 4.9 MATLAB values for denoised signal
4.5 Conclusion
Real-world signals are often corrupted by noise, which may severely limit their
usefulness. For this reason, signal denoising is a topic that continually draws great
interest. Wavelets are an alternative tool for signal decomposition using orthogonal
functions. Unlike basic Fourier analysis, wavelets do not completely lose time
information, a feature that makes the technique suitable for applications where the
temporal location of the signal's frequency content is important. One of the fields where
wavelets have been successfully applied is data analysis. In particular, it has been
demonstrated that wavelets produce excellent results in signal denoising. This work
presents a procedure to denoise a signal using the discrete wavelet transform. A real-time
electrical signal contaminated with noise is used as a test bed for the method. The
simulation result of the suggested design is presented. Future work includes using
multiwavelets to denoise a signal.