Computer Engineering
Mekelweg 4, 2628 CD Delft, The Netherlands
http://ce.et.tudelft.nl/
2009
MSc THESIS
FPGA Hardware acceleration of co-occurring aberrations in aCGH data
Marco R. van der Leije
CE-MS-2009-17
Abstract
Unbalanced translocations can lead to additions and deletions in genes, which can be an indication of tumor cells. These are measured with array Comparative Genomic Hybridization (aCGH). An algorithm was designed to find co-occurring aberrations in DNA, but its execution takes days. This thesis proposes a partially FPGA-based design in which the number of parallel computations can be increased. The FPGA communicates with a computer over gigabit Ethernet, for which a hardware-based Ethernet controller is built on the FPGA. The design is scalable across FPGAs, so its performance scales linearly with the size of the FPGA's resources. On an XC4VFX12 device, a minimum speedup of a factor of 3 and a maximum speedup of several hundred is achieved.
FPGA Hardware acceleration of co-occurring
aberrations in aCGH data
THESIS
submitted in partial fulfilment of the requirements for the degree of
MASTER OF SCIENCE
in
COMPUTER ENGINEERING
by
Marco R. van der Leije born in Rotterdam, The Netherlands
Computer Engineering
Department of Electrical Engineering
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology
FPGA Hardware acceleration of co-occurring
aberrations in aCGH data
by Marco van der Leije
Abstract
Unbalanced translocations can lead to additions and deletions in genes, which can be an
indication of tumor cells. These are measured with array Comparative Genomic Hybridization
(aCGH). An algorithm was designed to find co-occurring aberrations in DNA, but its execution
takes days. This thesis proposes a partially FPGA-based design in which the number of parallel
computations can be increased. The FPGA communicates with a computer over gigabit Ethernet,
for which a hardware-based Ethernet controller is built on the FPGA. The design is scalable
across FPGAs, so its performance scales linearly with the size of the FPGA resources. On an
XC4VFX12 device, a minimum speedup of a factor of 3 and a maximum speedup of several hundred
is achieved.
Laboratory : Computer Engineering
Codenumber : CE-MS-2009-17
Committee Members :
Advisor: Arjan J. van Genderen, CE, TU Delft
Advisor: Marcel J.T. Reinders, ICT, TU Delft
Chairperson: Koen Bertels, CE, TU Delft
Member: Georgi N. Gaydadjiev, CE, TU Delft
Member: Jeroen de Ridder, ICT, TU Delft
Contents
List of Figures
List of Tables
Acknowledgements
1 Introduction
  1.1 Aberrations in aCGH data
  1.2 Problem statement
  1.3 Project goals and approach
  1.4 Chapter overview
2 The Algorithm
  2.1 Inputs and outputs
  2.2 Pair-wise space
  2.3 Covariance and normalization
  2.4 2D kernel
  2.5 Peaks
3 Implementation in software
  3.1 Pair-wise space
  3.2 Covariance and normalization
  3.3 2D kernel
  3.4 Peaks
    4.2.1 General purpose processor
    4.2.2 Cell (microprocessor)
    4.2.3 Graphic Processor Unit
    4.2.3 Field programmable gate array
    4.2.4 Conclusion
5 Architecture
  5.1 Partitioning of the algorithm
  5.2 Software versus hardware
  5.3 Software
  5.4 Hardware
6 Implementation in hardware
  6.1 Ethernet communication
    6.1.1 The TEMAC and the FIFO’s
    6.1.2 Communication data format
    6.1.3 The Ethernet controller
8 Conclusions and future work
  8.1 Conclusion
  8.2 Future work
Bibliography
List of Figures
Figure 1.1: Loss of a tumor suppressor gene and the gain of an oncogene
Figure 1.2: Tumor DNA’s from different persons
Figure 1.3: Co-occurrences alterations in tumor DNA
Figure 2.1: The inputs
Figure 2.2: The output
Figure 2.3: Pre computation (pseudo code)
Figure 2.4: Calculate the minimums
Figure 2.5: Sums the C matrices
Figure 2.6: The NORM matrix
Figure 2.7: Normalize the pair-wise space
Figure 2.8: Normal 2D kernel convolution
Figure 2.9: Separated 2D kernel convolution
Figure 2.10: Finding peaks
Figure 3.1: Pseudo code of calculating the pair-wise space
Figure 3.2: Pseudo code of the covariance and normalization step
Figure 3.3: Pseudo code of calculating the covariance matrix
Figure 3.4: Matlab code defining the kernel matrix
Figure 3.5: Pseudo code of the 2D kernel convolution
Figure 3.6: Pseudo code of the peak finding algorithm
Figure 3.7: Pseudo code of the subroutine EXPAND
Figure 5.1: Pseudo code of to_fpga function
Figure 5.2: Software architecture
Figure 5.3: Pseudo code of fpga function
Figure 5.4: New NORM function
Figure 5.5: Pseudo code of fpga_buffer function
Figure 5.6: Total hardware architecture
Figure 6.1: Communication TEMAC-FIFO’s
Figure 6.2: accelerator architecture
Figure 6.3: Computational unit
Figure 6.4: Receive process
Figure 6.5: Calculate process (0 < g < G)
Figure 6.7: Calculate Result Mins (left) and calculate Result Covar (right)
Figure 6.8: Calculate Data OUT
Figure 7.1: Execution time of two small input matrices
Figure 7.2: Execution time of one small and one big input matrix
Figure 7.3: FPGA execution time for different input sizes
List of Tables
Table 4.1: Number of arithmetic operations
Table 4.2: Arithmetic and time complexity
Table 5.1: Sizes of the matrices
Table 7.1: Resources used for the accelerator hardware
Table 7.2: Errors between KCSMART v8 software and KCSMART v8
Acknowledgements
This report is the result of several months of work on improving the discussed algorithm. It
was a great challenge, and many aspects of design came along. Without some help this would
have been very hard to realize.
I would like to thank Arjan van Genderen for his help and trust in me. He gave good advice
and gave me much freedom to work at home. He also critically checked my first report and gave
some great advice on it. I would also like to thank Marcel Reinders and Jeroen de Ridder. They
explained the algorithm and helped me to stay on the right track. Finally I want to thank
Chris Klijn, who wrote the MATLAB implementation. He gave insight into the algorithm and
explained everything needed to get started.
Marco R. van der Leije
Delft, The Netherlands
August 20, 2009
1 Introduction
This report discusses the acceleration of a process that is used to find co-occurrences of
alterations in DNA strings. First, some background information on these alterations is given
(1.1). To find the co-occurrences of these alterations, an algorithm has been designed. This
algorithm takes a lot of execution time, which leads to the problem statement (1.2). Thereafter
the project goals and approach are explained (1.3). Finally, section 1.4 gives a chapter
overview, which explains the structure of this report.
1.1 Aberrations in aCGH data
Genomic instability is often observed in tumor cells [9]. This instability can lead to the loss of a
tumor suppressor gene and the gain of an oncogene, which is called an unbalanced translocation
(Figure 1.1). This means that deletions and additions of pieces of DNA can occur. Where genes
normally come in pairs, tumor cells have more (or fewer) copies of the same genes. These abnormal
numbers of gene copies are called copy number alterations (CNAs). These alterations are
interesting to find, because they can help in tumor research and other DNA studies.
One of the procedures to measure CNAs is array Comparative Genomic Hybridization
(aCGH) [3]. This method compares 'healthy DNA' with 'tumor DNA': a ratio of the number of
genes between 'healthy DNA' and 'tumor DNA' is calculated. This ratio is usually represented
in log2 form, where positive numbers represent a gain in the number of genes and negative
numbers represent a loss in the number of genes (compared to 'healthy DNA').
Figure 1.1: Loss of a tumor suppressor gene and the gain of an oncogene
1.2 Problem statement
Many analyses focus on finding CNAs, but most of them look for single-location variations.
Figure 1.2 displays three different tumor DNAs, where each arrow represents the ratio (between
the number of genes of healthy and tumor DNA). These analyses search for shared alterations in
the DNA. In this example they will find a large variation peak at position 4 (the fourth arrow
of all tumor DNAs is high) and a smaller variation peak at position 2.
However, not all single DNA changes lead to tumors or other dangerous cell mutations. For
research purposes the need to find co-occurrences in the DNA alterations is growing.
Therefore an algorithm was designed to find co-occurring aberrations [2]. In Figure 1.3 two
pieces of a DNA string are combined to find co-occurring alterations. The arrows represent the
same ratio as in Figure 1.2, and the size of a circle represents the importance of the
co-occurring alteration at that position in DNA pieces A and B. So the big circles are
co-occurring aberrations in one tumor DNA, and these positions have to be determined for each
tumor DNA. The size of the circles is calculated with the minimum function: a circle stands for
the minimum of two ratios between healthy and tumor DNA. The sum of all these circles over all
tumor DNAs is called the pair-wise space. The algorithm is explained further in chapter 2.
The problem is that the DNA string is large. This means many computations have to be
done to find alterations (the data is in the order of hundreds of megabytes). To find
co-occurrences, the number of computations grows exponentially with the number of simultaneous
alterations. Because the algorithm is optimal in its number of calculations (as far as is
known), there is a need to improve the speed of these computations.
The total DNA string is divided, because the total computation is too expensive to run on
one platform. To divide the calculation, the tumor DNA strings are separated into chromosome
arms (each person has 23 chromosome pairs and each chromosome consists of two chromosome
arms). Each pair of chromosome arms is used to calculate a part of the total result. 46x46/2
pairs of chromosome arms are compared for 5 kernel sizes and 3 modes (gain/gain, loss/loss and
gain/loss), which divides the problem into roughly 10k jobs [1].
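The job estimate above can be reproduced with a few lines of C. This is only a sketch: the function name job_count and its parameters are hypothetical, and the number of pairs follows the thesis' own 46x46/2 estimate.

```c
#include <assert.h>

/* Hypothetical helper: number of jobs when every pair of chromosome arms
 * is compared for each kernel size and each mode.
 * The number of pairs uses the estimate arms*arms/2 from the text. */
long job_count(long arms, long kernel_sizes, long modes)
{
    assert(arms > 0 && kernel_sizes > 0 && modes > 0);
    return arms * arms / 2 * kernel_sizes * modes;
}
```

With 46 chromosome arms, 5 kernel sizes and 3 modes this gives 15,870 jobs, which is of the order of 10k.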
Figure 1.2: Tumor DNA’s from different persons (rows: Tumor DNA 1, Tumor DNA 2, Tumor DNA 3)
Figure 1.3: Co-occurrences alterations in tumor DNA (label: DNA piece A)
1.3 Project goals and approach
The main goal is to accelerate the computations of the algorithm. The approach is to consider
different platforms for the computations and choose one of them. The algorithm is already
implemented in MATLAB and distributed over a network of computers. However, this
implementation takes a long time to compute the result.
The approach is to speed up one job (the co-occurring aberrations of two chromosome
arms) on a different platform. In this way a new network can be created, or combined with the
old network, to perform the same jobs faster. The approach consists of the following
phases:
• Literature study: understanding the algorithm and its background, and studying other
accelerator approaches in software and on different platforms.
• Convert the MATLAB implementation to a plain C implementation, to improve the algorithm
and to find parallelization possibilities.
• Search for and choose an implementation platform and design an architecture.
• Implement, test and verify the architecture.
This report focuses on accelerating the algorithm that finds the co-occurrences of two
alterations in DNA. In the future, co-occurrences of more alterations may be calculated;
however, the computation time grows exponentially (the discussed algorithm already took 10,000
days when calculated fully).
1.4 Chapter overview
This report is written in the same order as the approach. Chapter 2 visualizes and explains the
algorithm. This algorithm is implemented in plain C, where some algorithm steps are
redesigned (chapter 3). An implementation platform (chapter 4) was chosen and an architecture
was designed (chapter 5). This architecture was implemented (chapter 6) and tested. The results
are mentioned in chapter 7. Finally the conclusions and future work can be found in chapter 8.
2 The Algorithm
This chapter explains the algorithm to calculate the co-occurring aberrations. First the exact
inputs and outputs of the algorithm are shown (2.1), then a description follows how the pair-
wise space is calculated (2.2). The next stage is to normalize the pair-wise space (2.3) and
finally the peaks are calculated (2.5) after the 2D kernel is applied to the normalized pair-wise
space (2.4).
2.1 Inputs and outputs
Within the algorithm two different chromosome arms are compared for a number of tumor DNAs
from different persons (P). These chromosome arms for the different tumor DNAs are
displayed in Figure 2.1 and are called matrix A and B. Each row of matrix A and B contains
the ratios between healthy and tumor DNA for one tumor DNA (Figure 1.2 is illustrative for
the matrices A and B). Matrix A (chromosome arm A) is of length M and matrix B
(chromosome arm B) is of length N. Some precomputed normalization vectors (NA and NB) are
also given and are used to normalize the pair-wise space, so that all ratios have the same
intensity.
Figure 2.1: The inputs
The output consists of the 500 highest peaks of the pair-wise space after a kernel is applied.
Only the 500 highest peaks are used because the smaller peaks have no significant value and
the output matrix would be too large. The given information includes the location (X is the
position in A and Y is the position in B) and the height of the peak in the pair-wise space
(Figure 2.2).
Figure 2.2: The output
The values in matrix A and B indicate the ratio between the tumor and healthy signal (on a log2
scale), where negative values indicate a loss and positive values a gain. The algorithm can
calculate a gain/gain, loss/loss or gain/loss answer, which is selected with the ‘amp’ parameter.
For a gain situation, all negative values are nullified. For a loss situation, all positive
values are nullified and all negative values are inverted. All possibilities are shown in
Figure 2.3.
Figure 2.3: Pre computation (pseudo code)
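The pre-computation described above can be sketched in C as follows. This is a minimal illustration, not the thesis code: the function names, the use of float and the flat array arguments are assumptions, while the values of amp (0 = loss/loss, 1 = gain/gain, 2 = gain/loss) follow Figure 2.3.

```c
#include <assert.h>
#include <stddef.h>

/* Mode values follow the 'amp' parameter of Figure 2.3. */
enum { AMP_LOSS_LOSS = 0, AMP_GAIN_GAIN = 1, AMP_GAIN_LOSS = 2 };

/* Keep gains only: negative ratios become 0. */
static void keep_gains(float *x, size_t n)
{
    for (size_t k = 0; k < n; k++)
        if (x[k] < 0.0f)
            x[k] = 0.0f;
}

/* Keep losses only: positive ratios become 0, negative ones are inverted. */
static void keep_losses(float *x, size_t n)
{
    for (size_t k = 0; k < n; k++)
        x[k] = (x[k] < 0.0f) ? -x[k] : 0.0f;
}

/* a holds the M*P ratios of matrix A, b the N*P ratios of matrix B.
 * The operation is element-wise, so the memory layout does not matter. */
void precompute(int amp, float *a, size_t na, float *b, size_t nb)
{
    assert(amp >= 0 && amp <= 2);
    if (amp == AMP_LOSS_LOSS) keep_losses(a, na); else keep_gains(a, na);
    if (amp == AMP_GAIN_GAIN) keep_gains(b, nb);  else keep_losses(b, nb);
}
```

After this step all values are non-negative, so the minimum function of the next step can treat gains and losses in the same way.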
2.2 Pair-wise space
To create the pair-wise space the two chromosome arms are combined. This is implemented
with a minimum function, because a large gain or loss ratio common to both chromosome arms
is searched for. Each value of matrix A is compared with each value of matrix B for each
person. In this way a matrix C is computed for each person (Figure 2.4).
Pre computation (Figure 2.3):
switch amp
  case 1 // gain/gain
    for all 1 ≤ i ≤ M and 1 ≤ p ≤ P
      $A_{pi} < 0 \Rightarrow A_{pi} = 0$
    for all 1 ≤ j ≤ N and 1 ≤ p ≤ P
      $B_{pj} < 0 \Rightarrow B_{pj} = 0$
  case 0 // loss/loss
    for all 1 ≤ i ≤ M and 1 ≤ p ≤ P
      $A_{pi} > 0 \Rightarrow A_{pi} = 0$ else $A_{pi} = -A_{pi}$
    for all 1 ≤ j ≤ N and 1 ≤ p ≤ P
      $B_{pj} > 0 \Rightarrow B_{pj} = 0$ else $B_{pj} = -B_{pj}$
  case 2 // gain/loss
    for all 1 ≤ i ≤ M and 1 ≤ p ≤ P
      $A_{pi} < 0 \Rightarrow A_{pi} = 0$
    for all 1 ≤ j ≤ N and 1 ≤ p ≤ P
      $B_{pj} > 0 \Rightarrow B_{pj} = 0$ else $B_{pj} = -B_{pj}$
end;
Figure 2.4: Calculate the minimums
There are P different C matrices. Only high ratios that occur in all C matrices are important,
since systematic co-occurrence alterations in the chromosome arms are searched for. To find
these alterations all C matrices are added together (Figure 2.5). In this way each point in the
matrix D is the sum of the minimum of the ratios between the healthy and tumor DNA.
for all 1 ≤ i ≤ M and 1 ≤ j ≤ N
$D_{ji} = \sum_{p=1}^{P} C^{p}_{ji}$
Figure 2.5: Sums the C matrices
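A small worked example may help to see how the minimum and the sum combine. The numbers below are made up for illustration (they are not thesis data), with P = 2 tumor DNA's and M = N = 2 positions. Row p of A and B holds the ratios of tumor DNA p, and element (j, i) of each C matrix is the minimum of position i in A and position j in B.

```latex
\[
A = \begin{pmatrix} 1 & 0 \\ 0.5 & 3 \end{pmatrix}, \qquad
B = \begin{pmatrix} 2 & 0.5 \\ 1 & 1 \end{pmatrix}
\]
\[
C^{1}_{ji} = \min(A_{1i}, B_{1j}) \;\Rightarrow\;
C^{1} = \begin{pmatrix} 1 & 0 \\ 0.5 & 0 \end{pmatrix}, \qquad
C^{2}_{ji} = \min(A_{2i}, B_{2j}) \;\Rightarrow\;
C^{2} = \begin{pmatrix} 0.5 & 1 \\ 0.5 & 1 \end{pmatrix}
\]
\[
D = C^{1} + C^{2} = \begin{pmatrix} 1.5 & 1 \\ 1 & 1 \end{pmatrix}
\]
```

The largest entry of D is at i = 1, j = 1, the only combination of positions where both tumor DNA's have a non-zero ratio in A as well as in B.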
2.3 Covariance and normalization
The next step is to compute the covariance matrix to correct for continuous ratios. In addition
the normalization vectors are used to normalize the pair-wise space. The normalization matrix
is computed as shown in Figure 2.6. The matrix D is multiplied with the covariance matrix and
divided by the normalization matrix (Figure 2.7).
This covariance step looks like an easy computational step (in terms of multiplications and
divisions), but the covariance has the complexity of MxNxP (the same complexity as the
previous step). In addition some divisions are needed for the covariance matrix and
normalization.
Figure 2.6: The NORM matrix
Calculate the minimums (Figure 2.4):
for all 1 ≤ i ≤ M and 1 ≤ j ≤ N and 1 ≤ p ≤ P
$C^{p}_{ji} = \mathrm{MIN}(A_{pi}, B_{pj})$
Figure 2.7: Normalize the pair-wise space
2.4 2D kernel
A 2D Gaussian kernel is applied on the normalized pair-wise space to look at the local
enrichment of the highest values within this space. A normal 2D kernel convolution is shown in
Figure 2.8, where K denotes the height and width of the kernel matrix V. In this way the
complexity is MxNxKxK. So this step takes the most computational power (when KxK > P).
Because the kernel is a Gaussian kernel, this kernel is separable. This means that its
convolution can be done with one horizontal and one vertical vector of width K (Figure 2.9). In
this way the complexity decreases to MxNxKx2 and so this step takes less computational power
(when 2xK < P).
Figure 2.8: Normal 2D kernel convolution
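Normalize the pair-wise space (Figure 2.7):
for all 1 ≤ i ≤ M and 1 ≤ j ≤ N
$E_{ji} = D_{ji} \times COV_{ji} / NORM_{ji}$
Normal 2D kernel convolution (Figure 2.8):
for all 1 ≤ i ≤ M and 1 ≤ j ≤ N
$G_{ji} = \sum_{k1=1}^{K} \sum_{k2=1}^{K} E_{(j+K/2-k2),(i+K/2-k1)} \times V_{k1,k2}$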
[Diagram: E is convolved (CONV) with the kernel vector V of width K to give F; F is convolved with V in the other direction to give G.]
Figure 2.9: Separated 2D kernel convolution
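The difference between the normal and the separated convolution can be sketched in C. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, the matrices are stored row-major, values outside the matrix are treated as zero, and the K x K kernel is taken to be the product v[k2] * v[k1] of one kernel vector v, which is what makes it separable.

```c
#include <assert.h>
#include <stddef.h>

/* Read element (r, c) of a rows x cols row-major array; 0 outside the
 * matrix (zero padding is an assumption made for this sketch). */
static float at(const float *e, int rows, int cols, int r, int c)
{
    if (r < 0 || r >= rows || c < 0 || c >= cols)
        return 0.0f;
    return e[(size_t)r * cols + c];
}

/* Normal 2D convolution with the kernel V[k2][k1] = v[k2] * v[k1]:
 * rows * cols * K * K kernel taps. */
void conv2d_direct(const float *e, float *g, int rows, int cols,
                   const float *v, int K)
{
    assert(K > 0);
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            float sum = 0.0f;
            for (int k2 = 0; k2 < K; k2++)
                for (int k1 = 0; k1 < K; k1++)
                    sum += at(e, rows, cols, r + K / 2 - k2, c + K / 2 - k1)
                           * v[k2] * v[k1];
            g[(size_t)r * cols + c] = sum;
        }
}

/* Separated convolution: one pass along the rows (E -> F) and one pass
 * along the columns (F -> G): rows * cols * K * 2 kernel taps.
 * f is scratch space of the same size as e. */
void conv2d_separable(const float *e, float *f, float *g, int rows, int cols,
                      const float *v, int K)
{
    assert(K > 0);
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            float sum = 0.0f;
            for (int k1 = 0; k1 < K; k1++)
                sum += at(e, rows, cols, r, c + K / 2 - k1) * v[k1];
            f[(size_t)r * cols + c] = sum;
        }
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            float sum = 0.0f;
            for (int k2 = 0; k2 < K; k2++)
                sum += at(f, rows, cols, r + K / 2 - k2, c) * v[k2];
            g[(size_t)r * cols + c] = sum;
        }
}
```

Both functions compute the same result up to floating-point rounding, but the separated version needs 2xK instead of KxK kernel taps per output value.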
2.5 Peaks
This is the final step in the algorithm: it finds peaks in the pair-wise space to detect the
DNA locations that co-aberrate to a certain degree. The peak function, shown in Figure 2.10,
builds an array of the locations and heights of the peaks. This function should have a
complexity of MxN and should therefore be fast. However, as discussed in the next chapter, this
function took the most time in the MATLAB implementation.
Figure 2.10: Finding peaks
Separated 2D kernel convolution (Figure 2.9):
for all 1 ≤ i ≤ M and 1 ≤ j ≤ N
$F_{ji} = \sum_{k1=1}^{K} E_{j,(i+K/2-k1)} \times V_{k1}$
$G_{ji} = \sum_{k2=1}^{K} F_{(j+K/2-k2),i} \times V_{k2}$
3 Implementation in software
An important step in optimizing the algorithm is to gain insight into the number of operations
and the parallelism available in each step. This is done by translating the algorithm into
pseudo code. All steps of the algorithm, calculating the pair-wise space (3.1), the covariance
and the normalization (3.2), the 2D kernel (3.3) and the peak function (3.4), are translated
into pseudo code. The most important differences compared to the original MATLAB code are also
mentioned. Note that the matrices are stored column-wise in memory (this means that the
indices of the matrices are exchanged).
The pseudo code in this chapter is a simplified version of the real C code. Advanced code,
such as memory mapping for the kernel convolution and flat peak detection, is left out of the
report.
3.1 Pair-wise space
In this step the pair-wise space is calculated. First a matrix D is filled with zeros, and
then the minimum of each pair of values from matrix A and B is added to it (Figure 3.1).
The pseudo code shows that MxNxP minimum (MIN) operations must be performed. Each
minimum operation needs one comparator, and all of these comparators can work in
parallel. Only the P additions for each point in matrix D have to write to the same
address. The original MATLAB code ran 5 times slower than this code because of the
MATLAB interpreter.
Figure 3.1: Pseudo code of calculating the pair-wise space
//Input: matrix A and B are the ratios between the healthy and tumor DNA
//       for chromosome arms A and B for P number of tumor DNAs
//Output: matrix D is the pair-wise space
mins(M, N, P, matrix A, matrix B, matrix D)
{
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            D[i,j] = 0;
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            for (p=0; p<P; p++)
                D[i,j] += MIN( A[i,p], B[j,p]);
}
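The pseudo code of Figure 3.1 can be turned into compilable C along the following lines. This is a sketch under assumptions that the pseudo code leaves open: the element type (float), the flat row-major storage of the matrices and the helper min_f are chosen for illustration only. The sum is accumulated in a local variable, which makes the separate zero-fill loop unnecessary.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two values. */
static float min_f(float x, float y)
{
    return x < y ? x : y;
}

/* a: M x P, b: N x P, d: M x N, all stored row-major in flat arrays
 * (the layout and the element type are assumptions of this sketch). */
void mins(int M, int N, int P, const float *a, const float *b, float *d)
{
    assert(M > 0 && N > 0 && P > 0);
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f; /* the P additions for one point are sequential */
            for (int p = 0; p < P; p++)
                sum += min_f(a[(size_t)i * P + p], b[(size_t)j * P + p]);
            d[(size_t)i * N + j] = sum;
        }
}
```

The two outer loops are independent of each other, which is the parallelism that the text refers to: every (i, j) point can be computed by its own comparator and adder.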
3.2 Covariance and normalization
The function COVAR calculates the covariance matrix; it needs some subroutines to perform
the matrix multiplication and to calculate the mean of each column (Figure 3.3). After
the covariance matrix is calculated, it is multiplied with the pair-wise space in the function
COV_NORM (Figure 3.2). In this function all negative values in the covariance matrix are
nullified. The result of this multiplication is divided by the normalization matrix. The
function checks for division by zero; if this occurs the result is set to zero, as in the
original MATLAB code.
Most computations are done in the matrix multiplication, which is used in the COVAR
function. The number of multiplications is MxNxP. The P additions for every point in the
matrix have to be performed sequentially, because they write to the same address.
Figure 3.2: Pseudo code of the covariance and normalization step
//Input: matrix COV is covariance matrix
// matrix D is the pair-wise space
// array Na and Nb are the normalization vectors
//Output: matrix E is the pair-wise space corrected for continuous ratios and