Low-Power Vectorial VLIW Architecture for Maximum Parallelism Exploitation of Dynamic Programming
Algorithms
Miguel Tairum Cruz
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Dr. Nuno Filipe Valentim Roma
Dr. Pedro Filipe Zeferino Tomás
Examination Committee
Chairperson: Dr. Nuno Cavaco Gomes Horta
Supervisor: Dr. Nuno Filipe Valentim Roma
Members of the Committee: Dr. João Paulo de Castro Canas Ferreira
October 2014
Acknowledgments
The work presented herein was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under project Threads (ref. PTDC/EEA-ELC/117329/2010).
First and foremost I would like to thank my parents and closest friends for their continued support and motivation. I owe a huge debt of gratitude to my supervisors, Professors Nuno Roma and Pedro Tomás, for their continued support, guidance and motivation. A very special thanks goes to my colleague Nuno Neves from INESC-ID's Signal Processing Systems group, for without his work mine would not have been possible. I would also like to thank my colleague João Luís Furtado from IST for his help and insight into my work.
Abstract
Dynamic Programming algorithms are often used in many areas to divide a complex problem into several simpler, yet mutually dependent, sub-problems. Typical approaches explore data level parallelism by relying on specialized vector instructions. However, the fully-parallelizable scheme is often not compliant with the memory organization of general purpose processors, leading to sub-optimal parallelism and worse performance. The proposed architecture exploits both data and instruction level parallelism, by statically scheduling a bundle of instructions to several different vector execution units. This achieves better performance than vector-only architectures, with lower hardware requirements and thus lower power consumption. Accordingly, performance and energy efficiency metrics were used to benchmark the proposed architecture against a dual-issue, low-power ARM Cortex-A9, a multiple-issue, out-of-order, high-performance Intel Core i7 and a dedicated ASIP architecture. In a fair comparison where all processors compute 128-bit vectors (or equivalent), the results show that the proposed architecture can achieve up to 5.53x, 1.12x and 2.35x better performance-energy efficiency than the ARM Cortex-A9, the Intel i7 and the dedicated ASIP, respectively, and a performance improvement of up to 4.34x, 5.01x and 1.12x regarding the ARM, the dedicated ASIP and the Intel i7, respectively, for the evaluated algorithm implementations.
Keywords
Dynamic Programming, Data Level Parallelism, Instruction Level Parallelism, VLIW, Low-power
Resumo
Dynamic programming algorithms are widely used in several areas, dividing a complex problem into multiple simpler sub-problems with several dependencies between them. Typical approaches exploit data parallelism through vector instructions. However, in general purpose processors, due to the existing memory organization, it is not possible to fully parallelize these problems efficiently, resulting in worse performance. The proposed architecture exploits both data and instruction parallelism, by statically scheduling a set of instructions to several different execution units. This achieves better performance than vector architectures, while reducing the hardware requirements and leading to lower energy consumption. Performance and energy efficiency metrics were used to benchmark the proposed architecture against an ARM Cortex-A9 (dual-issue, low power), an Intel Core i7 (multiple-issue, high performance) and a dedicated ASIP architecture. In a fair comparison with 128-bit vectors, the obtained results show that the proposed architecture achieves a performance and energy efficiency ratio up to 5.53x, 1.12x and 2.35x better than the ARM Cortex-A9, the Intel i7 and the dedicated ASIP, respectively. In terms of performance, the proposed architecture achieves results 4.34x, 5.01x and 1.12x better than those of the ARM, the dedicated ASIP and the Intel i7, respectively, for the evaluated algorithm implementations.
Palavras Chave
Dynamic programming, Data parallelism, Instruction parallelism, VLIW, Low power
DP is an algorithmic methodology for solving complex problems by dividing them into smaller sub-problems that are simpler to solve. If these sub-problems are solvable and the optimal solution for each sub-problem is found, the solution for the main problem can be obtained from the sequence of solutions of its sub-problems. This property is known as the optimal substructure property [15], and problems that present it can be solved by DP. Another property that a problem must have to be amenable to a DP approach is that the space of sub-problems must be "small", in the sense that a recursive algorithm for the problem solves the same sub-problems over and over, rather than always generating new sub-problems. Contrary to plain recursive solutions, DP takes advantage of these overlapping sub-problems by solving each sub-problem only once and then storing its solution. If the solution is later required, it can be looked up instead of recomputed. DP thus uses additional memory to save computation time, resulting in a time-memory tradeoff, where the savings can often transform an exponential-time solution into a polynomial-time one.
There are usually two equivalent DP approaches that can be implemented: a top-down approach with memoization and a bottom-up approach. The first uses a recursive method, storing the intermediate result of each sub-problem and returning the saved value when required (memoization), thus saving further computation at the given recursive level. The latter approach depends on the size of the sub-problems, solving them in order of size, smallest first. Each sub-problem is solved only once, with the guarantee that all the prerequisite (and smaller) sub-problems have already been solved. These two approaches yield algorithms with the same asymptotic running times, with the bottom-up approach often having much better constant factors, since it has less overhead from procedure calls.
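To make the two approaches concrete, the following toy sketch in C (illustrative only, using the Fibonacci sequence as a stand-in DP problem; all identifiers are hypothetical) contrasts top-down memoization with a bottom-up computation:

```c
/* Top-down with memoization: the caller initializes memo[0..n] to -1.
 * Each sub-problem is solved once; later calls look up the stored value. */
long fib_topdown(int n, long memo[])
{
    if (n <= 1) return n;
    if (memo[n] != -1) return memo[n];          /* memoization: reuse result */
    return memo[n] = fib_topdown(n - 1, memo) + fib_topdown(n - 2, memo);
}

/* Bottom-up: solve sub-problems in order of size, smallest first,
 * keeping only the two previous results (no procedure-call overhead). */
long fib_bottomup(int n)
{
    long prev = 0, cur = 1;
    for (int i = 2; i <= n; i++) {
        long next = prev + cur;                 /* each sub-problem solved once */
        prev = cur;
        cur = next;
    }
    return n == 0 ? 0 : cur;
}
```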
DP algorithms are often represented in matrix form, where each cell corresponds to a sub-problem that depends on adjacent cells (the sub-problem dependencies). This results in a final matrix where the last cell can only be computed after all the previous cells have been computed (optimal substructure property). This representation allows multiple independent cells to be processed in parallel, thus increasing the performance.
These algorithms are used in a wide variety of problems (matrix chain multiplication, sequence alignment, optimal binary search trees, shortest paths, among others), as long as those problems present optimal substructure and overlapping sub-problems. The following sections detail the specific DP problems and respective algorithms that were studied and used throughout this work.
2.2 Sequence Alignment
Bioinformatic applications have an essential role in molecular biology and related fields. Sequence alignment algorithms, like the NW [5] or the SW [4], use DP methods to search for similarities between DNA or protein sequences within large databases (e.g., GenBank/EMBL/DDBJ [2]).
Depending on the type of alignment that is required, two DP-based algorithms can be used: the SW, which outputs a local alignment; and the NW, which outputs a global alignment for any
two given sequences. A local alignment represents a region of greater similarity between the compared
sequences and is preferred when the query sequence (sequence to compare to a database) is smaller
than the database sequence. The global alignment method, on the other hand, spans the entire query sequence in an attempt to align every symbol in the sequence with the whole database sequence. This is useful when comparing sequences of about the same size that are known to be similar (DNA or protein
sequences with similar functions).
Besides the NW and the SW, there are also other sequence alignment algorithms based on HMMs. HMMs are stochastic models of a process in which the future states depend only on the present state and not on the complete sequence of states that preceded it. In addition, the states are hidden from the observer, who only has information regarding the observed outputs generated by the hidden sequence of states.
In particular, the Viterbi algorithm [8] is a DP algorithm used to solve HMM problems, returning the most probable state sequence that originated the observed sequence of outputs. Although it belongs to a different family of DP algorithms, Viterbi shares many properties with the sequence alignment algorithms mentioned before (NW and SW) [16].
Although the alignment algorithms mentioned above result in optimal alignments (whether global or local), there are other commonly used tools in the field based on faster heuristic approaches (instead of DP approaches) with reduced complexity, implemented in GPPs. Some examples are the BLAST [6], FASTA [7] and HMMER ([16], [9]) tools. However, these tools can only guarantee a good approximate alignment, not always the best one, often requiring a subsequent pass of a more complex DP algorithm (like the SW or Viterbi) for better results.
2.2.1 Needleman-Wunsch Algorithm
The NW algorithm [5] is a DP algorithm for computing the global alignment between a query sequence and a database reference sequence. The resulting score represents the best alignment between the compared sequences (a query sequence Q of size n and a database sequence D of size m) and is based on a substitution score matrix Sm (which defines the scores given to substitution mutations), a gap penalty α (a negative score given to an insertion or deletion mutation) and a recurrence relation that computes the resulting score matrix H (see equation (2.2)). This algorithm takes O(nm) time to complete.
$$H_{i,0} = \alpha \cdot i, \qquad H_{0,j} = \alpha \cdot j \tag{2.1}$$

$$H_{i,j} = \max \begin{cases} H_{i-1,j-1} + Sm(q_i, d_j) \\ H_{i-1,j} + \alpha \\ H_{i,j-1} + \alpha \end{cases} \tag{2.2}$$
From the equations above, it can be seen that the computation of each cell in the resulting H matrix has three dependencies: the cell at its left position (horizontal dependency); the cell at its top position (vertical dependency); and the cell at its top-left position (diagonal dependency). The scores given by the vertical and horizontal dependencies are penalized by the gap cost and correspond to an insertion or deletion in the alignment. The score given by the diagonal dependency is added to the substitution score matrix entry for the current cell and corresponds to a match or mismatch in the alignment. The maximum of these 3 values becomes the final cell value. Figures 2.1 (a) and (b) show an example of the NW algorithm for two small DNA sequences.
Figure 2.1: Example of the NW algorithm ((a) first iteration, (b) last iteration) and its respective traceback phase ((c) first iteration, (d) last iteration), taken from the applet available in [17]. Two sequences (ACC and CACT) are compared, with a gap penalty of -1 and a matching score of 2 (mismatch of -1) for all symbols. The resulting alignment sequence is [ -ACC : CACT ].
After the H matrix is computed, the last cell entry (H_{n,m}) holds the maximum score among all possible alignments. To obtain the actual alignment, a traceback algorithm starting at this maximum-score cell is executed (see figures 2.1 (c) and (d)). The traceback compares the three dependencies of the cell currently being considered, to determine which one of them was the source of the current cell result. The chosen cell then becomes part of the alignment sequence and the traceback repeats this process for the chosen cell. When the first cell of the H matrix (H_{0,0}) is reached, the traceback ends.
Different alignment sequences can be found whenever there is more than one possible cell to choose from during the traceback. This happens when, during the score computation in the NW algorithm, there is more than one maximum result in the main recursion, i.e., the cell has more than one source.
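As an illustration, a minimal C sketch of the H-matrix computation of equations (2.1)-(2.2) could look as follows (a hypothetical sketch: `sm` uses the toy match/mismatch scores of figure 2.1, and the traceback is omitted):

```c
#define MAX3(a,b,c) ((a) > (b) ? ((a) > (c) ? (a) : (c)) : ((b) > (c) ? (b) : (c)))

/* Toy substitution score: +2 on a match, -1 on a mismatch (as in fig. 2.1). */
static int sm(char q, char d) { return (q == d) ? 2 : -1; }

/* Fills the (n+1) x (m+1) score matrix H for query Q and database D,
 * with a (negative) gap penalty alpha, per equations (2.1)-(2.2). */
void nw_fill(int n, int m, const char *Q, const char *D,
             int alpha, int H[n + 1][m + 1])
{
    for (int i = 0; i <= n; i++) H[i][0] = alpha * i;     /* eq. (2.1) */
    for (int j = 0; j <= m; j++) H[0][j] = alpha * j;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            H[i][j] = MAX3(H[i-1][j-1] + sm(Q[i-1], D[j-1]), /* match/mismatch */
                           H[i-1][j] + alpha,                /* deletion */
                           H[i][j-1] + alpha);               /* insertion */
}
```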
The NW algorithm is rarely used when the sequences under comparison have different sizes, since the resulting alignment would be dominated by gaps. This happens because the global alignment tries to align whole sequences, while the local alignment only tries to align similar regions, thus performing much better with different sequence sizes. Since differently sized sequences are commonly used, the NW algorithm does not get as much exposure as the SW algorithm, resulting in fewer implementations of it.
2.2.2 Smith-Waterman Algorithm
The SW algorithm is a DP algorithm for computing the optimal local alignment score between a
query and a reference sequence. The resulting score represents the degree of similarity between the
sequences and, similarly to the NW algorithm, it is based on a substitution score matrix and a gap-
penalty function. The algorithm was proposed by Smith and Waterman [4] and was later improved by Gotoh [10] to support affine (multiple-sized) gap penalties, having an O(nm) time complexity, where n and m are the
sizes of the query (Q) and reference (D) sequences, respectively.
Given a substitution score matrix Sm, a negative gap-open penalty α and a negative gap extension
penalty β, the score matrix H can be computed by the following recurrence relations:
$$H_{i,j} = \max \begin{cases} 0 \\ E_{i,j} \\ F_{i,j} \\ H_{i-1,j-1} + Sm(q_i, d_j) \end{cases} \tag{2.3}$$

with H_{i,0} = H_{0,j} = 0.
The terms E_{i,j} and F_{i,j} are defined in equations (2.4) and (2.5), respectively. E_{i,j} corresponds to the scores ending with a gap in the reference sequence (horizontal dependency), while F_{i,j} corresponds to the scores ending with a gap in the query sequence (vertical dependency). Accordingly, H_{i,j} represents the local alignment score involving the first i symbols of Q and the first j symbols of D (diagonal dependency).
$$E_{i,j} = \max \begin{cases} E_{i,j-1} + \beta \\ H_{i,j-1} + \alpha \end{cases} \tag{2.4}$$

with E_{i,0} = E_{0,j} = 0.

$$F_{i,j} = \max \begin{cases} F_{i-1,j} + \beta \\ H_{i-1,j} + \alpha \end{cases} \tag{2.5}$$

with F_{i,0} = F_{0,j} = 0.
These relations are very similar to those of the NW algorithm. In fact, each cell still has the three dependencies in its computation (horizontal, vertical and diagonal), with the horizontal and vertical dependencies representing insertions or deletions in the alignment, and the diagonal dependency representing a match or mismatch between the sequence symbols.
The only major difference is the fact that the H cell values never go below zero. As a result, the maximum cell value in the H matrix is not necessarily at the last position (H_{n,m}), as in the NW algorithm. Figures 2.2 (a) and (b) show an example of the SW algorithm for two small DNA sequences.
Figure 2.2: Example of the SW algorithm ((a) first iteration, (b) last iteration) and its respective traceback phase ((c) first iteration, (d) last iteration), taken from the applet available in [17]. Two sequences (ACC and CACT) are compared, with a gap penalty of -1 and a matching score of 2 (mismatch of -1) for all symbols. The resulting alignment sequence is [ -AC : CAC ].
The traceback phase (see figures 2.2 (c) and (d)) starts at the H matrix cell that has the highest score value, and continues along the sources of each considered cell until it reaches a zero-valued cell, instead of stopping only at the first position of the H matrix (H_{0,0}), as in the NW algorithm.
Together, these two differences in the score computation and traceback phases of the algorithm result in the most similar region between the two compared sequences, i.e., the local alignment.
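A corresponding C sketch of the SW/Gotoh recurrences of equations (2.3)-(2.5) is shown below; it is a simplified illustration rather than an optimized implementation, and reuses the toy `sm` scoring function from the NW sketch:

```c
static int max2(int a, int b) { return a > b ? a : b; }
static int sm2(char q, char d) { return (q == d) ? 2 : -1; } /* toy scoring */

/* Returns the optimal local alignment score for query Q (size n) and
 * reference D (size m), with negative gap-open (alpha) and gap-extension
 * (beta) penalties, following equations (2.3)-(2.5). H, E and F are
 * caller-allocated (n+1) x (m+1) matrices. */
int sw_score(int n, int m, const char *Q, const char *D, int alpha, int beta,
             int H[n + 1][m + 1], int E[n + 1][m + 1], int F[n + 1][m + 1])
{
    int best = 0;
    for (int i = 0; i <= n; i++) H[i][0] = E[i][0] = F[i][0] = 0;
    for (int j = 0; j <= m; j++) H[0][j] = E[0][j] = F[0][j] = 0;
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            E[i][j] = max2(E[i][j-1] + beta, H[i][j-1] + alpha); /* eq. (2.4) */
            F[i][j] = max2(F[i-1][j] + beta, H[i-1][j] + alpha); /* eq. (2.5) */
            int diag = H[i-1][j-1] + sm2(Q[i-1], D[j-1]);
            H[i][j] = max2(max2(0, diag), max2(E[i][j], F[i][j])); /* eq. (2.3) */
            if (H[i][j] > best) best = H[i][j]; /* traceback starts at the max */
        }
    }
    return best;
}
```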
2.2.3 Hidden Markov Models
A Markov model is a stochastic model where the future states of a process depend only on the present state and not on the complete sequence of states that preceded it. This particular property can be expressed by equation (2.6), for a given state sequence {w_1, w_2, ..., w_n} [18].
$$P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}) \tag{2.6}$$
A particular Markov model is the Hidden Markov Model (HMM) [19], where some (or all) states are hidden from the observer. In an HMM, the observer only has information regarding the sequence of outputs that were generated by a hidden sequence of states.
An alternative mathematical expression for the HMM can be deduced by applying Bayes' rule for a given state sequence {w_1, w_2, ..., w_n} and an output (observations) sequence {u_1, u_2, ..., u_n}:

$$P(w_1, \ldots, w_n \mid u_1, \ldots, u_n) = \frac{P(u_1, \ldots, u_n \mid w_1, \ldots, w_n)\, P(w_1, \ldots, w_n)}{P(u_1, \ldots, u_n)}$$

where P(w_1, ..., w_n) is the prior probability of a given state sequence, P(u_1, ..., u_n) is the prior probability of seeing a particular sequence of outputs, P(u_1, ..., u_n | w_1, ..., w_n) is the probability of observing those outputs given the state sequence, and P(w_1, ..., w_n | u_1, ..., u_n) is the posterior probability of the state sequence given the observed outputs, which is the one the HMM aims to find.
Two different tasks, with different outputs, can be performed on HMMs: decoding and generation. The first outputs the path of states that is most likely to have generated a given output sequence, along with its corresponding probability. The latter presents the likelihood of a given sequence being generated by the model. The decoding task is computed by the Viterbi algorithm [8], which computes the most probable state to generate each new output observation, for all the available states. The generation task is computed by a similar algorithm, the Forward algorithm, which calculates a progressive sum of the probabilities of all previous state paths for each new observation, resulting in a final probability consisting of the sum of the final probabilities of all the states.
Although the above description of HMMs refers to a process of single alignment (one sequence against another), the algorithms mentioned above are used in real applications for searching for similar sequences in a database, and thus require a method to search and compare a group of sequences against a database (instead of only one). This is achieved by creating alignment profiles, which highlight the common features of a family of sequences and effectively model the entire family (see figure 2.3). These profiles are usually generated by an initial multiple alignment, followed by a probabilistic breakdown of the elements present in each position.
With alignment profiles, a query can now be compared against a whole family of sequences (the profile), thus greatly reducing the computational cost. Furthermore, a profile gives a more accurate representation of the defining characteristics of a family, by weighting the elements in proportion to their actual frequency (and thus importance) in the underlying family.
Alignment:

    A T C C A G C T
    G G G C A A C T
    A T G G A T C T
    A A G C A A C C
    A T G C C A T T
    A T G G C A C T

Profile:

    A  5 1 0 0 5 5 0 0
    C  0 0 1 4 2 0 6 1
    G  1 1 6 3 0 1 0 0
    T  1 5 0 0 0 1 1 6

Consensus: A T G C A A C T

Figure 2.3: Example of a Consensus Profile, derived from a multiple alignment of a family of similar sequences.
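As a toy illustration of this construction, the counts in figure 2.3 can be obtained by a simple per-column tally (a hypothetical sketch; pseudocounts and gap handling are omitted):

```c
#include <string.h>

/* Builds the count profile of a multiple alignment (nseqs sequences of
 * length len over the alphabet ACGT), as in figure 2.3. The consensus
 * is the most frequent symbol in each column. */
void build_profile(const char *seqs[], int nseqs, int len,
                   int counts[4][len], char consensus[])
{
    const char alpha[] = "ACGT";
    for (int j = 0; j < len; j++) {
        for (int a = 0; a < 4; a++) counts[a][j] = 0;
        for (int s = 0; s < nseqs; s++)          /* tally column j */
            counts[(int)(strchr(alpha, seqs[s][j]) - alpha)][j]++;
        int best = 0;
        for (int a = 1; a < 4; a++)
            if (counts[a][j] > counts[best][j]) best = a;
        consensus[j] = alpha[best];              /* most frequent symbol */
    }
}
```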
Since the Viterbi algorithm gives the most likely path of states to generate a given sequence, together with its corresponding probability, it is suitable for computing sequence alignment problems. The Forward algorithm, however, can only indicate the likelihood of the query sequence belonging to a family of sequences. Accordingly, and given that both algorithms are very similar, this work will focus on the Viterbi algorithm.
2.2.3.A Profile Hidden Markov Models
As previously mentioned, HMMs can be used to statistically model the distribution of sequence elements in a profile, by using the probability of each element in each position of the family's sequences as the emission probability in each state. Thus, a Profile HMM can be used to compute the probability of database sequences being generated by a given query, i.e., to align the query sequence to a database.
The construction of the Profile HMM model starts by modeling a global alignment (with no gaps) as a succession of consecutive matching states, where each state corresponds to a column of the profile sequences. Each matching state is also accompanied by emission probabilities, since these states emit the alignment symbols. These probabilities are derived from the relative frequencies of symbols in the family's sequences, at each column (see figure 2.4 (a)).
Then, insertion states are added to the model to represent gaps, i.e., portions of sequences that do not match anything in the previous model (with only the match states). Since insertions can occur at any point in the model, there is an Insert state for every pair of Match states (see figure 2.4 (b)).
Figure 2.4: Example of the construction of a Profile HMM, starting with a HMM composed solely of matching states, allowing for ungapped global alignment (a), and with the addition of insert states that allow arbitrary insertions (b), with the respective state connections.
Furthermore, in order to support the affine gap model, the insert states must also have a self-loop, to allow for long inserted regions. The probability of entering an Insert state for the first time and the probability of staying in the Insert state can also be different and, since they are arbitrary, are usually set equal to the background probabilities of the profile.
Finally, Deletion states are added to the model. These states represent portions of the profile that are not matched by the sequence, and thus do not emit any output symbol. Naturally, the Delete-Delete state transition corresponds to the gap-extend cost, thus completing the Profile HMM model [20] (see figure 2.5).
Figure 2.5: HMM for the optimal gapped global alignment (additional transitions from insert states to delete states, and vice-versa, are included for the sake of correctness, although these transitions are usually very improbable and have a negligible effect).
The profile HMMs can also be extended to support local alignment. This can be done by adding two special flanking states that delimit the sub-region of the local alignment (states B and E), and two self-looping flanking states (states N and C) that precede or follow them [19] (see figure 2.6). The flanking regions correspond to the unmatched regions of the aligning sequence, so, in order to capture a local alignment, it is only required to add these two regions as self-looping states with transitions from and to each match state. These new states also emit tokens with a probability distribution, which can be set to the background random distribution of the profile.
Finally, in order to support multihit alignments, i.e., multiple local alignments, another special state is required (state J), which connects the flanking states B and E. This new state has jump and loop probabilities, in order to cover the unmatched region between two local alignments (see figure 2.7).
Figure 2.6: Profile HMM for unihit local alignment.
Figure 2.7: Profile HMM for multihit local alignment.
2.2.3.B Viterbi Algorithm
As previously stated, the Viterbi algorithm [8] is a DP algorithm that finds the most likely sequence of hidden states in a HMM (or a Profile HMM), for a given sequence of observed outputs. The required inputs for the algorithm are:
• State Space (S): Vector with all the possible states.
• Observation Space (O): Vector with all the observed outputs.
• Observation Sequence (Yi): Vector with the sequence of observed outputs.
• Initial Probabilities (πi): Vector of initial state probabilities.
• Transition probabilities (T_{i,j}): Matrix of transition probabilities from state i to state j.
• Emission probabilities (Q_{i,j}): Matrix with the probabilities of observing output i given state j.
Given the inputs above, the algorithm computes the most probable state sequence {x_1, ..., x_T} that originated the observed outputs {y_1, ..., y_T}, using the following recurrence relation:
$$V_{t,k} = \begin{cases} Q_{y_t,k} \times \pi_k & t = 1 \\ Q_{y_t,k} \times \max_{x \in S}\left(T_{x,k} \times V_{t-1,x}\right) & t \neq 1 \end{cases} \tag{2.9}$$
where V_{t,k} is the probability of the most likely hidden state sequence, responsible for the first t observations, that has k as its final state.
From the relations in equation (2.9), it can be seen that during the first iteration of the algorithm (t = 1), the probability of the first hidden state only depends on the initial probabilities and on the emission matrix, since there are no previous states yet.
For the remaining iterations (t ≠ 1), the computation for every hidden state k (which becomes x_t after the computation) requires, for every state x in S, the probability of the preceding state x_{t-1}, as well as the emission probability Q of observing the output given the state k, and the transition probability T of passing from the preceding state x_{t-1} to the one currently being computed (k).
This yields a time complexity of O(T·S²). Figure 2.8 shows an example of the Viterbi algorithm for a HMM with 2 states.
Figure 2.8: Trellis diagram for a sequence of three observations in the Viterbi algorithm. Each state is annotated with the corresponding probability value. The T and Q values are not depicted. The red path corresponds to the most probable state sequence.
From a DP perspective, the main problem can be seen as the full sequence of hidden states that Viterbi tries to find, while the sub-problems are the computations of the probabilities of each hidden state along the sequence, since these only depend on the previous state.
Since the algorithm only stores information regarding the previous state, a back pointer for the pre-
ceding state should also be used, in order to retrieve the final state sequence path at the end of the
computation (traceback).
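A minimal, non-optimized C sketch of this recurrence and its traceback, under the notation of equation (2.9) (all identifiers — `Tr`, `Q`, `pi`, `bp` — are illustrative), might be:

```c
/* S states, T observations; y[t] is the observed output at step t,
 * pi[] the initial probabilities, Tr[x][k] the transition probability
 * from state x to state k, and Q[o][k] the probability of emitting
 * output o in state k. bp holds back pointers for the traceback. */
void viterbi(int S, int T, const int y[T], const double pi[S],
             const double Tr[S][S], const double Q[][S],
             double V[T][S], int bp[T][S], int path[T])
{
    for (int k = 0; k < S; k++)                  /* t = 1 case of eq. (2.9) */
        V[0][k] = Q[y[0]][k] * pi[k];
    for (int t = 1; t < T; t++) {
        for (int k = 0; k < S; k++) {            /* t != 1 case of eq. (2.9) */
            double best = -1.0; int arg = 0;
            for (int x = 0; x < S; x++) {
                double p = Tr[x][k] * V[t-1][x];
                if (p > best) { best = p; arg = x; }
            }
            V[t][k] = Q[y[t]][k] * best;
            bp[t][k] = arg;                      /* back pointer to x_{t-1} */
        }
    }
    /* Traceback: start at the most probable final state and follow bp. */
    int k = 0;
    for (int x = 1; x < S; x++) if (V[T-1][x] > V[T-1][k]) k = x;
    for (int t = T - 1; t >= 0; t--) { path[t] = k; if (t > 0) k = bp[t][k]; }
}
```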
2.2.4 Comparison between profile HMMs and single alignment algorithms
The Viterbi algorithm, used for profile HMMs, is very similar to the previously presented SW algorithm. In fact, when profile HMMs are used for solving sequence alignment problems, both algorithms have the same recursive dependencies, with only small differences. The following equations represent the application of the Viterbi algorithm using a notation similar to the one adopted for the SW algorithm.
$$M_{i,j} = \log e_{M_j}(x_i) + \max \begin{cases} B_{i-1} + \log t_{B\,M_j} \\ M_{i-1,j-1} + \log t_{M_{j-1}\,M_j} \\ I_{i-1,j-1} + \log t_{I_{j-1}\,M_j} \\ D_{i-1,j-1} + \log t_{D_{j-1}\,M_j} \end{cases} \tag{2.10}$$

$$D_{i,j} = \max \begin{cases} M_{i,j-1} + \log t_{M_{j-1}\,D_j} \\ D_{i,j-1} + \log t_{D_{j-1}\,D_j} \end{cases} \tag{2.11}$$

$$I_{i,j} = \max \begin{cases} M_{i-1,j} + \log t_{M_j\,I_j} \\ I_{i-1,j} + \log t_{I_j\,I_j} \end{cases} \tag{2.12}$$
where the M, D and I state values correspond to the H, E and F values in the SW equations, respectively. The B state corresponds to the special state previously explained, and is omitted here for comparison purposes.
The above equations are represented in log-space, not only to eliminate the multiplications, but also to provide better accuracy. The log e_{M_j}(x_i) and log t_{X_j Y_j} terms thus correspond to the emission scores and to the transition scores from state X_j to state Y_j, respectively, which are pre-computed scores present in the profile. The transition values can be compared to the gap scores in the SW algorithm, but with an added delay, since they depend on the position-specific transition states, requiring a previous look-up. Additionally, the M emission values roughly correspond to the substitution score matrix in the SW algorithm, varying according to the current Match state M_j and sequence symbol. Thus, they are already organized in a model-specific profile, and can be re-used between sequences, essentially acting like a Query-Specific Profile for the SW algorithm.
Apart from the differences stated above, the M state requires the I_{i-1,j-1} and D_{i-1,j-1} states, instead of the I and D states computed in the current iteration, as happens for the SW algorithm in equation (2.3). This increases the number of registers and the amount of memory required for the dependency values, since all dependencies must now be stored to be used in a future iteration. It also leads to a reorganization of the algorithm computations, with delayed loads and stores of the M state values, where the dependencies required for the M state are only updated after the new M state values are computed.
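For illustration, one cell update of equations (2.10)-(2.12) could be sketched in C as follows (all identifiers are hypothetical; the B state term is omitted here, as in the comparison above):

```c
static inline double max2d(double a, double b) { return a > b ? a : b; }

/* One cell update in log-space. log_e[j][x] is the emission score of
 * symbol x in match state j; log_t_XY[j] are the pre-computed transition
 * scores stored in the profile (e.g. log_t_MM[j-1] is M_{j-1} -> M_j). */
void phmm_cell(int i, int j, int x_i,
               double **M, double **I, double **D, double **log_e,
               const double *log_t_MM, const double *log_t_IM,
               const double *log_t_DM, const double *log_t_MD,
               const double *log_t_DD, const double *log_t_MI,
               const double *log_t_II)
{
    /* eq. (2.10): note M reads the PREVIOUS iteration's I and D values. */
    M[i][j] = log_e[j][x_i]
            + max2d(max2d(M[i-1][j-1] + log_t_MM[j-1],
                          I[i-1][j-1] + log_t_IM[j-1]),
                          D[i-1][j-1] + log_t_DM[j-1]);
    /* eq. (2.11) */
    D[i][j] = max2d(M[i][j-1] + log_t_MD[j-1], D[i][j-1] + log_t_DD[j-1]);
    /* eq. (2.12) */
    I[i][j] = max2d(M[i-1][j] + log_t_MI[j], I[i-1][j] + log_t_II[j]);
}
```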
2.3 Implementation of DP Algorithms
When solving a problem using a DP-based approach, it is first necessary to decompose it into a set
of smaller sub-problems. This translates into computing the value of each cell in an n-dimensional matrix, by relying on the values of pre-computed adjacent cells. Given the usually large size of these matrices, it is of utmost importance to implement additional methods to parallelize and speed up DP algorithms.
Since the sub-problems are usually independent, DLP is often exploited in order to maximize the number of independent cells (sub-problems) computed in each iteration of the algorithms. Additionally, ILP can be used together with DLP in order to minimize the hardware impact brought by DLP, achieving better performance with lower power usage. In current architectures, however, it is often not possible to reconcile both parallelism paradigms to their full extent, given memory access constraints or other pre-existing structural design choices.
The following sections will cover both the DLP and ILP paradigms, as well as some state-of-the-art
architectures, both programmable and dedicated.
2.3.1 Data Level Parallelism
As previously mentioned, DLP is widely exploited in DP algorithms, since most of the data to be computed is independent. This means that, at any given time during the computation, it is possible to operate on different data elements simultaneously, i.e., to operate over a vector of elements. Given the matrix-like representation of DP algorithms, this translates into a vector of cells. Using a 2D matrix as an example, and given the three data dependencies (vertical, horizontal and diagonal) found in the sequence alignment algorithms previously mentioned, it is possible to see that the only vector composed exclusively of independent cells is the one formed by the cells along the anti-diagonal, as depicted in figure 2.9. Any other vector composition would result in data hazards, as the dependencies required for a given cell would not have been calculated before that cell. Since the sub-problems in a DP algorithm usually have the same set of operations applied to them, each data vector only requires one set of operations applied to it in order to compute all the cells it contains. Ideally, this results in a speedup equal to the number of cells in each vector, in comparison with a single-data architecture.
Figure 2.9: Example of DP Cell parallelism.
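For instance, a generic anti-diagonal traversal of a 2D score matrix can be sketched as follows (a hypothetical illustration; `cell_update` stands in for the per-cell recurrence):

```c
/* All cells on diagonal d = i + j depend only on diagonals d-1 and d-2,
 * so the inner-loop iterations are mutually independent and can be
 * mapped to SIMD lanes or to separate execution units. */
void antidiagonal_sweep(int n, int m, int H[n + 1][m + 1],
                        int (*cell_update)(int i, int j))
{
    for (int d = 2; d <= n + m; d++) {
        int i_lo = (d - m > 1) ? d - m : 1;     /* clip diagonal to matrix */
        int i_hi = (d - 1 < n) ? d - 1 : n;
        for (int i = i_lo; i <= i_hi; i++) {    /* independent iterations */
            int j = d - i;
            H[i][j] = cell_update(i, j);        /* per-cell recurrence */
        }
    }
}
```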
Most current processors already include vectorial instruction sets, like the Streaming SIMD Extensions (SSE) or the Advanced Vector Extensions (AVX), making DLP an easy and viable option. For this reason, and given the performance boost it provides, DLP is used in most GPP implementations of DP algorithms, as well as in dedicated architectures.
2.3.2 Instruction Level Parallelism
When compared to DLP, ILP does not have as great an impact on DP algorithms. Since the different sub-problems often require the same set of operations to be solved, there is no particular need for different operations to be concurrently computed, as it may even lead to race conditions and, eventually, data or structural hazards. However, if there is a guarantee of no race conditions (and thus no hazards) while computing different cells at different operation steps of the computation, ILP can potentially reduce the hardware required by a DLP-only solution. In fact, vectorial DP algorithms do not make the best usage of the available hardware, since, at any given time, only one type of operation is being computed on a given vector. With the addition of ILP it is possible to have, at any given time, different subsets of cells computing different operations. This way, the length and number of functional units in the architecture can be reduced, promoting better hardware usage, since more functional units would be working at the same time. The solution just described is the one implemented by VLIW architectures, where a larger single instruction is issued, instructing different operations to different data elements in parallel. Although this solution seems attractive from a hardware requirements point of view, VLIW architectures, where the work of extracting the ILP is the compiler's responsibility, are not that common. Most current processors have, however, different ILP mechanisms: instruction pipelining, out-of-order execution and branch prediction are just some of the methods that are frequently present in most processors, and which are used alongside the DLP extensions to maximize the performance of DP algorithms.
2.3.3 State of the Art Architectures
Hardware architectures can be divided into programmable and non-programmable architectures. Programmable architectures present greater flexibility when compared with non-programmable ones, since they are easily adapted to new types of problems and algorithms. They are commonly found in GPPs, which are used for a wide spectrum of applications.
Non-programmable architectures, on the other hand, are usually implemented in dedicated hardware, such as ASICs or FPGAs, and are commonly used to solve a specific task or family of similar tasks. This type of architecture is mainly designed for speed and optimization, often resulting in high performance with low power consumption, but also with higher complexity and implementation costs than programmable architectures.
In particular, FPGAs are regarded as a hardware alternative to GPPs and ASICs, balancing the flexibility often found in GPPs with the performance often found in more dedicated architectures. They allow reconfigurable designs with a smaller design cycle, although achieving neither the high performance offered by an ASIC nor the programmability of the GPP. However, when compared to the other architectures, the FPGA can still offer a better performance-programmability trade-off, depending on the applications to be implemented.
Due to their large computational times, DP algorithms require fast implementations in order to keep up with the growing size of the sequences being considered in sequence alignment problems. Although the commonly used (but sub-optimal) tools for these types of problems (BLAST [6], FASTA [7], HMMER [9]) are implemented in GPPs (due to their flexibility), many dedicated architectures have arisen, bringing faster algorithm computations.
The following sections give an overview of several programmable and non-programmable architectures for the implementation of the sequence alignment DP algorithms presented in the previous sections.
2.3.3.A Programmable Architectures
Vector architectures exploit data level parallelism by implementing high-level operations that work on
linear arrays of data instead of individual data items. The vector elements do not have dependencies
between them, ensuring that no data hazards occur.
Nowadays, most commercial processors have support for vector instructions, like the SSE or AVX ex-
tensions in Intel processors [21], containing dedicated registers and functional units for those particular
instructions. These instructions are classified as Single-Instruction Multiple-Data (SIMD) instructions.
Due to their ability to exploit parallelism, this type of architecture is often used for DP algorithm implementations, which usually require a large quantity of parallelizable computations.
Smith-Waterman
By reviewing the implementation of the SW algorithm presented before, it is possible to observe that, for the computation of the final score matrix H, the only cells that do not have dependencies between them are the ones along the anti-diagonal (see figure 2.11(a)). This allows an inner-loop parallel processing of vectors composed of the anti-diagonal values of the H matrix, and was first proposed by Wozniak [11]. Although the loops are fully parallelizable, this parallelization scheme has the drawback of difficult memory access patterns, introducing large overheads in data manipulation when implemented on GPPs.
Rognes and Seeberg [13] improved on Wozniak's work by pre-computing a query profile (figure 2.10 (a)) once for the entire database sequence. This query profile indexes a modified substitution score matrix by query sequence position and database sequence symbol, instead of the original matrix indexed by query sequence symbol and database sequence symbol (figure 2.10 (b)). For a given database symbol, the scores for matching it with all the query sequence symbols are stored sequentially in one column of the matrix, with other columns corresponding to other database symbols.
When implemented with Intel's AVX/SSE instructions, the vector elements are composed of cells parallel to the query sequence, instead of cells along the anti-diagonals (see figure 2.11(b)). This validates the use of the query profile, but has the disadvantage of introducing data dependencies between the cells of the vector. It also introduces conditional branches in the inner loop for the computation of the F term (see equation (2.5)) when data dependencies occur. The SWAT optimization [22] of this procedure tries to minimize the impact caused by these inter-vector dependencies. This optimization assumes that the E and F terms are often equal to zero, hence not contributing to the score value H. In fact, it was demonstrated that as long as H is not larger than the threshold α + β (the gap-open and gap-extension penalties, respectively), E and F will remain zero along the column and row of the matrix, eliminating
data hazards in the parallel computation of the vector elements. When this condition does not hold, data dependencies may arise and the affected cells require a more time-consuming computation process.

(a) Query-profiled substitution score matrix:

         C   A   C   T
    A   -1   2  -1  -1
    C    2  -1   2  -1
    C    2  -1   2  -1

(b) Substitution score matrix:

         A   C   T   G
    A    2  -1  -1  -1
    C   -1   2  -1  -1
    T   -1  -1   2  -1
    G   -1  -1  -1   2

Figure 2.10: Comparison between a substitution score matrix with and without query profiling for the DNA sequences [ACC] and [CACT]. In (a), rows correspond to the query sequence and columns to the database sequence; in (b), rows correspond to query symbols and columns to database symbols. Note that the depicted DNA sequences may be composed of 4 different symbols (A, C, T and G).
Farrar [12] tackles this problem by also organizing the SIMD registers in parallel to the query sequence (just as Rognes), but accessing them in a striped pattern (see figure 2.11(c)). This modified access pattern moves the conditional branches of the vertical dependencies to a lazy loop, executed outside the inner loop of the algorithm. This way, the conditional branches only have to be taken into account once for every database symbol. After the completion of the inner loop, a first pass is made to check the values of F, for each of the query segments, against the values of H for the given database symbol. A second pass - the lazy loop - is only needed when the values of F are greater than the values of H.
The proposed architecture targets the simultaneous exploitation of the DLP and ILP paradigms, in order to position itself as a faster solution than current DP-solving architectures, while supporting a broader range of algorithms.
As shown in the previous chapter, DP problems can be translated into an n-dimensional matrix, where each sub-problem corresponds to a cell in the matrix, with adjacent cells as prerequisite sub-problems. In a 2D matrix, this typically results in horizontal, vertical and diagonal data dependencies from the left, top and top-left cells, respectively, for each cell computation. Accordingly, to maximize the processing efficiency and to minimize the number of dependencies, cell computations should be performed in parallel along the anti-diagonal (see figure 2.9).
However, exploiting the DLP along the anti-diagonal brings two problems: harder memory organization/access (visible in Wozniak's [11] implementation of the SW algorithm) and larger hardware requirements. While the former can be solved by implementing specialized memory-access units to gather cell values from non-adjacent memory positions, the latter requires the consideration of a different type of parallelism. In fact, vector-only solutions will always result in low Functional Unit (FU) usage. For example, consider that vector processing is used to compute the value of N cells in parallel, which requires a total of M vector instructions. Assuming the absence of inherent data dependencies, and that only one of these operations is a square root, a utilization of 1/M is expected for all N parallel square-root FUs. Naturally, it is possible to reduce the number of FUs (hardware requirements) by serializing the operation over the different vector elements. However, this solution trades performance for hardware requirements, hence it is not ideal.
The alternative is to also explore ILP, by assigning different operations to be executed on different parts of the vector. This results in multiple independent units, each computing its own vector operation on a given part of the vector, which not only increases the potential for additional parallelism, but also reduces the hardware requirements and increases the utilization rate. This is the paradigm used by VLIW architectures, and it can be seen side by side with a vector-only approach in figure 3.1. In this figure, although it takes one more clock cycle for the VLIW architecture to compute the 2 instructions for all the 4 elements, it must be taken into account that the example only depicts two instructions. In fact, the ILP introduced here only has an impact on the initialization of the algorithm, resulting, for the stationary phase of the algorithm, in the same number of clock cycles as the vector-only approach. The use of ILP is also supported by the set of common steps usually included in a DP algorithm. Usually, these steps consist of dependency loads, followed by cell computations, and finalized by the store of the results. Assigning these different steps of the algorithm to different cells in the matrix (along the anti-diagonal) validates the ILP approach (different instructions operating over different cells) while also maintaining data coherence, given the independence between the cells. The only control requirement is to guarantee that the cell dependencies are always computed in advance of the cells that require them, in order to avoid data hazards.
Figure 3.1: Comparison between a Vector architecture composed of 4 elements, and an equivalent VLIW architecture with 4 units composed of 1 element each. Both examples compute a square root operation, followed by a sum operation. In the vectorial approach, two clock cycles are required to compute the two instructions, which use 4 FUs each (colored FUs). The VLIW approach takes 3 clock cycles to compute the two instructions, but only requires a maximum of two FUs per instruction (colored FUs). This is achieved by delaying the two last units in order to reduce the number of FUs.

To efficiently support DP algorithms and to simultaneously explore DLP and ILP, the proposed architecture must then comply with the following requisites: independent execution units to compute independent instructions in parallel, issued from an instruction bundle; and a Data Stream Unit (DSU) to access the memory concurrently with the execution units (to reduce the latency brought by non-adjacent memory accesses).
Each execution unit will then be assigned a different vector of cells and an independent register bank, in order to operate independently of the other units. This also enables the exploitation of memoization, a technique where the current algorithm iteration re-uses the results obtained in the previous iteration, which are stored in the register bank. This technique is especially useful in DP algorithms, where the sub-problems only depend on the results of previous sub-problems, which are represented by the results of the previous iteration (or series of iterations). This also reduces the number of required memory accesses and thus increases the performance. However, there is still the need to ensure communication between the different execution units, either because of a data dependency held in a different unit, or simply because sharing values between units would greatly benefit an algorithm. Using memory to share these values would prove inefficient. Accordingly, sharing is achieved with the addition of a small group of shared registers and sniffing mechanisms between a small subset of registers in the register bank of each execution unit.
Finally, and given the amount of data that needs to be loaded from and stored to memory, the architecture also includes a RAM memory, as well as a smaller local memory to store constants that are often used during algorithm computations. The existence of these two memories can further help reduce memory congestion, especially in memory-heavy algorithms like Viterbi.
3.2 Proposed Architecture
As previously mentioned, there are several ways to explore ILP alongside DLP. In the proposed architecture, static ILP is explored, since it requires less control hardware, thus achieving better energy efficiency. This is done by issuing an instruction bundle composed of several different instructions, each operating over a vector of independent elements (DLP) in a different execution unit. This way, instead of using a single large vector computing the same instruction (as is typical in vector architectures), the architecture has several smaller vectors, each effectively computing a different instruction.
In DP algorithms, this corresponds to the parallel processing of cells that are in different steps of the algorithm, in order to maximize the parallelism and thus reduce the required hardware. This parallelism must be exploited cautiously, since it is prone to introduce data race conditions. Only two conditions must be met in order to avoid them: all cells currently being processed must be independent; and if there are cells being processed in advance (computing a later instruction of the algorithm), there must never be dependencies on cells that are still in earlier processing steps of the algorithm. Using the anti-diagonal parallelism as an example, this second condition is met by ensuring that the cells being processed at the bottom-left section of the anti-diagonal are ahead of the cells at the top-right section of the anti-diagonal, since the dependencies propagate from left to right and from top to bottom (see figure 3.2).
Figure 3.2: Example of two iterations of a DP algorithm with 3 instructions per iteration and 4 cells being processed along the anti-diagonal. Each cell is processed in a different execution unit with a different reference symbol, which are represented by the columns. Each group of 3 instructions corresponds to one cell computation, for a given query-reference pair. The dependencies between symbols are represented by the arrows, while the rows represent clock cycles. Unit 0 is the most advanced unit.
To compute each instruction that integrates the instruction bundle, the architecture provides independent execution units. Each of these units operates on a different vector of cells and has its own register bank to locally store all the intermediate results generated by the algorithms (useful for memoization in DP algorithms), as well as any other value or dependency required in the immediate computations, thus reducing memory access operations and improving the processing performance, while maintaining an organized data structure (see figure 3.3). However, there will be situations where an execution unit requires data values from a different unit (e.g. the execution units that are ahead in the computation steps of an algorithm will generate dependencies that are required by a different, delayed unit). Thus, a sniffing mechanism, operating on a small subset of registers in each register bank, is implemented. These special registers sniff registers in an adjacent unit to ensure the commitment of the required dependencies, and access them as if they were in the same register bank, thus maintaining the independence of the register banks while keeping data coherence and avoiding unnecessary memory accesses.
Figure 3.3: Execution Units with the respective independent register banks.
In addition to sniffing, there is another way to share data between execution units (which are not adjacent) without resorting to memory. This is especially important in DP algorithms, since there are often dependency values that are used to compute all cells for a given number of iterations, being later updated and repeating the process. Since there is a constant need to load these values from memory, and given the independence between execution units, this would lead to redundant memory accesses, as different execution units would have to fetch the same values from memory. This problem is solved by adding a set of shared memory registers to each execution unit. These registers can be accessed by all execution units and thus can be used to store dependencies that are required by several units, removing the redundant memory accesses (see figure 3.4).

Figure 3.4: Register banks with the sniffing mechanism and the additional shared memory registers in each execution unit (with the respective connection to the memory).
Although the register banks are mainly used to store dependencies between iterations, DP algorithms still often need to store some of their dependencies in memory, especially when they are required in a much later iteration of the algorithm. In order to minimize the impact of the resulting memory accesses on the processing performance, a DSU is used to perform the necessary memory loads and stores in parallel with the execution units. This way, while the execution units are computing the main steps
of an algorithm, the DSU is simultaneously pre-storing or pre-loading cell values that will only be required at a later iteration. Here, contrary to common VLIW architectures, all the existing units (both the execution units and the DSU) can access the memory, requiring a priority access list to avoid conflicts. Since the DSU's main function is to perform memory access operations, it has top priority over all other units.
Figure 3.5: Proposed architecture.
To further ease the memory access delay problem, a local fast (scratchpad) memory is also included, which is used to store constant values required by several DP algorithms. These constant values are
pre-fetched at the beginning of the computation by the DSU, and can only be accessed by the execution
units (with an access priority list to decide between them).
These specifications result in the architecture presented in figure 3.5. The architecture also implements a 4-stage pipeline: a FETCH stage, where the next instruction bundle is loaded from the instruction memory; a DECODE stage, where the fetched instructions are decoded in all units; an EXECUTE stage, where the FUs and memory operate on the instructions; and a WRITE-BACK stage, where the results are written to the register banks. The pipeline, depicted in figure 3.6, also includes stalling and data-forwarding mechanisms to prevent hazards and to minimize the number of stalls on the processor, respectively.

Figure 3.6: 4-Stage pipeline structure.
Figure 3.7: Processor scalability: this example doubles the number of processed cells by doubling the vector size from n to 2n (DLP scalability, on the left) and by doubling the number of available units from 4 to 8 (ILP scalability).
By considering the set of characteristics listed above, the proposed architecture can also be easily scaled in two distinct ways (see figure 3.7): by increasing the length of each execution unit, and thus the vector length (DLP); or by increasing the number of execution units, and thus the number of parallel instructions (ILP). The first solution mainly requires an increase of the vector size processed in the functional units, while the second requires an increase in the number of functional units. Both solutions can be applied together, in order to provide a better balance between the two parallelism paradigms.
3.2.1 Register Banks
Each execution unit has its own private register bank of 28 registers, as well as a small set of 4 shared memory registers, for a total of 32 registers (illustrated in figure 3.5). Although the presence of private registers in each execution unit results in reduced register access times, better structural organization and thus better performance, it is advantageous to be able to share values between units without resorting to copying the value to a shared register. Specifically, and using the 2D matrix that represents the processing pattern of many DP algorithms as an example, horizontal and diagonal dependencies between the cells at the edges of the execution units would require, in every iteration, values to be passed from the adjacent unit to the one that requires those dependency values. Given that these dependencies occur very frequently (every iteration), the delay caused by copying the values would be very significant. To circumvent this, the previously referred sniffing mechanism is used (see figures 3.4 and 3.5).
This mechanism affects a very small number of private registers (in the 2D example, only two registers in each register bank would require sniffing), and consists of mirroring those registers to the adjacent execution unit's register bank. Accordingly, whenever an update is made to the registers being sniffed, that same update is reproduced in the adjacent execution unit. Given that the typical dependencies in DP algorithms follow a top-down and left-right approach, the sniffing mechanism is only required from one unit to the one at its right, with the last unit (the one that computes the left-most cells) not requiring sniffing.
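To make the behavior of the sniffing mechanism more concrete, the following C sketch models it in software (a minimal behavioural model, not RTL; the register positions in SNIFF_BASE/SNIFF_COUNT and the function name regbank_write are illustrative assumptions, since the text only states that two registers per bank are sniffed):

    #include <stdint.h>

    #define N_UNITS     4
    #define N_REGS      32
    #define SNIFF_BASE  26   /* assumed position of the two sniffed registers */
    #define SNIFF_COUNT 2

    static uint32_t regbank[N_UNITS][N_REGS];

    /* Write to a unit's private register; if the register is one of the
     * sniffed ones, mirror the update into the register bank of the
     * adjacent unit to the right, so the dependency value is available
     * there on the next iteration without an explicit copy.            */
    void regbank_write(int unit, int reg, uint32_t value)
    {
        regbank[unit][reg] = value;
        if (reg >= SNIFF_BASE && reg < SNIFF_BASE + SNIFF_COUNT &&
            unit + 1 < N_UNITS)
            regbank[unit + 1][reg] = value;   /* mirrored (sniffed) copy */
    }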
The existence of a sniffing mechanism does not exclude, however, the need for shared memory registers. These registers are mainly used by the DSU to communicate with the memory. This separates the parallel memory operations handled by the DSU from the intermediate results of the main algorithm computations, issued by the execution units, thus avoiding register access conflicts and data hazards. To further avoid conflicts, the sharing privileges between execution units only cover load accesses, with write access being exclusive to the register's owner unit and to the DSU. In case of a write conflict between the DSU and an execution unit, a priority list is used, with the DSU having the top priority. These memory registers also serve the purpose of reducing the number of memory accesses, in situations where a dependency value, loaded by one execution unit, is required by other units. Instead of being retrieved multiple times from memory by multiple execution units, these dependencies can be loaded by only one unit and then used by all units or, if the dependency requires constant updating, even shifted to the other units' memory registers with the help of the DSU.
Figure 3.8: FU conflict control. During the first clock cycle, 4 execution units try to compute 3 sum operations and 1 comparison. Since there are only 2 FUs capable of performing sums, the 3rd unit will hold its instruction and the previous pipeline stages will be stalled. In the second clock cycle, the FUs are now free to execute the 3rd unit's sum operation (the remaining execution units will not compute any instruction), finalizing all instructions in the bundle and resuming normal processor operation.

The proposed architecture also supports different word sizes, resulting in multiple words being stored in each register whenever the word size is a submultiple of the maximum word size. Accordingly, if the word size is half the maximum, each register will store two different words; if it is a quarter of the maximum, each register will store four different words, and so on. This design paradigm allows for different accuracy ranges, improving algorithm performance when higher accuracy is not required (more cells computed simultaneously, with the same hardware resources), while still supporting problems that require a higher level of precision.
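As a concrete illustration of this packing, the C sketch below stores two 16-bit cells in a single 32-bit register word and updates both lanes with one masked addition (a software analogy for the hardware behavior; the lane count and the masking scheme are illustrative assumptions):

    #include <stdint.h>
    #include <stdio.h>

    /* Pack two 16-bit cell values into one 32-bit register word. */
    static uint32_t pack2(uint16_t lo, uint16_t hi)
    {
        return (uint32_t)lo | ((uint32_t)hi << 16);
    }

    /* Lane-wise 16-bit addition on a packed word: each half is added
     * independently, so a carry out of the low lane must not spill
     * into the high lane.                                            */
    static uint32_t add2x16(uint32_t a, uint32_t b)
    {
        uint32_t lo = (a + b) & 0x0000FFFFu;          /* low lane  */
        uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* high lane */
        return hi | lo;
    }

    int main(void)
    {
        uint32_t r = add2x16(pack2(40000, 7), pack2(30000, 8));
        printf("lo=%u hi=%u\n", r & 0xFFFFu, r >> 16);  /* lanes wrap mod 2^16 */
        return 0;
    }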
3.2.2 Functional Units
The FUs that are present in the architecture are shared between all the execution units. This is done
in order to optimize the resource usage, and to reduce the hardware requirements. However, this design
option will also require a conflict control, to manage those situations when multiple execution units try to
access more FUs than those available. Therefore, whenever there are execution units trying to access
a busy FU, a stall is generated, and the instruction will be held until the required FU is free, taking such
instruction additional clock cycles to compute (see figure 3.8).
The order in which each execution unit is assigned a FU follows a priority list, where the units that process the left-most cells have higher priority than those that process the right-most cells. The probability of conflicts could be reduced by adding more FUs, at the cost of an increase in both hardware and power requirements. An optimal solution uses the minimum number of FUs that avoids conflicts, leading to a better resources/usage ratio. To tune the most suitable number of available FUs, the DP algorithms must be characterized both in terms of the amount and the type of operations. Considering that most DP algorithms use simple operations such as sums and subtractions, shifts and logic operations, multiple units of these types are required. By looking at the subset of algorithms that are considered in this work, the considered set of available FUs in the devised
Figure 3.9: Instruction words for the bundle and the composing units. (c) DSU instruction for an architecture with 4 execution units.
As can be seen in figure 3.9(b), the encoding of the execution unit's instructions comprises the common register address fields Ra, Rb and Rd (which correspond to the first and second operand addresses and to the destination address, respectively), the WE field, which indicates when a register should be written with a given value, and the operation encoding fields, namely the Opcode field, which selects between the different types of instructions (arithmetic/logical, control/branch and memory access), and the OpControl field, which identifies a certain modifier to the instructions (e.g., usage of immediate values in arithmetic/logical operations, or of inequality comparisons in control operations). Three more special control fields are also present in this encoding: Td, Ta and Tb. The Td field enables a broadcast write (a 3-way register write, relevant for DP algorithms that have up to 3 dependencies), while bits Ta and Tb specify which part of the data is to be loaded or written to registers, for memory instructions that operate on divisible parts of data.
The summarized implemented instruction set can be seen in Table 3.1, and the full instruction set in Appendix A.1. The instruction set includes commonly used arithmetic and logic instructions
(e.g. addition, subtraction and logic OR, AND and XOR with their respective immediate counterparts)
as well as a special Maximum and Move (MAXMOV) instruction. This instruction performs the maximum
operation while moving a register in parallel, proving useful in DP algorithms that present dependencies
for the iteration after the next (e.g. diagonal dependencies in the SW algorithm). The SUM, SUB and MAX
instructions are the only ones that can be modified by the Td field, to enable the broadcast write, as
can be seen in Appendix A.1. The MAX and MAXMOV instructions can also be modified by the OpControl field: when bit 5 of OpControl is active, both instructions perform gap register comparisons for the SW algorithm. The MAX instruction also concatenates the result of the maximum operation with a different register (e.g. concatenation with a sniffing register).
The memory instructions are responsible for the load and store operations on the RAM and local
fast memories. The load operations require a previous indexation, encoded by the instructions INDEX
MADDR (for the RAM memory) and INDEX SPADDR (for the local fast memory). The different sized loads
and stores are controlled by the Ta and Tb instruction fields (see Appendix A.1). The INDEX SPADDR
instruction can also be modified by the OpControl in order to perform a comparison between two values
to find the correct address to index (requiring a comparison FU). This is useful for alignment algorithms,
which present substitution scores dependent on the aligning symbols.
The control instructions consist solely of delayed branches, where the instruction following the branch is still executed before branching. The instruction set is thus composed of a simple branch instruction as well as common conditional branches (e.g. not equal, less than, greater than) and their immediate counterparts.
The DSU has a different instruction format from the execution units, depicted in figure 3.9(c). This instruction format also explores ILP, since it encodes three distinct and parallel operations: a memory load/register write (bits 15 - 0); a memory write (bits 31 - 16); and a register shift (bits 35 - 31).
It is worth noting that the length of the DSU instruction depends on the number of execution units present in the architecture. In fact, figure 3.9(c) depicts the case where 4 execution units are present, including the 2 bits required to address each one of the 4 units in both Unit fields. The memory load operations are responsible for loading a value from memory (addressed by Madd) to one of the memory registers (addressed by the two-bit field Radd, since there are only 4 memory registers per execution unit) in one of the existing execution units (field Unit). Since the load instructions require a preliminary indexation before the actual load, the bits AddrnEN and regWE identify the index operation and the load operation, respectively. In order to optimize the throughput of memory load instructions, these two bits also enable simultaneous indexation and load operations. In such a situation, both the Unit and Radd fields identify the register to store the data loaded from memory, while the Madd field identifies the new memory address to be indexed for a later load operation. The memory write operations are very similar to the load operations, with the exception of only requiring one enable flag (MWE) to allow write access to the memory. There is also a localWE field that chooses between the RAM memory and the local fast memory, since the DSU is the only unit that can also write to the local memory.
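The three-part split of the DSU word can be visualised with a C bit-field sketch. Only the partition into a load part (bits 15-0), a write part (bits 31-16) and a shift part (top bits) follows the description above; the widths and positions of the individual fields are assumptions for illustration, and the shift part is written wider here than the 5 bits quoted in the text, since a ShiftAddr bit-mask, a direction bit and an enable bit do not fit in 5 bits for 4 memory registers:

    #include <stdint.h>

    /* Hypothetical layout of a DSU word for 4 execution units. */
    typedef struct {
        /* load/register-write part (bits 15-0) */
        unsigned madd    : 10;  /* memory address                  */
        unsigned radd    : 2;   /* memory register (4 per unit)    */
        unsigned unit    : 2;   /* target execution unit           */
        unsigned addr_en : 1;   /* AddrnEN: index operation        */
        unsigned reg_we  : 1;   /* regWE: load (register write)    */
        /* memory-write part (bits 31-16; remaining fields elided) */
        unsigned mwe     : 1;   /* MWE: memory write enable        */
        unsigned localwe : 1;   /* RAM vs. local fast memory       */
        /* register-shift part (top bits)                          */
        unsigned shift_addr : 4; /* ShiftAddr bit-mask             */
        unsigned left_right : 1; /* shift direction                */
        unsigned shift_en   : 1; /* ShiftEN                        */
    } dsu_instr;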
Table 3.1: Abridged implemented instruction set. The full instruction set is depicted in Appendix A.1.

  INSTRUCTION                                    MNEMONIC

  Arithmetic and Logic Instructions
    Add, Subtraction                             SUM, SUB
    Maximum, Maximum and Move                    MAX, MAXMOV
    Comparison                                   CMP
    Arithmetic and Logic Right and Left Shift    SRA, SRL, SLA, SLL
    Logic OR, AND, XOR                           OR, AND, XOR

  Memory Instructions
    Index memory address                         INDEX MADDR
    Load Byte, Half-word, Data                   LB, LH, LD
    Index local memory address                   INDEX SPADDR
    Local Memory Load                            SPAD LD
    Store Byte, Half-word, Data                  SB, SH, SD

  Control Instructions
    Delayed Branch                               BRD
    Delayed Branch Equal, Not Equal              BEQD, BNED
    Delayed Branch Less Than, Less Than Equal    BLTD, BLTED
    Delayed Branch Greater Than                  BGTD

The register-shift operation is responsible for creating, along with a memory read or write, a register
window mechanism integrating all the memory registers. This mechanism is depicted in figure 3.10 and can reduce the impact of memory accesses, by pre-loading a data value that will be required in future iterations of the computation, or by pre-storing a value to be used later. These memory accesses are done in parallel with the computations, without overwriting any values that have yet to be used, in order to prevent any data hazards. The registers to be shifted are chosen by the ShiftAddr bit-mask, from one of the periphery execution units (unit 0 or unit n) to the other, with the direction being chosen by the Left/Right field. An enable flag (ShiftEN) activates the shift operation.
It is also important to notice that the shift operation implemented by the DSU operates independently of the FUs. Moreover, given the higher priority of the DSU over the execution units, the shift operation will always overwrite the target memory register whenever an execution unit tries to update that register during the same clock cycle. For this reason, the memory registers should mainly be used by the execution units for accessing the stored values and not for updating them, as that is the DSU's main functionality.
As it was previously mentioned, more execution units can be easily added to the architecture by
widening the instruction bundle and adding register banks to the new units. Furthermore, these exe-
cution units can also be expanded to accommodate more words, by increasing their vector width. This
would also require some modifications in the FUs and in the memory accesses, in order to maintain
compatibility. The former scalability solution is better suited for algorithms that require many instructions per iteration, while the latter is better used by algorithms that require fewer instructions and operate on higher volumes of data.
Figure 3.10: Register window example. In this example, the third register in each array is shifted to the register of the array on its right, while the left-most array is loaded with a new value from memory.
3.3 Interface
The proposed architecture is envisaged to act as an accelerator element highly interconnected with
an off-the-shelf GPP, where the non-regular and less complex parts of the algorithms (e.g. control and
management structures) will be executed. Accordingly, it was decided to extend the design of the pro-
posed architecture to its interface with the outside world. In particular, it is envisaged an interfacing
structure that aims to be suited to implementations supported either in ASIC or FPGA technologies.
Naturally, a greater emphasis will be given to FPGA-based implementations, due to its greater availabil-
ity in the lab.
System on Chip (SoC) processing structures are usually formed by heterogeneous aggregates of processing elements. In particular, they commonly include a set of GPP elements and several accelerating processing structures. The GPP elements typically comprise a processor/microcontroller, together with the cache, the RAM and all the corresponding interconnections and input/output peripheral ports. A popular example of such a SoC structure based on FPGA technology is the Xilinx Zynq FPGA, comprising a Processing System (PS) and a Programmable Logic (PL) section. The latter is frequently used to create custom designs and integrate them with the processor in the PS. The proposed architecture is thus particularly suited to be integrated as a core located in the PL section of the FPGA.
This section presents an interfacing structure for the proposed VLIW processor based on the Advanced Microcontroller Bus Architecture (AMBA), according to its Advanced eXtensible Interface (AXI). These specifications are adopted by several FPGA vendors (e.g. Xilinx) and are considered the de-facto standard for 32-bit embedded processors, being well documented and royalty free.
After analyzing the proposed architecture, previously presented in this chapter, three main structures were identified as requiring communication with the GPP element: i) the instruction memory, ii) the RAM memory and iii) the local fast memory. The GPP only requires write access to all these memories, since their contents are only read by the VLIW core.
When integrated with the GPP, all the data to be computed in the proposed architecture core is stored in the system's RAM, requiring it to be loaded to the memories inside the core. The GPP is thus responsible for selecting and sending the correct data to the correct memories, depending on the algorithm
that is being processed. Ideally, the data is transferred in parallel with the algorithm computations, with a controller unit monitoring the data transfer to guarantee coherence. However, the VLIW core memories only have 2 access ports (a write-only and a load-only port, as previously detailed) and, with the exception of the instruction memory, both ports are already used by the core, preventing parallel access by the GPP, due to structural conflicts. To solve this, a multiplexer at the entrance of the write ports of the RAM and local fast memory is required. This multiplexer thus chooses between the proposed core and the GPP for write access. The multiplexer selection is done by an additional control unit, located outside the proposed core and inside the PL (see figure 3.11). This control unit must then be able to recognize the current algorithm phase, in order to switch the multiplexer accordingly and to enable the memory writes. This can either be done by also sending the instructions that are being processed by the proposed core to the control unit, or by using a feedback system, where the VLIW core communicates the current state of the operations.
Figure 3.11: AXI interconnection scheme between the RAM and the local fast memory in the proposed architecture core and the GPP in the PS.
As opposed to these two memories, the instruction memory has only one port being used, to load the instructions to the different units inside the proposed core. By connecting the remaining free port to the GPP, the new instructions can be seamlessly transferred to the VLIW core at the same time that other instructions are decoded in the core, without the need for a multiplexer (see figure 3.12). However, given
the difference between the data transfer frequency of the GPP to the VLIW core and the core’s operating
frequency, structural hazards can occur, and thus a control unit is required. Accordingly, this unit must
be able to monitor the memory, ensuring a correct data transfer. Therefore, the control unit requires the
knowledge of the current instruction being computed in the VLIW core (similarly to the control units for
the other two memories), as well as the control of the memory port signals, in order to appropriately
enable the writing access and choose the addresses for the data transfers.
Figure 3.12: AXI interconnection scheme between the instruction memory in the proposed architecture core and the GPP in the PS.
In order to connect the memories inside the VLIW core to the GPP, AXI controllers are required.
These units provide the interface to connect the memories to a central AXI Interconnect, which in turn
completes the communication bridge to the GPP in the PS. Figure 3.13 depicts the full interfacing
structure scheme.
Figure 3.13: Interface scheme for the proposed architecture core.

The AXI follows a handshake process to transfer the address, control and data information: the master (GPP) asserts and holds a VALID signal when data is available to transfer, and the slaves (the memories inside the VLIW core) respond with a READY signal when they are able to accept the data. When both signals are active, the transfer occurs. The AXI supports data bursts, which are necessary for the memories in the VLIW core. The instruction memory requires multiple instructions to be stored prior to the start of the algorithm, which must be sent in long bursts to reduce the stall time. Similarly, the RAM and local fast memory will also require long bursts of data, in order to prolong the algorithm computations without stalling the core, since their write ports can only be accessed either by
the core or the GPP at a given time. Using the SW algorithm as an example, and due to the large length of the reference and query sequences, the RAM memory can only store a limited number of sequences. In sequence alignment algorithms, it is common to perform multiple query alignments against the same reference sequence. Therefore, every time a fixed number of query sequences has been aligned to the reference sequence, a new set of queries must be sent from the GPP. During this time, the VLIW core will be stalled until all the new queries are stored in the RAM memory for the new alignments. By maximizing the burst length of query sequences, the time during which the core is stalled can be minimized, thus increasing performance.
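The handshake and burst behaviour can be summarised with a small behavioural C model (a didactic simplification of a single AXI channel; real AXI uses several independent channels, each with its own VALID/READY pair):

    #include <stdbool.h>
    #include <stdio.h>

    /* One beat of a VALID/READY handshake: the transfer only occurs
     * on a cycle where both signals are asserted.                    */
    static bool beat(bool valid, bool ready, int data)
    {
        if (valid && ready) {
            printf("transfer: %d\n", data);
            return true;          /* beat accepted by the slave      */
        }
        return false;             /* master must keep VALID asserted */
    }

    int main(void)
    {
        int burst[4] = { 10, 11, 12, 13 };  /* a 4-beat burst */
        int i = 0;
        for (int cycle = 0; cycle < 8 && i < 4; cycle++) {
            bool slave_ready = (cycle % 2 == 1); /* slave stalls on even cycles */
            if (beat(true, slave_ready, burst[i]))
                i++;
        }
        return 0;
    }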
An important problem not yet addressed is the number of input/output pins of the proposed core. The number of pins could significantly reduce the operating frequency of the core, due to an increase in routing complexity. In order to address this problem, it is necessary to know the width of the data being transferred to and from the proposed core, and how to reduce those widths.
The instruction length for the VLIW core varies with the total number of units (execution units and DSU) that are present. The encoding corresponding to each execution unit has a length of 32 bits, while the length of the DSU encoding varies with the number of execution units present. As an example, with 4 execution units and one DSU, the full instruction length would be 164 bits. Adding a 32-bit RAM and local fast memory on top of that, the required total number of bits to be transferred to the core rises to 228 bits. This excludes the outputs of the proposed core that are necessary to send information to the control units, as well as the algorithm results back to the GPP. In order to reduce these input widths, the transferred data should be shortened and sent in more frequent and smaller bursts. For the instructions, the adopted width should match the width of each unit. Therefore, each instruction sent from the GPP would be divided according to the number of units present in the core. For the previous example with 4 execution units and 1 DSU, one 36-bit and four 32-bit data transfers (five data transfers in total) would be required for the full instruction to be available in the proposed core.
For the remaining memories, a similar solution can be used. Since the proposed core allows the word size to be a submultiple of a maximum data width, the inputs for these memories could have the same width as the word size, with multiple word-sized transfers being required for the full data to be transferred. The same can be applied to the solution output, which also has the same width as these memories.
Finally, the control signals sent by the VLIW core to the previously introduced control units consist only of small flags, and thus should not require any additional modifications.
3.4 Summary
This chapter listed all the necessary requirements for the proposed architecture, and gave a detailed
description of all the architecture structures, including an interfacing structure proposal.
Exploiting both DLP and ILP, the resulting architecture consists of a VLIW architecture with multiple execution units and a DSU. Each execution unit is responsible for operating on an independent data vector, while the DSU takes care of parallel memory accesses. In order to enable communication between the execution units, shared register sets and sniffing mechanisms are implemented in the register banks. Additionally, the existence of two distinct memories (RAM and local fast memory) helps to reduce the conflicts between the units when accessing the memory, reducing delays and promoting a better structural organization. All these characteristics result not only in an optimized processor for DP algorithms, but also in a programmable architecture with potential for broader compatibility.
The interfacing structure to connect the proposed architecture to a GPP is discussed in the last section of the chapter. Although several techniques and considerations are presented for this interface, the proposed interface was not implemented in our work, due to time constraints.
This chapter describes the two algorithm implementations made for the proposed architecture: the SW and the Viterbi algorithms. It focuses on the processing scheme used by the algorithms, as well as on the instructions necessary to compute them in the proposed architecture, together with any special mechanisms and considerations used.
The architecture considered for the implementations consists of 4 execution units and 1 DSU, with 32-bit vectors. The SW implementation uses 8-bit words, processing 4 words (cells) per execution unit, while the Viterbi implementation uses 16-bit words, processing 2 words (cells) per execution unit.
4.1 Smith-Waterman
As explained in the second chapter, the SW algorithm computes the local alignment between a query
and a reference sequence. With the help of a substitution score matrix and gap penalty scores (affine
model) that indicate, respectively, the weight of matches/mismatches and insertions/deletions in the
alignment, the algorithm fills the resulting score matrix, from the upper left to the bottom right. This filling
operation respects the three dependencies that are present in the computations of every cell: the left,
top and top-left cell dependencies, resulting in parallelism extraction along the anti-diagonal, as it was
previously seen.
In addition to the anti-diagonal parallelism extraction, the algorithm will also follow a processing along
the query sequence (see figure 4.1). This processing scheme results in two distinct algorithm loops: an
inner loop, where a small reference sub-sequence is compared against the full query sequence; and an
outer loop, where a new reference sub-sequence is loaded, restarting the inner loop.
Although the processor is configurable to admit other setups, the described implementation uses, in
each of its 4 execution units, 32-bit vectors, each composed of 4 8-bit words, resulting in 16 8-bit cells
being simultaneously computed in all units.
Figure 4.1: SW processing scheme along the query sequence, extracting parallelism along the anti-diagonal.
Revisiting the main SW equations, it is possible to observe that, in order to compute the result for cell (i, j), the negative gap values (alpha or beta) are added to the vertical (cell (i-1, j), eq. (4.3)) and horizontal (cell (i, j-1), eq. (4.2)) dependencies, and the substitution score is added to the diagonal dependency (cell (i-1, j-1), eq. (4.1)).
Assuming that both sequences and the substitution matrix are already stored in memory, and that the gap and dependency values are already stored in the register banks, the required algorithmic steps in the inner loop of the algorithm can be broken down into the following: an indexation and the respective loads of the query symbols and substitution scores; the 3 dependency sums with the substitution and gap scores; and two maximum evaluations to find the final cell result.
    H_{i,j} = \max\left\{\, 0,\; E_{i,j},\; F_{i,j},\; H_{i-1,j-1} + Sm(q_i, d_j) \,\right\}   (4.1)

    E_{i,j} = \max\left\{\, E_{i,j-1} + \beta,\; H_{i,j-1} + \alpha \,\right\}   (4.2)

    F_{i,j} = \max\left\{\, F_{i-1,j} + \beta,\; H_{i-1,j} + \alpha \,\right\}   (4.3)
Due to its length, the query sequence is stored in the RAM, while the substitution score matrix is stored in the local fast memory. Therefore, the query sequence memory accesses can be performed by the DSU, allowing the execution units to load the substitution scores in parallel, taking a total of two clock cycles per iteration (see clock cycles 1 and 2 for Unit 0 in figure 4.2). Since the substitution score load requires the current query symbol in order to load the correct value (by performing a comparison between the query symbol and the reference symbol), the query symbol being loaded in parallel will be used in the next iteration, with the current query symbol already present in the register file.
Following the substitution score load, the 3 main sums can be computed, since all 3 dependencies and the gap scores are already stored in the register banks. These sums can be encoded into a single sum instruction if the Td flag is activated, as seen in the architecture's instruction set in the previous chapter. Therefore, these 3 sums take only 1 clock cycle to compute (see clock cycle 3 for Unit 0 in figure 4.2).

Finally, the maximum operations find the final result, which corresponds to the maximum value of the three previous sum results. Two maximum instructions are necessary, thus taking 2 clock cycles to finish (see clock cycles 4 and 5 for Unit 0 in figure 4.2). At the same time, the query symbols in each execution unit (which are stored in the memory registers) are shifted to the adjacent unit, in order to be reused during the next iteration. This is possible because the parallelism along the anti-diagonal and the processing along the query sequence are both exploited. The query symbol pre-loading during a previous clock cycle, together with the symbol shifting, corresponds to a register window scheme, similar to the one depicted in figure 3.10.
After the final cell value is computed, the inner loop restarts. The table in figure 4.2 details the inner loop for an example with 4 execution units and 1 DSU.

Figure 4.2: Main iteration (inner loop) operations (with the respective clock cycles) for the SW algorithm in the proposed architecture, for an example with 4 execution units and 1 DSU:

  Cycle 1 - DSU: Index crit. dep. (Unit 0); Unit 0: INDEX SPADDR (i+3,j)
  Cycle 2 - DSU: Load crit. dep. (Unit 0) | Index crit. gap (Unit 0); Unit 0: SPAD LD; Unit 1: INDEX SPADDR (i+2,j)
  Cycle 3 - DSU: Load crit. gap (Unit 0) | Index query symbol (Unit 0); Unit 0: SUM (Td=1); Unit 1: SPAD LD; Unit 2: INDEX SPADDR (i+1,j)
  Cycle 4 - DSU: Store cell result (Unit 3) | Load query symbol (Unit 0); Unit 0: MAXMOV; Unit 1: SUM (Td=1); Unit 2: SPAD LD; Unit 3: INDEX SPADDR (i,j)
  Cycle 5 - DSU: Store gap result (Unit 3) | Shift query symbols (u0 to u3); Unit 0: MAX (OpControl(5)=1); Unit 1: MAXMOV; Unit 2: SUM (Td=1); Unit 3: SPAD LD
  Cycle 6 - Unit 1: MAX (OpControl(5)=1); Unit 2: MAXMOV; Unit 3: SUM (Td=1)
  Cycle 7 - Unit 2: MAX (OpControl(5)=1); Unit 3: MAXMOV
  Cycle 8 - Unit 3: MAX (OpControl(5)=1)

The ILP is exploited in the SW implementation by having an offset of one instruction between adjacent execution units. Due to the processing along the query sequence, the most advanced
unit will correspond to the unit that is aligning the latest query symbol. Also, given the anti-diagonal parallelism and the dependency propagation from the top-left to the bottom-right, the most advanced unit will also correspond to the left-most unit, as can be seen in figure 4.1. Accordingly, due to the anti-diagonal parallelism and the number of existing units, this computational offset will not introduce any conflicts, as seen in the previous chapter (see figure 3.2).
The ILP exploitation greatly reduces the required number of FUs. From the table in figure 4.2, it is possible to see that, with 4 execution units, there are never more than 1 SUM, 1 INDEX SPADDR and 2 maximum instructions (MAXMOV and MAX) being computed during the same clock cycle. Therefore, the SW algorithm implementation only requires 3 SUM/SUB units (since the sum instruction refers to a 3-way broadcast sum), 2 MAXIMUM units and 1 COMPARISON unit (for the INDEX SPADDR instruction). If more execution units were present, the required number of FUs would be higher, or it could remain the same at the cost of adding stalls due to the increased number of conflicts.
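This property can be cross-checked with a short C simulation of the staggered steady state (a sketch assuming the 5-instruction inner loop INDEX SPADDR, SPAD LD, SUM, MAXMOV, MAX and a one-instruction offset between adjacent units, as in figure 4.2):

    #include <stdio.h>

    #define N_UNITS 4
    #define LOOP    5   /* inner loop: INDEX, SPAD LD, SUM, MAXMOV, MAX */

    int main(void)
    {
        /* steady state: unit u lags unit u-1 by exactly one instruction */
        for (int cycle = 0; cycle < LOOP; cycle++) {
            int sums = 0, indexes = 0, maxes = 0;
            for (int u = 0; u < N_UNITS; u++) {
                int slot = ((cycle - u) % LOOP + LOOP) % LOOP;
                if (slot == 0) indexes++;            /* INDEX SPADDR  */
                if (slot == 2) sums++;               /* SUM (Td = 1)  */
                if (slot == 3 || slot == 4) maxes++; /* MAXMOV / MAX  */
            }
            printf("cycle %d: SUM=%d INDEX=%d MAX-type=%d\n",
                   cycle, sums, indexes, maxes);     /* never exceeds 1/1/2 */
        }
        return 0;
    }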
The outer loop of the SW algorithm consists only of the load of new reference symbols, and occurs every time the end of the query is reached by an execution unit. These symbols are stored in the memory registers, and thus can be loaded in parallel by the DSU, similarly to the query symbols. The table in figure 4.3 depicts the instructions in the DSU and execution units for the outer loop. As can be seen in figure 4.3, the outer loop will not introduce any additional clock cycles, since it can be fully performed in parallel by the DSU.
Due to the partitioning of the reference sequence, some problems arise when solving cell dependencies between execution units, specifically the horizontal and diagonal dependencies. Since the processing scheme follows the query sequence, thus adopting a top-down anti-diagonal parallelism approach, the computed cell values are stored in the register banks and used as vertical dependencies during the next algorithm iteration. In the following iteration, the register with the vertical dependency value is overwritten with the new values. The same happens for the diagonal and horizontal dependencies. Inside the same unit, these dependencies are rapidly retrieved, since they are all located in the same register bank. However, the dependencies between units require the use of the sniffing mechanism. This mechanism is used by a unit to access the dependency cells from the adjacent execution unit to its left (the unit in advance), in order to use them in the next iteration as if they were stored in its own register bank.
Figure 4.3: Inner loop and outer loop operations (with the respective clock cycles) for the SW algorithm in the proposed architecture. The example depicts the architecture with 4 execution units and 1 DSU and 2 algorithm iterations. The outer loop for each execution unit comprises two DSU instructions.

Contrary to the other dependencies, the diagonal dependency requires two
registers in each unit. This is due to the fact that an anti-diagonal scheme is used and, therefore, the computed cell value will only be used as a diagonal dependency two iterations after the current one (making it necessary to store both the value to be used in the next iteration and the one to be used two iterations after the current one).
However, for the most advanced unit (the one aligning the left-most symbols of the reference sub-sequence), the horizontal and diagonal dependencies cannot be retrieved from an adjacent unit, since there is no unit in advance of it. These dependencies are computed in the previous reference sub-sequence, and therefore must be stored in memory. In fact, as can be seen in the tables of figures 4.2 and 4.3, the most delayed unit (the one aligning the right-most symbols of the reference sub-sequence) has its final cell values stored to memory by the DSU, in order to be later retrieved by the most advanced unit (again with the help of the DSU).
These critical sections (see figure 4.4) only occur between the two execution units that are computing the edges of the reference sub-sequences, and do not introduce any additional clock cycles, since the memory loads and stores are done in parallel by the DSU. Therefore, their processing can be seen as a register window scheme, where the new reference symbols are loaded just before they are required. The sniffing mechanism cannot be applied in the critical sections, due to the large length of the query sequence and the fact that the units involved are not adjacent. Since the processing follows the query sequence, the most delayed unit would need to store all of its computed cell values until the end of the query sequence, which would prove impossible given the limited number of registers compared to the query length.
Figure 4.4: Critical section between two sub-sequences of the reference sequence for an example case with 4 execution units. Each color/symbol represents a different iteration, with 4 iterations depicted. The dependencies required by Unit 0 for sub-sequence 1 must be retrieved from memory.

The affine gap model will also require a mechanism similar to the one used for the horizontal and vertical
dependencies. Since this model takes into account two distinct gap values (an initialization value and an extension value, for the case where there are several gaps in a row), all execution units keep two registers in their register bank with both gap values permanently stored. During the maximum operations of the SW algorithm, an auxiliary register stores the information regarding which dependency originated the maximum result. If it was a vertical or horizontal dependency, the auxiliary register is compared with its previously stored value, to check whether the new result is a gap extension or initialization, updating its value accordingly. This way, during the sum operations of the following iteration, the correct gap value to be used is already stored in the register bank.
For the most advanced execution unit, the auxiliary register that indicates the type of gap for the horizontal dependencies belongs to the most delayed unit in a previous iteration. Given that the required value is computed in a former iteration of the algorithm, it is stored in memory to be later loaded by the unit in advance, similarly to the horizontal dependency (see the DSU instructions in the table in figure 4.3). Also, just like the horizontal dependencies, adjacent units share these auxiliary gap registers by using the sniffing mechanism.
4.2 Viterbi (Profile HMMs)
The Viterbi algorithm can find the most likely sequence path of hidden states in a HMM for a given sequence of observed outputs. As previously mentioned, this algorithm is well suited for solving sequence alignment problems with the help of profile HMMs. These HMMs take into account a family of similar sequences (a profile), thus enabling an alignment between a query sequence and the whole family (which can be seen as the reference sequence) at once. They also require additional states not present in normal HMMs, as depicted in figure 2.7. These special states are specific to the multihit local alignment, achieving several local alignments between the compared sequences. This model was chosen to facilitate the comparison with the GPP implementation of the Viterbi algorithm presented in the next chapter, as well as to enable a comparison with the previously explained SW algorithm.
The considered Viterbi algorithm implementation will follow the same anti-diagonal parallelism and
processing scheme along the query sequence as the SW algorithm (see figure 4.1). It will have 16-bit
words, resulting in 8 cells being computed at every iteration, two per execution unit. This translates into
8 query and reference symbols being compared every iteration.
This implementation requires, in addition to the query sequence, a profile corresponding to the reference sequence, with transition and emission values between all the existing states, all stored in memory. The profile should follow the optimizations made by the HMMER [9] application, since it will be used as a comparative reference in the following chapter. This optimized profile has the transition and emission values aligned with the algorithm's access pattern, resulting in faster accesses to these values. However, the access pattern implemented by the HMMER application consists of a striped pattern along the query sequence, based on Farrar's [12] implementation of the SW algorithm (see figure 4.5(a)). As a result, the profile must be modified to fit the anti-diagonal access pattern used in the proposed architecture.
Given that two symbols are compared in each unit, the profile should group the emission and transition scores in pairs, so that both scores can be retrieved in a given unit with only 1 load instruction. In fact, after analyzing the profile in the HMMER tool, it was observed that each combination of query-reference symbols only requires a total of two different emission/transition scores, instead of a different score for every state. This occurs due to score overlapping between different states. Furthermore, given that two cells are computed in each unit, this results in 2 load instructions per unit, for a total of 8 load instructions at every iteration. Figure 4.5(b) depicts the emission/transition score pattern that should be used for the implemented architecture. It is important to notice that these memory accesses will not have any influence on the algorithm throughput, since they can be computed exclusively by the DSU, in parallel with the main algorithm operations (see Appendix B.1).
Furthermore, both the emission and transition scores, ordered according to the query sequence, have new values being retrieved for each new reference symbol. Given the potentially large size of the query sequences, the storage of all emission and transition scores cannot be accommodated in the register banks of the proposed architecture and so, similarly to the SW algorithm, only the smaller required subset of scores is available at any given instant, with the rest being stored in memory. Effectively, for every sequence symbol being computed in the proposed architecture, there must be a different set of emission and transition scores, of which a small subset must be retrieved at every iteration.
Figure 4.5: Comparison of example profiles for the HMMER platform [9] (computing 4 cells in parallel) and the proposed architecture (computing 8 cells in parallel). The way the scores are ordered according to the used processing pattern is highlighted in both examples. (a) Profile example for the HMMER [9] platform (left), with the respective striped pattern (right): each cell in the transition costs matrix has the 4 costs for all the 4 cells computed in parallel; for each cell, only two different transition scores are used for all the 7 transitory states (represented in grey); the last row represents the transition scores that are necessary for the lazy loops resulting from the striped pattern. (b) Profile example (with random costs) for the proposed architecture (left), with the respective anti-diagonal pattern (right): each unit computes 2 cells, which results in 4 transition/emission costs necessary for each unit; the 2 costs per cell cover all the transitory states; the cells in each unit have their correspondent in the diagonal pattern matched by the colored circles.

The operations required to compute a pair of cells in one execution unit are listed in figure 4.6. Both the sequence and query symbols, as well as the respective emission/transition scores required for any
given iteration, are stored in the respective register banks prior to any of the cell operations being computed. Just like in the SW algorithm, the main operations for the three main states (M, I and D) consist of sums/subtractions and maximum operations. The same also applies to the special states B, E and J.
Figure 4.6: Main iteration (inner loop) operations for the Viterbi algorithm in the proposed architecture: over 23 clock cycles, pairs of execution units issue the SUM and MAX instructions that compute the M, E, D, I, J and B states. Only the execution unit instructions are depicted; for the full pseudo-code consult Appendix B.1.

The dependencies required for the M state will differ from the SW algorithm, since they now require the diagonal dependencies of both the I and D states whereas, in the SW, only the M diagonal
dependency and the current I and D states were required. This results in a delayed load/store scheme, using additional registers to store the previous and current scores, similar to the solution used for the diagonal dependencies in the SW implementation. The remaining dependencies for the I and D states are implemented in the same way as their SW counterparts (see equations (2.10), (2.11) and (2.12) in chapter 2).
The special states B, E and J are also required to be updated at every iteration, since the B state
dependency is used in the computation of the M state score, while depending itself on the J state. In
turn, the J state depends on the E state (see figure 2.7).
The E score corresponds to the current maximum score for the corresponding sequence symbols in any execution unit. Accordingly, it has to be updated at every iteration and propagated to the computation of the states J and B. In turn, the J state compares the fixed transition cost of moving from the updated E state with the cost of remaining in the J state. Finally, the B score takes into account the newly updated J score and compares its cost with the cost of moving from state N to state B. The special state N is computed in the outer loop, since it only depends on the current reference sequence symbol. The different loop and move transition costs are constant throughout the algorithm computations and thus are pre-stored in the register banks for faster access. These special states introduce additional sum and maximum operations in the inner loop of the algorithm, as can be seen in figure 4.6.
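In scalar form, this chain of special-state updates can be sketched as follows (log-space scores; the cost names tEJ, tJJ, tJB and tNB are placeholders for the fixed move/loop transition costs kept in the register banks, not the thesis's notation):

    #include <stdint.h>

    static inline int16_t vmax(int16_t a, int16_t b) { return a > b ? a : b; }

    /* Per-iteration update of the special states in log space:
     * E accumulates the running maximum of the M scores, J chooses
     * between entering from E and looping on itself, and B chooses
     * between entering from N and entering from J.                 */
    void special_states(int16_t m_max,    /* max M score this iteration */
                        int16_t n_score,  /* N state (outer loop)       */
                        int16_t tEJ, int16_t tJJ,
                        int16_t tJB, int16_t tNB,
                        int16_t *e, int16_t *j, int16_t *b)
    {
        *e = vmax(*e, m_max);                                       /* E state */
        *j = vmax((int16_t)(*j + tJJ), (int16_t)(*e + tEJ));        /* J state */
        *b = vmax((int16_t)(n_score + tNB), (int16_t)(*j + tJB));   /* B state */
    }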
The remaining special state C is only updated in the outer loop, just like the N state. While the N state only depends on the current reference sequence symbol, the C state corresponds to the maximum cell value in the respective execution unit. Therefore, after all execution units reach the end of the query sequence, and before they start computing a new sub-sequence of the reference sequence, the maximum C score must be found among all units. This is possible by storing the C scores in the shared registers, which makes them available to all units. The final C score is then stored in the first execution unit, and a new sub-sequence of symbols can start its computation (see figure 4.7).
Figure 4.7: Outer loop pseudo-code of the Viterbi algorithm in the proposed architecture, comprising an initialization phase (where the dependency vectors are set to -infinity; OR and SUM instructions are used to avoid adding FUs) and a finalization phase that computes the additional special states, over 18 clock cycles. Only the execution unit instructions are depicted; for the full pseudo-code consult Appendix B.1.
The fact that all execution units must reach the end of the query sequence before starting the alignment of a new sub-sequence introduces a small delay that was nonexistent in the SW implementation, since there are now 3 initialization and finalization iterations, at the beginning and at the end of the query sequence, respectively, for every new sub-sequence of the reference sequence, as can be seen in figure 4.8. Additionally, the processing scheme also has critical sections, just like those seen in the SW algorithm (see figure 4.8). To solve them, a similar register window scheme is used, where the dependencies generated in the last execution unit are stored in memory after they are computed, and the dependencies required by the first unit are loaded before they are needed. This is also consolidated by the delayed load/store scheme mentioned above.
The ILP adopted in the Viterbi implementation also differs from the one observed in the SW algorithm. Previously, each execution unit was 1 instruction ahead of its adjacent unit, resulting in the most advanced unit being 4 instructions ahead of the most delayed unit. In the Viterbi implementation, the delay between instructions only occurs in pairs, with the first two units being 1 instruction ahead of the last two units. This can be seen in figures 4.6, 4.7 and in Appendix B.1 (where the instructions appear in pairs). This was done in order to
keep the same number of FUs that were used in the SW algorithm implementation. If an identical ILP extraction were used, the required number of FUs would be greater, but it would also come with an increase in performance.

Figure 4.8: Critical section of the Viterbi implementation between two sub-sequences of the reference sequence, for an example case with 4 execution units. Each anti-diagonal/color/symbol represents a different iteration, with 7 iterations depicted (four in sub-sequence 0 and three in sub-sequence 1). Sub-sequence 1 can only start its computations after all units finish their computations in sub-sequence 0. The dependencies required by Unit 0 for sub-sequence 1 must be retrieved from memory and are represented by the red arrows.
The implementation of the Viterbi algorithm in the proposed architecture thus results in a stationary phase (inner loop) comprising an average of 23 instructions for an execution unit to complete one iteration of the algorithm, with two cells being updated simultaneously in each unit. After all units reach the end of the query sequence, the outer loop takes 18 cycles until a new sub-sequence starts being aligned.
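From these figures, a rough steady-state throughput estimate follows (a derived back-of-the-envelope value, assuming the 4-unit configuration and ignoring the outer-loop and critical-section overheads):

    \text{throughput} \approx \frac{4\ \text{units} \times 2\ \text{cells}}{23\ \text{cycles}} \approx 0.35\ \text{cells/cycle}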
4.3 Summary
This chapter described the algorithm implementations for the SW and Viterbi algorithms, in the pro-
posed architecture.
These algorithms compute the sequence alignment of a reference sequence against a query sequence, exploiting anti-diagonal parallelism. This processing scheme avoids any dependency between the cells being processed, thus increasing performance. The algorithms also take advantage of the mechanisms available in the proposed processor, such as the sniffing mechanism, the shared registers and the DSU, which parallelizes memory accesses.

Finally, the pseudo-code for both algorithms is also presented in this chapter.
This chapter details the reference state-of-the-art architectures used to evaluate the benchmarked algorithm applications: the SW and Viterbi algorithms. The implementation of these applications in the evaluated architectures is also detailed, together with the respective datasets.

A performance evaluation is conducted for the presented architectures, followed by a combined performance and energy efficiency evaluation, completing the evaluation tests.
5.1 Hardware Prototype
The proposed architecture was prototyped in a Zynq SoC 7020 FPGA [35]. The implemented config-
uration architecture issues, at each clock cycle, one bundle of instructions to four 32-bit execution units
and one DSU, each using vectorial instructions to process multiple cells in parallel. This results in a
128-bit wide VLIW, allowing the computation of the 16 8-bit (for the SW algorithm) or 8 16-bit (for the
Viterbi algorithm) cells in parallel, used by the considered benchmark algorithms. The register banks
and memories share the same width as the execution units, thus being composed by several cells in
each register and memory block.
The synthesis and place-&-route of the architecture were performed using the Xilinx ISE 14.4 tool. The amount of occupied resources is presented in table 5.1. As can be observed, the proposed architecture uses 6% of the Slice Registers, 50% of the Slice LUTs and 5% of the BRAMs available on the Zynq SoC 7020, achieving a maximum post-route operating frequency of 98.5 MHz. By using the Xilinx Power Estimation tool [36], the power consumption of the proposed processor was further estimated. Assuming worst-case conditions for flip-flop and memory updates, it results in a power consumption of 0.584 W.
Table 5.1: Hardware resources, operating frequency and power estimation for the proposed architecture.

  Hardware Resources     Used     Total     Utilization
  Slice Registers        7135     106400    6%
  Slice LUTs             26725    53200     50%
  36-bit Block RAMs      7        140       5%

  Frequency              98.5 MHz
  Power                  0.584 W
The amount of used Slice LUTs corresponds to 50% of the total available LUTs, and will thus be the limiting factor for the processor's scalability when increasing the number of execution units or the vector lengths. In fact, a scalability evaluation of the proposed architecture was performed, showcasing the hardware requirements. This study was conducted by changing the size of the vector in all execution units from 32 to 40 bits (an increase in DLP), and by including an additional execution unit (an increase in ILP). The increase of the vector width results in a 21.4% and 24.6% increase in slice registers and LUTs, respectively, while the addition of one execution unit results in an increase of 23.3% and 29.9% in slice registers and LUTs. The number of Block RAMs is only affected by changes to the vector width, increasing by one unit for every 16 bits added to the vector length.
Despite the increase in hardware, the estimated power drops to 0.504 W (13.7%) when the vector width increases to 40 bits, and to 0.563 W (4%) with the addition of an execution unit. This can be explained by the significant drop in the operating frequency in both situations. Figure 5.1 summarizes the hardware scalability results.
Figure 5.1: Hardware scalability of the proposed architecture. The considered evaluations included increasing the width of the vectors, as well as increasing the number of execution units. The obtained hardware resources, operating frequency and power estimates are:

  Configuration                Slice Registers   Slice LUTs   BRAMs   Frequency [MHz]   Power [W]
  4 32-bit units (baseline)    7135              26725        7       98.5              0.584
  4 40-bit units               8662              33300        7       66.4              0.504
  5 32-bit units               8795              34726        7       74                0.563
  FPGA total                   106400            53200        140
5.2 Performance Evaluation
This section details the reference state-of-the-art architectures and compare them to the proposed
architecture, by using performance evaluation metrics. It also presents the application benchmarks and
respective used datasets.
5.2.1 Reference State-of-the-art Architectures
The proposed architecture was evaluated against three distinct state-of-the-art architectures, representing three distinct domains: i) mobile and low-power GPPs; ii) high-performance GPPs; iii) programmable ASIPs.
ARM Cortex-A9: A low-power GPP running at an operating frequency of 533 MHz. It is integrated within the Zynq SoC 7020 FPGA (the same board used for the proposed architecture), corresponding to the PS of the SoC. Its architecture supports out-of-order execution, with dual instruction issue and 128-bit SIMD extensions, allowing the issue of up to 2 instructions per clock cycle. In order to take full benefit of all vector capabilities of the ARM processor, the processor's SIMD extension (NEON intrinsics [37]) was used.
Intel Core i7 3820: A high-performance GPP, running at a maximum frequency of 3.6 GHz. This processor uses a complex control structure capable of multiple instruction issue with out-of-order and speculative execution (issuing up to 6 micro-ops per clock cycle [38]), achieving an average of 2 Instructions Per Cycle (IPC) for the evaluated algorithms and respective datasets. The SSE2 SIMD extension [38] was used with 128-bit wide vectors.
Bioblaze [23]: A dedicated ASIP, running at a frequency of 158 MHz. It uses a 128-bit adapted SIMD extension ISA and was implemented in the same Zynq FPGA, for a fair comparison.
The SW algorithm was implemented in all architectures, while the Viterbi algorithm was only implemented in the first two.
5.2.2 Application Benchmark
The benchmark applications consist in the previously introduced DP algorithms: the SW and Viterbi
algorithms. Both were implemented to solve sequence alignment problems between a query and refer-
ence sequences.
5.2.2.A Smith-Waterman
As described previously, the considered implementation of the SW algorithm uses 8 bits for all symbols and scores. Given that the vector lengths are dimensioned to a maximum width of 128 bits, a total of 16 8-bit cells is processed in parallel (4 cells per execution unit in the proposed architecture).
The considered SW algorithm implementation was already detailed in chapter 4. To summarize, the algorithm is parallelized along the anti-diagonal (in order to avoid data dependencies) and processed along the query sequence, aligning smaller reference sub-sequences at a time. During the steady state of the algorithm, this processing scheme results in only 5 clock cycles to process any given cell of the DP scoring matrix. In practice, given the 16-cell parallelism, this corresponds to 3.2 cells being computed in each clock cycle. This is made possible by the DSU, which parallelizes the memory accesses, removing their impact from the inner loop of the algorithm. Nevertheless, the considered processing scheme presents critical sections where an extra memory access is required to migrate to a new reference sub-sequence. However, these memory accesses can also be handled by the DSU, thus eliminating any performance impact caused by the outer loop. The sketch below illustrates the anti-diagonal traversal order.
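The following is a minimal C sketch of the anti-diagonal (wavefront) traversal assumed by this scheme; cell_update() is a hypothetical helper standing for the SW recurrence, and in the proposed architecture the independent cells of each anti-diagonal are computed by the 16 8-bit vector lanes rather than by a sequential loop.

    extern void cell_update(int i, int j);  /* hypothetical SW cell recurrence */

    /* Wavefront traversal of an m x n DP matrix: all cells on the same
     * anti-diagonal d = i + j are mutually independent, so the inner loop
     * can execute fully in parallel. */
    void wavefront(int m, int n) {
        for (int d = 0; d < m + n - 1; d++) {
            int i_lo = (d < n) ? 0 : d - n + 1;   /* clip the diagonal to the matrix */
            int i_hi = (d < m) ? d : m - 1;
            for (int i = i_lo; i <= i_hi; i++) {
                int j = d - i;                    /* cell (i, j) lies on diagonal d */
                cell_update(i, j);                /* reads results of diagonals d-1, d-2 */
            }
        }
    }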
For the remaining benchmarked architectures, the implemented SW algorithm follows Farrar's implementation [12], using their SIMD ISA extensions with a vector length equivalent to that of the proposed architecture (128 bits), to guarantee a fair comparison. Furthermore, only one core of each architecture is used.
This implementation adopts a striped access pattern processing scheme along the query sequence direction, where the computations are carried out in several separate F stripes that cover different parts of the query sequence. Accordingly, the query is divided into F p-length segments, where p is given by the number of vector elements that can be simultaneously accommodated in a SIMD register (see figure 5.2(a)). This results in a value of p equal to 16, for 8-bit data elements and 128-bit SIMD registers.

(a) Memory layout for the query profile. The vectors run parallel to the query sequence in a striped pattern.
(b) Data dependencies between the last F vector and the first.
Figure 5.2: Striped pattern processing scheme and corresponding dependencies (figures taken from [12]).
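As an illustration, the following sketch shows the striped index mapping assumed by Farrar's query profile layout [12] (names are illustrative): lane l of stripe vector v holds the query position l × seglen + v, so consecutive vectors advance through all segments simultaneously.

    /* Striped query-profile indexing, a sketch assuming Farrar's layout [12]:
     * the query (length m) is covered by `lanes` segments of seglen symbols,
     * and stripe vector v gathers one symbol from each segment. */
    int striped_query_pos(int v, int lane, int m, int lanes) {
        int seglen = (m + lanes - 1) / lanes;   /* segment length, ceil(m/lanes) */
        return lane * seglen + v;               /* positions >= m are padding */
    }

For 128-bit registers and 8-bit data elements, lanes equals the value of p = 16 referred above.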
However, the data elements in this processing scheme are not fully independent, since the F segments have vertical dependencies on each other (see figure 5.2(b)). Hence, after all segments are processed, a lazy loop must be executed in order to verify whether any data hazards have occurred. If a correction is needed, a second pass of the loop is required to fix the affected scores before a new reference symbol is loaded for alignment. Although this loop is executed in the outer loop of the algorithm (after the query sequence is fully swept), its performance impact is still very relevant, especially when compared to the anti-diagonal processing scheme, where no data dependencies occur and, therefore, no lazy loops are required. The sketch below illustrates this correction pass.
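The following scalar sketch conveys why the correction pass is needed; it is a simplified rendition of Farrar's scheme [12] (all names are illustrative), in which the vertical gap scores F are ignored in the main pass and re-introduced afterwards. Farrar's actual SIMD code adds an early-exit test to the second loop, which is what makes it "lazy": in practice it rarely needs to visit many cells.

    #include <limits.h>

    static inline int imax(int a, int b) { return a > b ? a : b; }

    /* One column of affine-gap SW. On entry, H holds column j-1 and E the
     * horizontal gap scores; W is the query-profile column for the current
     * reference symbol; Go/Ge are gap-open/extend penalties. */
    void sw_column(int *H, int *E, const int *W, int m, int Go, int Ge) {
        int Hd = 0;                               /* H[-1][j-1] boundary      */
        for (int i = 0; i < m; i++) {             /* main pass: F ignored     */
            int h = imax(imax(Hd + W[i], E[i]), 0);
            Hd = H[i];                            /* save H[i][j-1]           */
            E[i] = imax(h - Go, E[i] - Ge);       /* E for the next column    */
            H[i] = h;
        }
        /* Correction ("lazy") pass: propagate the vertical gap scores F
         * down the column and patch any cell they improve. */
        int F = INT_MIN / 2;                      /* no vertical gap yet      */
        for (int i = 0; i < m; i++) {
            if (F > H[i]) H[i] = F;               /* vertical gap wins        */
            F = imax(H[i] - Go, F - Ge);
        }
    }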
The extended SIMD ISA offered by the Bioblaze ASIP is specially tailored for the SW algorithm. Therefore, it results in an accelerated version of the original Farrar's implementation, since an efficient fine-grain parallelism exploitation can be extracted. However, it remains the same algorithm, with its striped processing scheme and lazy loops.
Dataset
To benchmark the SW algorithm, a DNA dataset composed of several reference sequences (ranging from 128 to 16384 elements) and a set of query sequences with lengths ranging from 20 to 2276 elements was used. The reference sequences correspond to twenty indexed regions of the Homo sapiens breast cancer susceptibility gene 1 (BRCA1 gene) (NC 000017.11). The query sequences were obtained
from a set of 22 biomarkers for diagnosing breast cancer (DI183511.1 to DI183532.1) and a fragment,
with 68 base pairs, of the BRCA1 gene with a mutation related to the presence of a Serous Papillary
Adenocarcinoma (S78558.1).
5.2.2.B Viterbi
The considered implementation of the Viterbi algorithm adopts a representation with 16 bits for all symbols and scores. The vector lengths are dimensioned to a maximum width of 128 bits, which results in a total of 8 16-bit cells being processed in parallel (2 cells in each execution unit of the proposed architecture), corresponding to double the cell size that was adopted in the SW algorithm. This is due to the higher precision requirements of the Viterbi algorithm when compared to the SW algorithm.
The Viterbi algorithm implementation on the proposed architecture was already described in detail in chapter 4. Just like the SW algorithm, it is parallelized along the anti-diagonal and along the query sequence, partitioning the reference sequence into smaller sub-sequences. As a result, during the steady state of the algorithm, any given cell takes an average of 23 clock cycles to be computed (effectively taking 2.875 clock cycles per cell, given the 8-cell parallelism). This is made possible by the DSU, which parallelizes the high number of memory accesses, removing their impact from the inner loop of the algorithm. Additionally, the processing scheme presents critical sections whenever the end of the query is reached. These critical sections introduce a small computational delay (nonexistent in the SW algorithm) in order to ensure that the data dependencies are satisfied. Therefore, the outer loop accounts for 3 additional inner loop iterations, or 69 clock cycles.
For the remaining evaluation platforms, HMMER's [9] Viterbi implementation was used. This implementation follows a processing scheme very similar to Farrar's implementation of the SW algorithm, with the required modifications to suit the Viterbi algorithm. As such, the implementation follows the same striped access pattern processing scheme along the query sequence, where the computations are carried out in several separate F stripes that cover different parts of the query sequence.
In contrast to the SW algorithm, the match states in the Viterbi algorithm depend on scores from both previous row and column states (as seen in chapter 4). Therefore, HMMER's implementation uses a delayed load/store scheme that only stores the new values after preemptively loading the previous ones. Although this algorithm inherently has more instructions than the SW algorithm, this instruction reordering helps to minimize the number of required instructions, at the cost of more storage. A scalar sketch of this reordering is shown below.
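The sketch below illustrates the delayed load/store idea on a simplified three-state profile recurrence computed with in-place row buffers; all array and transition names (tMM, tIM, etc.) are illustrative, not HMMER's actual identifiers.

    static inline int imax(int a, int b) { return a > b ? a : b; }

    /* M, I, D hold row i-1 on entry and are overwritten in place with row i.
     * eM/eI are emission scores (precomputed for the current row's symbol)
     * and t** are state-transition scores, all in log space. Column 0 holds
     * boundary values and is left untouched in this simplified sketch. */
    void viterbi_row(int *M, int *I, int *D,
                     const int *eM, const int *eI,
                     const int *tMM, const int *tIM, const int *tDM,
                     const int *tMI, const int *tII,
                     const int *tMD, const int *tDD, int n) {
        int prevM = M[0], prevI = I[0], prevD = D[0];  /* row i-1, column j-1 */
        for (int j = 1; j <= n; j++) {
            /* delayed load/store: read the old (row i-1) values of column j
             * BEFORE overwriting them with the new row */
            int curM = M[j], curI = I[j], curD = D[j];
            M[j] = eM[j] + imax(imax(prevM + tMM[j-1], prevI + tIM[j-1]),
                                prevD + tDM[j-1]);
            I[j] = eI[j] + imax(curM + tMI[j], curI + tII[j]);
            D[j] = imax(M[j-1] + tMD[j-1], D[j-1] + tDD[j-1]);
            prevM = curM; prevI = curI; prevD = curD;  /* slide the window */
        }
    }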
Additionally, the lazy loops still exist in the outer loop of the algorithm (whenever the end of the query is reached). However, unlike their SW counterpart, the lazy loops in the Viterbi algorithm are simpler, with a lower impact on the resulting performance.
Dataset
To evaluate Viterbi’s algorithm implementation, a sample of 28 HMMs from the Dfam database of Homo
Sapiens DNA [39] were used. The adopted model lengths vary from 60 to 3000, increasing by a step of
roughly 100 model states. These models were created by the HMMER3.1b1 tool [9] and their complete
list is presented below (their length is prefixed to the model name):
M0063-U7 M0700-MER77B M1409-MLT1H-int M2204-CR1 Mam
M0101-HY3 M0804-LTR1E M1509-LTR104 Mam M2334-L1M2c 5end
A query sequence (generated by the HMMER tool) with a length of 10000 symbols was used to evaluate the alignment against all the above reference sequences. Additionally, in order to study the impact of both the query and reference lengths on the algorithm performance, a sample of 17 generated query sequences, with lengths ranging from 20 to 10000, was used to evaluate the algorithm's performance in the alignment against the longest reference sequence, with a length of 2991 symbols.
5.2.3 Performance Evaluation
In the proposed architecture, both the RAM and the local fast memory are pre-loaded with the reference and query sequences (RAM), together with all the necessary constants and cost/score values (both memories) required by the evaluated algorithms. Therefore, only the algorithm steps are accounted for in the performed evaluations.
Accurate clock cycle measurements of the time required to execute each biological sequence analysis on the proposed platform were obtained by using the Xilinx ISim [40]. In the Bioblaze, the clock cycle measurements were obtained by using Modelsim SE 10.0b [41]. In the ARM Cortex-A9 and the Intel Core i7, the system timing functions were used to determine the total execution time of the DNA sequence alignment. To improve the measurement accuracy, several repetitions of the same alignment were performed. The obtained values were subsequently divided by the number of repetitions and multiplied by the processor clock frequency, in order to express the measured times in clock cycles.
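A minimal sketch of this GPP timing methodology is shown below (all names are illustrative; align() stands for the SW or Viterbi kernel under test, and f_hz for the assumed processor clock frequency):

    #include <time.h>

    /* Measure one alignment by amortizing timer overhead over many
     * repetitions, then convert the per-alignment time to clock cycles. */
    double cycles_per_alignment(void (*align)(void), int reps, double f_hz) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            align();                                  /* kernel under test */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (double)(t1.tv_sec - t0.tv_sec)
                    + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return (secs / reps) * f_hz;                  /* seconds -> cycles */
    }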
The performance evaluation then relies on two metrics: the number of Clock Cycles per Cell Update (CCPCU) and the number of Cell Updates Per Second (CUPS).
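For a query of length m aligned against a reference of length n in t seconds on a platform with clock frequency f, these metrics are given by CCPCU = (t × f)/(m × n) and CUPS = (m × n)/t; the two are therefore related by CUPS = f/CCPCU. A lower CCPCU and a higher CUPS both indicate better performance.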
5.2.3.A Smith-Waterman
Table 5.2 depicts the average number of clock cycles needed to complete the DNA sequence alignment in all evaluated architectures, for the previously presented dataset. The resulting clock cycle ratios between the reference architectures and the proposed architecture can be observed in the respective columns (relating the observed differences in terms of clock cycles), which account for the affine model of the algorithm.
The charts in figure 5.3 were drawn in order to study how the number of clock cycles is affected by the
length of each sequence (both query and reference sequences). The plot in figure 5.3(a) represents the
number of clock cycles of the Bioblaze and the proposed architecture for an alignment between a fixed
Table 5.2: Average number of clock cycles for different DNA query sequences matched against a 4096 element reference sequence, using the SW algorithm and the considered execution platforms, with the respective clock cycle ratios.
(a) Average number of clock cycles for the SW algorithm implementation using the Bioblaze and the proposed architecture, when considering a fixed query sequence composed of 64 symbols and multiple reference sequences.
(b) Average number of clock cycles for the SW algorithm implementation using the Bioblaze and the proposed architecture, when considering a fixed reference sequence composed of 4096 symbols and multiple query sequences.
Figure 5.3: Comparison of the average number of clock cycles for the SW algorithm implementation using the Bioblaze and the proposed VLIW architecture, with different query and reference widths (axes: clock cycles [x10^6] and clock cycle ratio versus sequence length).
respectively, is achieved. This shows that the SW algorithm achieves much better raw performance in the proposed architecture than in the other architectures, showcasing the advantages of better data-level parallelism along the anti-diagonal.
In addition to the CCPCU comparison, the attained raw throughput, evaluated in Cell Updates per Second (CUPS), was also considered (see figure 5.4(b)). This metric accounts for the total number of cells (given by the length of the query sequence, m, times the length of the reference sequence, n) that are updated in the corresponding runtime t, in seconds (accounting for the maximum operating frequency of each implementation platform): CUPS = (m × n)/t. Therefore, the higher the CUPS, the better the performance.
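As an illustration with hypothetical values, aligning a 1000-symbol query against a 4096-symbol reference in 10 ms would yield (1000 × 4096)/0.01 = 409.6 × 10^6 CUPS, i.e., 409.6 MCUPS.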
The analysis of the MCUPS metric demonstrates that, despite using a considerably lower operating frequency than the other architectures, the proposed architecture achieves a throughput superior to both the ARM (2.54x) and the Bioblaze (5.01x). However, as would be expected, the Intel i7 achieves a much superior throughput (7.2x over the proposed architecture), given its much higher operating frequency (31.17x over the proposed architecture).

Figure 5.4: Performance evaluation results for the SW algorithm implementation in all evaluation architectures. (a) Clock Cycles per Cell Update (CCPCU): ARM Cortex-A9: 4.29; BioBlaze: 1.7; Intel Core i7 3820: 1.35; proposed architecture: 0.31. (b) Mega Cell Updates per Second (MCUPS): ARM Cortex-A9: 124.24; BioBlaze: 62.94; Intel Core i7 3820: 2274.07; proposed architecture: 315.18.
5.2.3.B Viterbi
The average number of clock cycles required by the Viterbi algorithm to execute the DNA sequence alignment in the considered architectures is presented in Table 5.3. This table depicts the results obtained for an alignment between selected reference sequences from the dataset and a fixed query sequence with a length of 10000 symbols. It also includes the respective clock cycle ratios between the reference architectures and the proposed architecture (relating the observed differences in terms of clock cycles).
Table 5.3: Average number of clock cycles for different DNA reference sequences matched against a 10000 element query sequence using the Viterbi algorithm, when implemented in the considered execution platforms.

                          Clock Cycles (c.c.) [x10^6]
Reference   Proposed       ARM Cortex-A9   c.c.     Intel Core   c.c.
Size        Architecture   (NEON)          ratio    i7 3820      ratio
200           6             130            22.624    46.05       7.68
472          14             311            22.911    54.17       3.29
900          26             565            21.822    95.17       3.66
1305         38             817            21.754   141.22       3.72
1727         50            1188            23.885   190.34       3.81
2204         63            1513            23.847   239.46       3.80
2532         73            1771            24.297   251.11       3.28
2991         86            2117            24.588   288.58       3.36
Similarly to what was done for the SW algorithm, figure 5.5 depicts additional plots representing the average number of clock cycles and the corresponding variation for the Viterbi algorithm implementation over several query-reference sets, when considering the proposed architecture and the ARM Cortex-A9. The Intel architecture is not presented, for clarity. Additionally, since it implements the same algorithm as the ARM Cortex-A9, it yields very similar instruction sequences, resulting in a very similar plot (after accounting for the performance differences).

(a) Average number of clock cycles for a fixed query sequence composed of 10000 symbols and multiple reference sequences.
(b) Average number of clock cycles for a fixed reference sequence composed of 2991 symbols and multiple query sequences.
Figure 5.5: Comparison of the average number of clock cycles between the ARM Cortex-A9 and the proposed VLIW architecture, when executing the Viterbi algorithm with different query and reference widths (axes: clock cycles [x10^6] and clock cycle ratio versus sequence length).
The plot in figure 5.5(a) refers to the average number of clock cycles for a fixed query sequence (composed of 10000 symbols) aligned against multiple references, while the plot in figure 5.5(b) refers to the average number of clock cycles for a fixed reference sequence (with a length of 2991 symbols) aligned against multiple query sequences. As can be observed, increasing the reference sequence length leads to a very slow stabilization of the clock cycle ratio of the proposed architecture over the ARM, which reaches a value of 25. When varying the query sequence length, the clock cycle ratio stabilizes very quickly with the length increase, at a value of 24.6. Given the slow rate at which the clock cycle ratio stabilizes in the plot of figure 5.5(a), these results demonstrate that the impact caused by the critical sections in the outer loop of the algorithm implementation in the proposed architecture is negligible when compared to the other architectures.
Figure 5.6(a) depicts the CCPCU metric. Following a trend entirely similar to the previously presented results, the proposed architecture achieves a CCPCU 23.4x lower than the ARM and