Page 1: Hardware Acceleration for Load Flow Computation

Hardware Acceleration for Load Flow Computation

Jeremy Johnson, Chika Nwankpa and Prawat Nagvajara
Kevin Cunningham, Tim Chagnon, Petya Vachranukunkiet

Computer Science and Electrical and Computer Engineering
Drexel University

Eighth Annual CMU Conference on the Electricity Industry
March 13, 2012

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 1 / 47

Page 2: Hardware Acceleration for Load Flow Computation

Overview

I Problem and Goals

I Background

I LU Hardware

I Merge Accelerator Design

I Accelerator Architectures

I Performance Results

I Conclusions

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 2 / 47

Page 3: Hardware Acceleration for Load Flow Computation

Problem

I Problem: Sparse Lower-Upper (LU) Triangular Decomposition performs inefficiently on general-purpose processors (irregular data access, indexing overhead, less than 10% FP efficiency on power systems) [3, 12]

I Solution: Custom designed hardware can provide better performance (3X speedup over Pentium 4 3.2GHz) [3, 12]

I Problem: Complex custom hardware is difficult, expensive, and time-consuming to design

I Solution: Use a balance of software and hardware accelerators (ex: FPU, GPU, Encryption, Video/Image Encoding, etc.)

I Problem: Need an architecture that efficiently combines general-purpose processors and accelerators

[3] T. Chagnon. Architectural Support for Direct Sparse LU Algorithms.
[12] P. Vachranukunkiet. Power Flow Computation using Field Programmable Gate Arrays.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 3 / 47

Page 8: Hardware Acceleration for Load Flow Computation

Goals

I Increase performance of sparse LU over existing software

I Design a flexible, easy to use hardware accelerator that can integrate with multiple platforms

I Find an architecture that efficiently uses the accelerator to improve sparse LU

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 4 / 47

Page 11: Hardware Acceleration for Load Flow Computation

Summary of Results and Contributions

I Designed and implemented a merge hardware accelerator [5]
  I Data rate approaches one element per cycle

I Designed and implemented supporting hardware [6]

I Implemented a prototype Triple Buffer Architecture [5]
  I Efficiently utilize external transfer bus
  I Not suitable for sparse LU application
  I Advantageous for other applications

I Implemented a prototype heterogeneous multicore architecture [6]
  I Combines general-purpose processor and reconfigurable hardware cores
  I Speedup of 1.3X over sparse LU software with merge accelerator

I Modified Data Pump Architecture (DPA) Simulator
  I Added merge instruction and support
  I Implemented sparse LU on the DPA
  I Speedup of 2.3X over sparse LU software with merge accelerator

[5] Cunningham, Nagvajara. Reconfigurable Stream-Processing Architecture for Sparse Linear Solvers.

[6] Cunningham, Nagvajara, Johnson. Reconfigurable Multicore Architecture for Power Flow Calculation.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 5 / 47

Page 16: Hardware Acceleration for Load Flow Computation

Overview

I Problem and Goals

I Background

I LU Hardware

I Merge Accelerator Design

I Accelerator Architectures

I Performance Results

I Conclusions

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 6 / 47

Page 17: Hardware Acceleration for Load Flow Computation

Power Systems

I The power grid delivers electrical power from generation stations to distribution stations

I Power grid stability is analyzed with the Power Flow (aka Load Flow) calculation [1]

I System nodes and voltages create a system of equations, represented by a sparse matrix [1]

I Solving the system allows grid stability to be analyzed

I LU decomposition accounts for a large part of the Power Flow calculation [12]

I Power Flow calculation is repeated thousands of times to perform a full grid analysis [1]

Power Flow Execution Profile [12]

[1] A. Abur. Power System State Estimation: Theory and Implementation.
[12] P. Vachranukunkiet. Power Flow Computation using Field Programmable Gate Arrays.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 7 / 47

Page 19: Hardware Acceleration for Load Flow Computation

Power System Matrices

Power System Matrix Properties

System # Rows/Cols NNZ Sparsity

1648 Bus 2,982 21,196 0.238%

7917 Bus 14,508 105,522 0.050%

10278 Bus 19,285 134,621 0.036%

26829 Bus 50,092 351,200 0.014%

Power System LU Statistics

System    Avg. NNZ per Row    Avg. Num Submatrix Rows    % Merge

1648 Bus 7.1 8.0 65.6%

7917 Bus 7.3 8.5 54.3%

10279 Bus 7.0 8.3 55.8%

26829 Bus 7.0 8.7 62.7%
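
As a sanity check on the first table, sparsity is the stored non-zero count divided by the total number of matrix entries; a minimal C computation for the 1648 Bus row, with the constants copied from the table above:

#include <stdio.h>

int main(void) {
    double rows = 2982.0;                    /* # Rows/Cols for the 1648 Bus system */
    double nnz  = 21196.0;                   /* NNZ for the 1648 Bus system         */
    printf("%.3f%%\n", 100.0 * nnz / (rows * rows));   /* prints 0.238% */
    return 0;
}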

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 8 / 47

Page 20: Hardware Acceleration for Load Flow Computation

Power System LU

I Gaussian LU outperforms UMFPACK for power matrices [3]

I As matrix size grows, multi-frontal becomes more effective

I The merge is a bottleneck in Gaussian LU performance [3]

I Goal: Design a hardware accelerator for the Merge in Gaussian LU

[Figure: Gaussian LU vs UMFPACK on Intel Core i7 at 3.2GHz [3]; speedup (0 to 2) for the 1648, 7917, 10279, and 26829 Bus power systems]

[3] T. Chagnon. Architectural Support for Direct Sparse LU Algorithms.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 9 / 47

Page 21: Hardware Acceleration for Load Flow Computation

Sparse Matrices

I A matrix is sparse when it has a large number of zero elements

I Below is a small section of the 26K-Bus power system matrix

A =

[ 122.03   -61.01    -5.95      0        0     ]
[ -61.01   277.93    19.38      0        0     ]
[   5.95   -19.56   275.58      0        0     ]
[      0        0        0   437.82    67.50   ]
[      0        0        0   -67.50   437.81   ]

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 10 / 47

Page 22: Hardware Acceleration for Load Flow Computation

Sparse Compressed Formats

I Sparse matrices use a compressed format to ignore zero elements

I Saves storage space

I Decreases amount of computation

Row    Column:Value

0      0:122.03   1:-61.01   2:-5.95

1      0:-61.01   1:277.93   2:19.38

2      0:5.95     1:-19.56   2:275.58

3      3:437.82   4:67.50

4      3:-67.50   4:437.81
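
A minimal C sketch of this column:value row storage; the .col and .val field names match the merge code shown later in the slides, while the struct layout and the Row wrapper are illustrative assumptions:

typedef struct {              /* one stored non-zero: column index and value */
    int    col;
    double val;
} Element;

typedef struct {              /* one compressed row: only non-zeros are kept */
    Element *elems;           /* sorted by increasing column index           */
    int      len;
} Row;

/* Example: row 0 of the matrix above holds three non-zeros. */
static Element row0_data[] = { {0, 122.03}, {1, -61.01}, {2, -5.95} };
static Row     row0        = { row0_data, 3 };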

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 11 / 47

Page 23: Hardware Acceleration for Load Flow Computation

Sparse LU Decomposition

A =
[ 122.03   -61.01    -5.95      0        0     ]
[ -61.01   277.93    19.38      0        0     ]
[   5.95   -19.56   275.58      0        0     ]
[      0        0        0   437.82    67.50   ]
[      0        0        0   -67.50   437.81   ]

L =
[  1.00       0       0      0      0   ]
[ -0.50    1.00       0      0      0   ]
[  0.05   -0.07    1.00      0      0   ]
[     0       0       0   1.00      0   ]
[     0       0       0  -0.15   1.00   ]

U =
[ 122.03   -61.01    -5.95      0        0     ]
[      0   247.42    16.41      0        0     ]
[      0        0   276.97      0        0     ]
[      0        0        0   437.82    67.50   ]
[      0        0        0        0   448.22   ]

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 12 / 47

Page 24: Hardware Acceleration for Load Flow Computation

Sparse LU Methods

I Approximate Minimum Degree (AMD) algorithm pre-orders matrices to reduce fill-in [2]

I Multiple methods for performing sparse LU:
  I Multi-frontal
    I Used by UMFPACK [7]
    I Divides matrix into multiple, independent, dense blocks
  I Gaussian Elimination
    I Eliminates elements below the diagonal by scaling and adding rows together

[2] Amestoy et al. Algorithm 837: AMD, an approximate minimum degree ordering algorithm.
[7] T. Davis. Algorithm 832: UMFPACK V4.3 - An Unsymmetric-Pattern Multi-Frontal Method.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 13 / 47

Page 25: Hardware Acceleration for Load Flow Computation

Sparse LU Decomposition

Algorithm 1 Gaussian LU

for i = 1 → N do
    pivot_search()
    update_U()
    for j = 1 → NUM_SUBMATRIX_ROWS do
        merge(pivot_row, j)
        update_L()
        update_colmap()
    end for
end for
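
A C-style rendering of Algorithm 1 for readers who prefer code; the helper names come from the pseudocode above, but their signatures and the loop bookkeeping are assumptions for illustration, not the authors' implementation:

/* Placeholder declarations implied by the pseudocode (not from the slides). */
int  pivot_search(int col);                 /* pick pivot row for column col      */
void update_U(int pivot_row, int col);      /* emit pivot row into U              */
int  num_submatrix_rows(int col);           /* rows with a non-zero in column col */
void merge(int pivot_row, int other_row);   /* eliminate the column from a row    */
void update_L(int pivot_row, int other_row);
void update_colmap(int other_row);

void gaussian_lu(int N) {
    for (int i = 0; i < N; i++) {
        int pivot = pivot_search(i);
        update_U(pivot, i);
        int nrows = num_submatrix_rows(i);
        for (int j = 0; j < nrows; j++) {
            merge(pivot, j);
            update_L(pivot, j);
            update_colmap(j);
        }
    }
}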

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 14 / 47

Page 26: Hardware Acceleration for Load Flow Computation

Column Map

I Column map keeps track of the rows with non-zero elements in each column

I Do not have to search matrix on each iteration

I Fast access to all rows in a column

I Select pivot row from the set of rows

I Column map is updated after each merge

I Fill-in elements add new elements to a row during the merge
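
A rough C sketch of what such a column map could look like, based on the description above; the fixed-size layout and the function name are assumptions for illustration, not the authors' data structure:

#define MAX_ROWS_PER_COL 64                  /* assumed bound, for illustration only */

typedef struct {
    int rows[MAX_ROWS_PER_COL];              /* rows with a non-zero in this column  */
    int count;
} ColList;

/* After a merge creates fill-in at (row, col), register the new non-zero so the
 * next pivot search over column col finds this row without rescanning the matrix. */
static void colmap_add(ColList *colmap, int col, int row) {
    ColList *cl = &colmap[col];
    if (cl->count < MAX_ROWS_PER_COL)
        cl->rows[cl->count++] = row;
}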

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 15 / 47

Page 27: Hardware Acceleration for Load Flow Computation

Sparse LU Decomposition

[Figure: example of merging two sparse rows, Row u and Row v, into Row u+v across columns 0-7]

I Challenges with software merging:
  I Indexing overhead
  I Column numbers are different from row array indices
  I Must fetch column numbers from memory to operate on elements
  I Cache misses
  I Data-dependent branching

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 16 / 47

Page 33: Hardware Acceleration for Load Flow Computation

Sparse LU Decomposition

Merge Algorithm

int p, s;
int rnz = 0, fnz = 0;
for (p = 1, s = 1; p < pivotLen && s < nonpivotLen; ) {
    if (nonpivot[s].col == pivot[p].col) {          /* both rows have this column  */
        merged[rnz].col = pivot[p].col;
        merged[rnz].val = (nonpivot[s].val - lx * pivot[p].val);
        rnz++; p++; s++;
    } else if (nonpivot[s].col < pivot[p].col) {    /* only the non-pivot row      */
        merged[rnz].col = nonpivot[s].col;
        merged[rnz].val = nonpivot[s].val;
        rnz++; s++;
    } else {                                        /* only the pivot row: fill-in */
        merged[rnz].col = pivot[p].col;
        merged[rnz].val = (-lx * pivot[p].val);
        fillin[fnz] = pivot[p].col;
        rnz++; fnz++; p++;
    }
}
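
The slide shows only the main loop; a complete merge would presumably also drain whichever input row still has elements left, along these lines (a sketch consistent with the variable names above, not code from the slides):

while (s < nonpivotLen) {                    /* leftover non-pivot elements copy through */
    merged[rnz].col = nonpivot[s].col;
    merged[rnz].val = nonpivot[s].val;
    rnz++; s++;
}
while (p < pivotLen) {                       /* leftover pivot elements become fill-in   */
    merged[rnz].col = pivot[p].col;
    merged[rnz].val = -lx * pivot[p].val;
    fillin[fnz] = pivot[p].col;
    rnz++; fnz++; p++;
}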

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 18 / 47

Page 34: Hardware Acceleration for Load Flow Computation

Overview

I Problem and Goals

I Background

I LU Hardware

I Merge Accelerator Design

I Accelerator Architectures

I Performance Results

I Conclusions

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 19 / 47

Page 35: Hardware Acceleration for Load Flow Computation

LU Hardware

I Originally designed by Petya Vachranukunkiet [12]

I Straightforward Gaussian Elimination

I Parameterized for power system matrices

I Streaming operations hide latencies

I Computation pipelined for 1 Flop / cycle

I Memory and FIFOs used for cache and bookkeeping

I Multiple merge units possible

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 20 / 47

Page 36: Hardware Acceleration for Load Flow Computation

Pivot and Submatrix Update Logic

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 21 / 47

Page 37: Hardware Acceleration for Load Flow Computation

Custom Cache Logic

I Cache line is a whole row

I Fully-associative, write-back, FIFO replacement

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 22 / 47

Page 38: Hardware Acceleration for Load Flow Computation

Reconfigurable Hardware

I Field Programmable Gate Array (FPGA)

I Advantages:
  I Low power consumption
  I Reconfigurable
  I Low cost
  I Pipelining and parallelism

I Disadvantages:
  I Difficult to program
  I Long design compile times
  I Lower clock frequencies

[Figures: SRAM Logic Cell [13]; FPGA Architecture [13]]

[13] Xilinx, Inc. Programmable Logic Design Quick Start Handbook.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 23 / 47

Page 39: Hardware Acceleration for Load Flow Computation

ASIC vs FPGA

Application-Specific Integrated Circuit (ASIC)

I Higher clock frequency

I Lower power consumption

I More customization

Field Programmable Gate Array (FPGA)

I Reconfigurable

I Design updates and corrections

I Multiple designs in same area

I Lower cost

I Less implementation time

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 24 / 47

Page 40: Hardware Acceleration for Load Flow Computation

Overview

I Problem and Goals

I Background

I LU Hardware

I Merge Accelerator Design

I Accelerator Architectures

I Performance Results

I Conclusions

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 25 / 47

Page 41: Hardware Acceleration for Load Flow Computation

Merge Hardware Accelerator

I Input and Output BlockRAM FIFOs

I Two compare stages allow look-ahead comparison

I Pipelined floating point adder

I Merge unit is pipelined and outputs one element per cycle

I Latency is dependent on the structure and length of input rows

[Figure: merge unit datapath with U and V input FIFOs, two compare stages (CMP), a floating point adder, and a Merged output FIFO]

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 26 / 47

Page 42: Hardware Acceleration for Load Flow Computation

Merge Hardware Manager

I Must manage rows going through merge units

I Pivot row is recycled

I Floating point multiplier scales pivot row before entering merge unit

I Can support multiple merge units

I Input and output rotate among merge units after each full row

I Rows enter and leave in the same order

[Figure: merge manager with multiple merge units, each fed by Pivot and NonPivot FIFOs through a scaling multiplier and draining into an Output FIFO]

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 27 / 47

Page 43: Hardware Acceleration for Load Flow Computation

Merge Performance

Software Merge Performance on Core i7 at 3.2GHz

Power System    Average Data Rate                      Average Cycles
                (Millions of Elements per Second)      per Element

1648 Bus        90.77                                  35.3

7917 Bus        99.54                                  32.2

10279 Bus       100.75                                 31.8

26829 Bus       97.46                                  32.8

I Merge accelerator:
  I Outputs one element per cycle
  I Data rate approaches hardware clock rate
  I FPGA clock frequencies provide better merge performance than processor

I Goal: Find an architecture that efficiently delivers data to the accelerator
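
The two measured columns are tied together by the processor clock; a one-line check of the 1648 Bus row, assuming the 3.2GHz figure from the table caption:

#include <stdio.h>

int main(void) {
    double clock_hz = 3.2e9;                 /* Core i7 clock from the table caption */
    double rate     = 90.77e6;               /* 1648 Bus: elements merged per second */
    printf("%.1f cycles per element\n", clock_hz / rate);   /* prints 35.3 */
    return 0;
}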

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 28 / 47

Page 44: Hardware Acceleration for Load Flow Computation

Overview

I Problem and Goals

I Background

I LU Hardware

I Merge Accelerator Design

I Accelerator Architectures

I Performance Results

I Conclusions

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 29 / 47

Page 45: Hardware Acceleration for Load Flow Computation

Summary of Design Methods

I Completely Software (UMFPACK, Gaussian LU)
  I Inefficient performance

I Completely Hardware (LUHW on FPGA)
  I Large design time and effort
  I Less scalable

I Investigate three accelerator architectures:
  I Triple Buffer Architecture
    I Efficiently utilize external transfer bus to a reconfigurable accelerator
  I Heterogeneous Multicore Architecture
    I Combine general-purpose and reconfigurable cores in a single package
  I Data Pump Architecture
    I Take advantage of programmable data/memory management to efficiently feed an accelerator

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 30 / 47

Page 46: Hardware Acceleration for Load Flow Computation

Accelerator Architectures

[Figure: CPU and FPGA, each with its own memory, connected by a transfer bus]

I Common setup for processor and FPGA

I Each device has its own external memory

I Communicate by USB, Ethernet, PCI-Express, HyperTransport, ...

I Transfer bus can be bottleneck between processor and FPGA

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 31 / 47

Page 47: Hardware Acceleration for Load Flow Computation

Triple Buffer Architecture

[Figure: CPU and FPGA connected through a transfer bus to three buffers (Buffer 1, 2, 3) in low latency memory banks]

I How to efficiently use transfer bus?

I Break data into blocks and buffer

I Three buffers eliminate competition on memory ports

I Allows continuous stream of data through FPGA accelerator

I Processor can transfer data at bus/memory rate, not limited by FPGA rate
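
A software-level sketch of the buffer rotation described above; the scheme is the one the slide describes, but the function names are hypothetical stand-ins, and the three steps would overlap in time on real hardware rather than run sequentially as written here:

/* Hypothetical helpers: push a block over the bus, run the accelerator, pull results. */
void cpu_write_block(int buf, int block);
void fpga_process(int buf);
void cpu_read_block(int buf);

void stream_blocks(int nblocks) {
    int fill = 0, compute = 1, drain = 2;    /* each buffer holds a different role    */
    for (int b = 0; b < nblocks; b++) {
        cpu_write_block(fill, b);            /* CPU refills one buffer over the bus   */
        fpga_process(compute);               /* FPGA accelerator consumes a second    */
        cpu_read_block(drain);               /* finished results drain from the third */
        int t = drain;                       /* rotate the roles for the next block   */
        drain = compute; compute = fill; fill = t;
    }
}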

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 32 / 47

Page 48: Hardware Acceleration for Load Flow Computation

Heterogeneous Multicore Architecture

I Communication with an external transfer bus requires additional overhead

I Merge accelerator does not require a large external FPGA

I Closely integrate processor and FPGA

I Share same memory system

Reconfigurable Processor Architecture [8]

[8] Garcia and Compton. A Reconfigurable Hardware Interface for a Modern Computing System.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 33 / 47

Page 49: Hardware Acceleration for Load Flow Computation

Heterogeneous Multicore Architecture

[Figure: CPU (MicroBlaze) and a reconfigurable core sharing DDR memory through AXI, AXI-Lite, AXI-Stream, and DMA interconnect [8]]

[8] Garcia and Compton. A Reconfigurable Hardware Interface for a Modern Computing System.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 34 / 47

Page 50: Hardware Acceleration for Load Flow Computation

Heterogeneous Multicore Architecture

[Figure: CPU (MicroBlaze) and a reconfigurable core sharing DDR memory through AXI, AXI-Lite, AXI-Stream, and DMA interconnect]

I Processor sends DMA requests to DMA module

I DMA module fetches data from memory and streams to accelerator

I Accelerator streams outputs back to DMA module to store to memory
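
A rough C sketch of that request flow; the helper names are hypothetical stand-ins for a DMA driver, not the actual AXI DMA API used in the prototype:

/* Hypothetical DMA helpers: queue a transfer request and wait for completion. */
void dma_send(const void *src, unsigned nbytes);   /* memory -> accelerator stream */
void dma_recv(void *dst, unsigned nbytes);         /* accelerator stream -> memory */
void dma_wait(void);

void merge_rows_via_dma(const void *pivot_row, const void *other_row,
                        void *merged_row, unsigned in_bytes, unsigned out_bytes) {
    dma_send(pivot_row, in_bytes);           /* processor only issues requests;    */
    dma_send(other_row, in_bytes);           /* the DMA module streams the rows in */
    dma_recv(merged_row, out_bytes);         /* and the merged row back to memory  */
    dma_wait();
}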

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 35 / 47

Page 51: Hardware Acceleration for Load Flow Computation

Data Pump Architecture

Data Pump Architecture [11]

I Developed as part of the SPIRAL project
I Intended as a signal processing platform
I Processors explicitly control data movement
I No automatic data caches
I Data Processor (DP) moves data between external and local memory
I Compute Processor (CP) moves data between local memory and vector processors

[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 36 / 47

Page 52: Hardware Acceleration for Load Flow Computation

Data Pump Architecture

Data Pump Architecture [11]

I Replace vector processors with merge accelerator

I DP brings rows from external memory into local memory

I CP sends rows to merge accelerator and writes results in local memory

I DP stores results back to external memory

[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 37 / 47

Page 53: Hardware Acceleration for Load Flow Computation

Overview

I Problem and Goals

I Background

I LU Hardware

I Merge Accelerator Design

I Accelerator Architectures

I Performance Results

I Conclusions

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 38 / 47

Page 54: Hardware Acceleration for Load Flow Computation

DPA Implementation

Data Pump Architecture [11]

I Implemented Sparse LU on DPA Simulator [10]

I DP and CP synchronize with local memory read/write counters

I Double buffer row inputs and outputs in local memory

I DP and CP at 2GHz, DDR at 1066MHz

I Single merge unit

[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration.
[10] D. Jones. Data Pump Architecture Simulator and Performance Model.
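
A simplified C sketch of the counter-based handshake described above; the buffer indexing and helper names are assumptions for illustration, not the simulator's code:

volatile int dp_loaded = 0;                  /* iterations whose input rows DP has loaded   */
volatile int cp_done   = 0;                  /* iterations whose merged rows CP has written */

/* Hypothetical helpers: move rows between DDR, local memory, and the merge unit. */
void dp_load_rows(int iter, int buf);
void cp_merge_rows(int iter, int buf);

void dp_loop(int N) {                        /* Data Processor                      */
    for (int i = 0; i < N; i++) {
        while (dp_loaded - cp_done >= 2) ;   /* both double buffers in use: wait    */
        dp_load_rows(i, i & 1);              /* DDR -> local buffer i mod 2         */
        dp_loaded++;                         /* hand the buffer to the CP           */
    }                                        /* (store-back of merged rows omitted) */
}

void cp_loop(int N) {                        /* Compute Processor                   */
    for (int i = 0; i < N; i++) {
        while (dp_loaded <= cp_done) ;       /* wait for DP to finish loading i     */
        cp_merge_rows(i, i & 1);             /* feed the rows to the merge unit     */
        cp_done++;                           /* release the buffer back to the DP   */
    }
}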

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 39 / 47

Page 55: Hardware Acceleration for Load Flow Computation

DPA Implementation

[Flowchart: Sparse LU DP — initialize matrix in external memory; load row and column counts, and column map, to local memory; loop while < N: load pivot row, load other submatrix rows, wait for CP, store merged rows to external memory; then store updated counts to external memory; end]

[Flowchart: Sparse LU CP — wait for DP to load counts; loop while < N: wait for DP to load submatrix rows, merge rows, update column map; end]

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 40 / 47

Page 56: Hardware Acceleration for Load Flow Computation

DPA Performance

I Increase performance by double buffering rows

I DP alternates loading to two input buffers

I CP alternates outputs to two output buffers

I DP and CP at 2GHz, DDR at 1066MHz

I Merge accelerator at same frequency as CP

[Figure: DPA single vs double buffer speedup against Gaussian LU (0 to 2.5) for the 1648, 7917, 10279, and 26829 Bus power systems]

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 41 / 47

Page 57: Hardware Acceleration for Load Flow Computation

DPA Performance

[Figure: speedup vs Gaussian LU (0 to 2.5) for the 1648, 7917, 10279, and 26829 Bus power systems, with the merge accelerator clocked 10X slower, 5X slower, 2X slower, and the same as the CP]

I Measured with different merge frequencies

I Slower merges do not provide speedup over software

I Faster merges require ASIC implementation

I Larger matrices see most benefit

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 42 / 47

Page 58: Hardware Acceleration for Load Flow Computation

DPA Performance

[Figure: speedup vs Gaussian LU for the 1648, 7917, 10279, and 26829 Bus power systems, with the merge accelerator clocked 10X slower, 5X slower, 2X slower, and the same as the CP]

I DP and CP at 3.2GHz

I DDR at 1066MHz

DPA LU Speedup vs Gaussian LU for 26K-Bus Power System

CP/DP Freq    10 times slower    5 times slower    2 times slower    Same as CP

2.0GHz        0.57X              1.00X             1.85X             2.27X

3.2GHz        0.86X              1.45X             2.28X             2.35X

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 43 / 47

Page 59: Hardware Acceleration for Load Flow Computation

Overview

I Problem and Goals

I Background

I LU Hardware

I Merge Accelerator Design

I Accelerator Architectures

I Performance Results

I Conclusions

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 44 / 47

Page 60: Hardware Acceleration for Load Flow Computation

Conclusions and Future Work

I Merge accelerator reduces indexing cost and improves sparse LU performance (2.3X improvement)

I DPA architecture provides a memory framework to utilize merge accelerator and outperforms triple buffering and heterogeneous multicore schemes investigated

I Software control of memory hierarchy allows alternate protocols such as double buffering to be investigated

I Additional modifications such as multiple merge units, row caching and prefetching will be required for further improvement

I With accelerator, high performance load flow calculation is possible with low power device

I Suggesting a distributed network of such devices

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 45 / 47

Page 66: Hardware Acceleration for Load Flow Computation

References

[1] A. Abur and A. Gómez Expósito. Power System State Estimation: Theory and Implementation. Marcel Dekker, New York, New York, USA, 2004.

[2] P. R. Amestoy, Timothy Davis, and I. S. Duff. Algorithm 837: AMD, an approximate minimum degree ordering algorithm, 2004.

[3] T. Chagnon. Architectural Support for Direct Sparse LU Algorithms. Masters thesis, Drexel University, 2010.

[4] DRC Computer. DRC Coprocessor System User's Guide, 2007.

[5] Kevin Cunningham and Prawat Nagvajara. Reconfigurable Stream-Processing Architecture for Sparse Linear Solvers. Reconfigurable Computing: Architectures, Tools and Applications, 2011.

[6] Kevin Cunningham, Prawat Nagvajara, and Jeremy Johnson. Reconfigurable Multicore Architecture for Power Flow Calculation. In Review for North American Power Symposium 2011, 2011.

[7] Timothy Davis. Algorithm 832: UMFPACK V4.3 - An Unsymmetric-Pattern Multi-Frontal Method. ACM Trans. Math. Softw, 2004.

[8] Philip Garcia and Katherine Compton. A Reconfigurable Hardware Interface for a Modern Computing System. 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007), pages 73-84, April 2007.

[9] Intel. Intel Atom Processor E6x5C Series-Based Platform for Embedded Computing. Technical report, 2011.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 46 / 47

Page 67: Hardware Acceleration for Load Flow Computation

[10] Douglas Jones. Data Pump Architecture Simulator and Performance Model. Masters thesis, 2010.

[11] SPIRAL. DPA Instruction Set Architecture V0.2 Basic Configuration (SPARC V8 Extension), 2008.

[12] Petya Vachranukunkiet. Power Flow Computation using Field Programmable Gate Arrays. PhD thesis, 2007.

[13] Xilinx Inc. Programmable Logic Design Quick Start Handbook, 2006.

Johnson (Drexel University) HW Acceleration for Load Flow March 13, 2012 47 / 47