First Place Memocode'14 Design Contest Entry

Post on 01-Jul-2015

DESCRIPTION

This is what I presented at the 2014 Memocode conference on Iowa State's winning design contest entry. The team was led by me.

Transcript

A High Performance Systolic Architecture for k-NN Classification

Kevin Townsend, Philip Jones, Joseph Zambreno

Reconfigurable Computing Laboratory, Iowa State University

MEMOCODE’14

Townsend (RCL@ISU) k-NN on FPGA MEMOCODE’14 1 / 11

Outline

1 The Competition

2 Our Approach

3 Hardware Design: Platform, Systolic Array, Processing Element, Dot Product, Sort

4 Results


The Competition

Problem Statement

k Nearest Neighbors

32 Dimensional Space or 32 element length vectors

1,000 (M) test vectors

10,000,000 (N) train vectors

Values are 12 bits

Mahalanobis distance $\sqrt{(x - y)^T S^{-1} (x - y)}$ vs Euclidean distance $\sqrt{(x - y)^T (x - y)}$, where x is a training vector and y is a testing vector.

Better results for some problems

1024 multiplications vs 32
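
To make the two distance definitions concrete, here is a minimal Python sketch using plain lists. The function names and the tiny 2-element example are illustrative assumptions, not part of the contest code; the contest vectors have 32 elements, which is where the 1024 (32 × 32) multiplications of the matrix-vector step come from.

```python
import math

def mahalanobis(x, y, s_inv):
    """sqrt((x - y)^T S^-1 (x - y)) for plain Python lists (illustrative helper)."""
    d = [a - b for a, b in zip(x, y)]
    # Matrix-vector product S^-1 (x - y): len(d) ** 2 multiplications.
    sd = [sum(row[j] * d[j] for j in range(len(d))) for row in s_inv]
    return math.sqrt(sum(a * b for a, b in zip(d, sd)))

def euclidean(x, y):
    """sqrt((x - y)^T (x - y)): only len(x) multiplications."""
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(a * a for a in d))

x = [3, 0]
y = [0, 4]
identity = [[1, 0], [0, 1]]   # with S^-1 = I both distances agree
scaled = [[4, 0], [0, 1]]     # hypothetical S^-1 that weights the first axis

print(euclidean(x, y))              # 5.0
print(mahalanobis(x, y, identity))  # 5.0
print(mahalanobis(x, y, scaled))    # sqrt(52)
```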


Our Approach

Optimizations

We choose a brute force solution: all 10,000,000,000 (M × N) products.

$(x - y)^T S^{-1} (x - y)$ is used because $\sqrt{\cdot}$ is an increasing function, so dropping the square root does not change the ordering of the neighbors.

Rewriting it as $(x - y)^T (S^{-1}x - S^{-1}y)$ reduces the computation from 1024 multiplications to 32 multiplications.

$S^{-1}x$ and $S^{-1}y$ can be calculated ahead of time. (Only 10,001,000 matrix-vector multiplications)

This results in approximately 1.3 trillion integer operations.
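
The rewrite above relies only on linearity of the matrix-vector product. The following Python sketch checks it on a small integer example and redoes the operation count. The helper names and the 2 × 2 matrix are assumptions for illustration; the figure of 128 operations per pair is taken from the Dot Product Pipeline slide.

```python
def mat_vec(m, v):
    """Matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def direct(x, y, s_inv):
    """(x - y)^T S^-1 (x - y), with the matrix product done per pair."""
    d = [a - b for a, b in zip(x, y)]
    return dot(d, mat_vec(s_inv, d))

def precomputed(x, y, sx, sy):
    """(x - y)^T (S^-1 x - S^-1 y), with sx and sy computed ahead of time."""
    return dot([a - b for a, b in zip(x, y)],
               [a - b for a, b in zip(sx, sy)])

s_inv = [[2, 1], [1, 3]]          # hypothetical S^-1
x, y = [5, 7], [1, 2]
sx, sy = mat_vec(s_inv, x), mat_vec(s_inv, y)
print(direct(x, y, s_inv), precomputed(x, y, sx, sy))  # 147 147

M, N, OPS_PER_PAIR = 1_000, 10_000_000, 128
total_ops = M * N * OPS_PER_PAIR
print(total_ops)                  # 1280000000000, about 1.3 trillion
```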


Our Approach

High level approach

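[Figure: data flow between Host and Coprocessor. The arrays trainA, trainB, testA, and testB appear on both sides; the coprocessor runs the Mahalanobis Product and k-NN steps and returns ret, which the host prints. Size labels in the figure: 0.6GB, 1.3GB, 64KB, 128KB, 256KB. The timed region is marked by "start time" and "end time".]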


Hardware Design Platform

The Convey Platform

[Figure: block diagram showing 64 kNN PE blocks and Memory Controllers 1 to 8.]

Design a k-NN processing element (PE) with one floating point multiply-accumulator (MAC).

Duplicate the PE block as many times as possible.

Give each PE access to memory.


Hardware Design Systolic Array

Systolic Arrays

[Figure: testA, testB, trainA, trainB, and ret pass through a chain of k-NN PEs: k-NN PE, k-NN PE, k-NN PE, k-NN PE, ...]

Solves routing problem
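
A rough software model may help show why a chain eases routing: each PE talks only to its neighbor, so no signal has to fan out to every PE at once. The Python sketch below is an assumption-laden toy, not the actual hardware: it uses scalar values instead of 32-element vectors, gives each PE one cached test value, and tracks a single nearest neighbor. The class and function names are hypothetical.

```python
class PE:
    """Toy processing element: one cached test value, one pipeline register."""

    def __init__(self, test_value):
        self.test_value = test_value
        self.reg = None    # value handed to the next PE on the following cycle
        self.best = None   # (distance, train_value) of the nearest value seen

    def step(self, incoming):
        outgoing = self.reg          # forward what arrived one cycle earlier
        self.reg = incoming
        if incoming is not None:
            dist = abs(incoming - self.test_value)
            if self.best is None or dist < self.best[0]:
                self.best = (dist, incoming)
        return outgoing

def run_chain(test_values, train_stream):
    pes = [PE(t) for t in test_values]
    # Feed the stream, then push empty slots until the last PE has seen it all.
    for item in list(train_stream) + [None] * len(pes):
        for pe in pes:
            item = pe.step(item)     # each PE only receives from its neighbor
    return [pe.best[1] for pe in pes]

print(run_chain([10, 50], [7, 48, 12, 90]))  # [12, 48]
```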


Hardware Design Processing Element

Single Processing Element

[Figure: a kNN PE. Inputs: Data in (192 bits wide), Opcode in, Index in. Outputs: Data out, Opcode out, Index out. Internal blocks: Buffer, Train Buffer, Test Cache, Product, Sort.]

Resources per block:

Buffer: ≈ 1536 Registers

Test Cache: 660 Registers, 560 LUTs

Train Buffer: ≈ 1536 Registers, ≈ 768 LUTs

Product: 8704 Registers, 6806 LUTs, 20 DSPs

Sort: 316 Registers, 388 LUTs, 7 Block RAMs


Hardware Design Dot Product

Dot Product Pipeline

31, 12-bit subtracters

31, 24-bit subtracters

32, 13x25-bit multipliers

31, 45-bit adder tree

≈ 128 integer operators

150 MHz, 128 processing elements

2.4 trillion operations per second (150 MHz × 128 processing elements × 128 operators)

[Figure: pipeline from inputs testA, testB, trainA, trainB to output product, with stages Vector Subtracter, Vector Subtracter, Vector Multiplier, Adder Tree.]
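
The pipeline stages can be modeled in a few lines of Python. This is a sketch under assumptions: it supposes that testA and trainA hold the raw vectors and testB and trainB hold the vectors premultiplied by S^-1, which matches the 12-bit and 24-bit subtracter widths but is not stated on the slide. The function names are hypothetical, and the model ignores bit widths and pipelining.

```python
def adder_tree(values):
    """Sum by pairwise addition, level by level, as an adder tree would."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)          # pad odd levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def product(test_a, test_b, train_a, train_b):
    diff_a = [x - y for x, y in zip(train_a, test_a)]   # first vector subtracter
    diff_b = [x - y for x, y in zip(train_b, test_b)]   # second vector subtracter
    terms = [p * q for p, q in zip(diff_a, diff_b)]     # vector multiplier
    return adder_tree(terms)                            # adder tree

print(product([1, 2, 3, 4], [1, 1, 1, 1], [2, 4, 6, 8], [3, 3, 3, 3]))  # 20

# Throughput estimate from the slide's figures.
ops_per_second = 150_000_000 * 128 * 128
print(ops_per_second)  # 2457600000000, about 2.4 trillion
```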


Hardware Design Sort

Sort

[Figure, animated: a Counter and the incoming product feed a Bouncer (entries B0 to B3, with B1 = 100), followed by an Inserter and a RAM holding the sorted values V0 to V3 = 7, 19, 42, 68. In the frames, a product of 13 arrives, passes the Bouncer, and is inserted step by step, displacing 19, 42, and 68 in turn. The final frame shows the RAM holding 7, 13, 19, 42 and the Bouncer entry updated to B1 = 68.]
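
One way to read the frames is as a bounded sorted list guarded by a threshold: the Bouncer rejects any product that cannot be among the k smallest, and the Inserter places accepted products in order and evicts the largest. The Python sketch below follows that reading and is an interpretation, not the hardware design. The class name, the use of `bisect`, and the rule that the evicted value becomes the new bound (suggested by B1 changing from 100 to 68) are all assumptions.

```python
import bisect

class TopK:
    """Keep the k smallest values seen, with a bouncer-style bound."""

    def __init__(self, k, initial_bound):
        self.k = k
        self.bound = initial_bound   # values at or above this are bounced
        self.values = []             # kept values, always sorted ascending

    def offer(self, value):
        if value >= self.bound:      # Bouncer: cannot be among the k smallest
            return False
        bisect.insort(self.values, value)       # Inserter: sorted insertion
        if len(self.values) > self.k:
            self.bound = self.values.pop()      # evicted value becomes the bound
        return True

top = TopK(k=4, initial_bound=100)
for v in (7, 19, 42, 68):
    top.offer(v)
top.offer(13)
print(top.values, top.bound)   # [7, 13, 19, 42] 68
print(top.offer(80))           # False, bounced by the bound of 68
```

Using the evicted value as the bound is conservative: every kept value is no larger than it, so a bounced product could not have displaced any of them.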


Results

Results

1.3 trillion integer operations / 2.4 trillion integer operations per second = 0.54 seconds.

Actual runtime is 0.54 seconds.

Paper at: http://www.rcl.ece.iastate.edu/sites/default/files/papers/TowJon14A.pdf
