Annapolis Wildstar FPGA Board
Post on 02-Feb-2016
Charles Ross, Monica Chawathe
WildStar Board (Simplified)
  [Block diagram: three Virtex 2000E FPGAs labeled “0”, “1” and “2”, four 1M memories and eight 2M memories, connected to the Host over the LAD Bus]
  3 Virtex 2000E FPGAs, 12 Memories (20 MB)
StarFire Board (Simplified)
  [Block diagram: one Virtex 1000 FPGA labeled “1” and six 1M memories, connected to the Host over the LAD Bus]
  1 Virtex 1000 FPGA, 6 Memories (6 MB)
Memory Layout
  Local
    Always 32-bit words
    Two on PE 1
    Two on PE 2
  Mezzanine
    32 or 64, depending on source (PEx / PE0)
      Both address and word size
    4 between PE 1 & 0
    4 between PE 2 & 0
  Latency: 4 cycles
Mezzanine Memory
  32 vs 64 (same memory)
  Switch Modes
    00 Straight
    01 Crossed
    10 Lo Thru
    11 Hi Thru
  [Diagram: two memories (Mem, Mem) between PEx and PE0, with widths labeled 64 and 32]
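PEx (1 and 2)
  [Diagram: Right and Left sides of the chip with ports RightLocal, LeftLocal, RightMezz, LeftMezz, plus LAD and “STUFF”]
PE0
  [Diagram: Right and Left sides of the chip with ports PE1RightMezz, PE1LeftMezz, PE2RightMezz, PE2LeftMezz, plus LAD and “STUFF”]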
Clocks – 4 of them!? K, M, P, U
  KClock: LAD Transactions (K?)
  MClock: Memory Transactions
  PClock: Processing Clock
  UClock: User Clock
  Okay, but why? What are they?
KClock – LAD
  Between PE and Host
  33MHz or 66MHz
    33MHz – Easy to Place and Route
    66MHz – 2X Host Bandwidth
    Host and Chip must agree!!
      Set in VHDL and Host Code
  Clock is actually based on PCI Clock
    Varies per host
    Ours is approx. 33.23MHz / 66.46MHz
  Asynchronous to all other clocks
MClock – Memory
  Speed of Memory IO
    Both Local & Mezzanine
  User Selectable
    25MHz – 133MHz Wildstar
    25MHz – 100MHz Starfire
PClock – Processing
  Based on MClock
    Divisor between 1-16
    Slower than MClock (or equal)
  Can “speed up” memory I/O
    Decoupling may allow different speeds
    Increase M, increase divisor
    Ex: slow component in application (30MHz)
      M = 30MHz & Divisor = 1 gives P = 30MHz
      M = 60MHz & Divisor = 2 gives P = 30MHz
        2 memory accesses per PClock cycle
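The divisor arithmetic above can be sketched in host-side C. This is only an illustration of P = M / divisor with the 1-16 range from the slide; the function names are made up for this example and are not part of the Annapolis API.

```c
#include <assert.h>

/* Divisor range as stated on the slide: PClock = MClock / divisor, 1..16. */
enum { PCLOCK_DIV_MIN = 1, PCLOCK_DIV_MAX = 16 };

/* PClock frequency in MHz for a given MClock and divisor. */
double pclock_mhz(double mclock_mhz, int divisor)
{
    assert(divisor >= PCLOCK_DIV_MIN && divisor <= PCLOCK_DIV_MAX);
    return mclock_mhz / divisor;
}

/* Smallest divisor that keeps PClock at or below the slowest component's
   limit. Returns 0 if no divisor in 1..16 is enough. */
int min_divisor_for(double mclock_mhz, double max_pclock_mhz)
{
    for (int d = PCLOCK_DIV_MIN; d <= PCLOCK_DIV_MAX; ++d) {
        if (mclock_mhz / d <= max_pclock_mhz)
            return d;
    }
    return 0;
}
```

With a 30MHz component, `min_divisor_for(60.0, 30.0)` picks divisor 2, which matches the second line of the example: the logic still runs at 30MHz while the memories run at 60MHz.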
PClock – Processing (More)
  Optional
    We normally don’t use it, for ease
    MClock is used directly
      Less logic than “P=M/1”
      No need to jump clock boundaries
  Chip must either
    Not care what the ratio is
    Know at compile time what the ratio will be
UClock – User Clock
  User Selectable
    0.32MHz – 133MHz Wildstar
    0.32MHz – 100MHz Starfire
  We have never used it
    3 is plenty, isn't it?
  Asynchronous to all other clocks
Hardware Components
  Roll your own
    Manual LAD addressing (33/66 differ)
    Manual memory use
      Contention
    Manual EVERYTHING!
    CAN be very fast
      ~140 MHz
  Annapolis Supplied Components
    MUCH easier
    Slower (approx. 40-60 MHz)
LAD Bus
  33MHz / 66MHz Selectable
    Changes the communication protocol
      Amt of latency, etc.
  Component addressing scheme
    0x0000-0x7FFF – Component within PE
    Higher bits address Board and PE
      Ignore them unless you “roll your own” LAD code
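LAD Bus (More)
  The addressing of the LAD bus
    A lot like subnet masks in IP networking
  MASK
    Which bits address the component
    Which bits are intra-component
  BASE
    Where does this component begin
  ADDR & MASK == BASE: “Are you talkin’ to ME?”
  ADDR & (~MASK) = “What address in me?”
  Examples:
    B: 0x4800 M: 0x7F00 covers 0x4800 ~ 0x48FF
    B: 0x3200 M: 0x7C00 covers 0x3200 ~ 0x35FF
      (Note: 0x3200 has a bit set outside this mask, so ADDR & MASK == BASE cannot hold as written; an aligned base such as 0x3000 would cover 0x3000 ~ 0x33FF)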
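The BASE/MASK test can be written out in a few lines of C. This is a sketch of the decoding rule on the slide only; the function names are invented for illustration, and the 0x7FFF limit is the intra-PE component range given earlier.

```c
#include <stdbool.h>
#include <stdint.h>

/* Component addresses inside a PE occupy 0x0000-0x7FFF (15 bits). */
#define LAD_ADDR_BITS 0x7FFFu

/* "Are you talkin' to ME?": the masked address must equal the base. */
bool lad_selects(uint16_t base, uint16_t mask, uint16_t addr)
{
    return (addr & mask) == base;
}

/* "What address in me?": the bits the mask leaves out. */
uint16_t lad_offset(uint16_t mask, uint16_t addr)
{
    return (uint16_t)(addr & ~(unsigned)mask & LAD_ADDR_BITS);
}

/* A base with bits set outside the mask can never be selected,
   so it is worth checking before wiring up a component. */
bool lad_base_aligned(uint16_t base, uint16_t mask)
{
    return (base & ~(unsigned)mask & LAD_ADDR_BITS) == 0;
}
```

For B: 0x4800 and M: 0x7F00, `lad_selects` is true for 0x4800 through 0x48FF and false for 0x4900, and `lad_offset(0x7F00, 0x4812)` is 0x12.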
Inside the Chips
  [Block diagram of one PE: LAD, LADMux, RegFile, Reset, Clocks, two LAD-MemBridge blocks, two MemMux blocks, two memories (SomeMemory) and Your Application. Legend: Annapolis Provided / User Provided]
LAD-MUX
  Gives LAD access to components
    Bridges gap between IO pins and “Logical” LAD
  Handles protocols for you
    66 and 33
  ONE per chip
Reset
  Allows Host to RESET the chip
    Causes clocks to destabilize momentarily
    Causes chip to return to known init state
      (If you write your VHDL right)
      All Annapolis components are written right
Clocks
  Provides user access to
    All 4 clocks (or clock x2)
    When clocks are stable
      “DLL locked” signals
  Clocks on a Virtex use DLLs
    Delay-Locked Loop, not Dynamic Link Library
      Shame on you Windows users!
Register File
  Provides host access to a 1-D array of 32-bit registers
    Size must be a power of 2
  Can be used for:
    Ready – “The host says I can go now”
    Done – “Hey Host, I am done!”
    Small 32-bit IO – “The answer is 42!”
    Run time parameters – “Threshold is 63”
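A host-side Ready / Done handshake over a register file might look roughly like the sketch below. Everything named here is hypothetical: the register indices depend on your own VHDL, and the `read` / `write` function pointers stand in for whatever register access calls the Annapolis driver actually provides. A small software stand-in for the chip is included so the flow can be exercised without hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register layout; the real one is whatever your VHDL defines. */
enum { REG_READY = 0, REG_DONE = 1, REG_THRESHOLD = 2, REG_RESULT = 3, REG_COUNT = 4 };

/* Hypothetical accessors standing in for the driver's register read/write. */
typedef struct {
    uint32_t (*read)(void *ctx, unsigned index);
    void (*write)(void *ctx, unsigned index, uint32_t value);
    void *ctx;
} regfile_io;

/* Write a run-time parameter, raise Ready, poll Done, fetch the result.
   Returns false if Done was not seen within max_polls reads. */
bool run_once(const regfile_io *io, uint32_t threshold,
              unsigned long max_polls, uint32_t *result)
{
    io->write(io->ctx, REG_THRESHOLD, threshold);
    io->write(io->ctx, REG_READY, 1);
    for (unsigned long i = 0; i < max_polls; ++i) {
        if (io->read(io->ctx, REG_DONE) != 0) {
            *result = io->read(io->ctx, REG_RESULT);
            io->write(io->ctx, REG_READY, 0);
            return true;
        }
    }
    io->write(io->ctx, REG_READY, 0);
    return false;
}

/* Software stand-in for the chip: reports Done as soon as Ready is set
   and "computes" threshold + 1. Purely for trying the flow on a desk. */
uint32_t fake_regs[REG_COUNT];

uint32_t fake_read(void *ctx, unsigned index)
{
    (void)ctx;
    if (index == REG_DONE)
        return fake_regs[REG_READY] != 0;
    if (index == REG_RESULT)
        return fake_regs[REG_THRESHOLD] + 1;
    return fake_regs[index];
}

void fake_write(void *ctx, unsigned index, uint32_t value)
{
    (void)ctx;
    fake_regs[index] = value;
}

/* Runs the handshake against the stand-in; returns 0 on timeout. */
uint32_t demo_run(uint32_t threshold)
{
    regfile_io io = { fake_read, fake_write, 0 };
    uint32_t result = 0;
    return run_once(&io, threshold, 1000, &result) ? result : 0;
}
```

The poll limit is there so a chip that never raises Done does not hang the host forever.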
LAD to Mem Bridge
  Provides host with access to the memories
    Mezzanine or Local Memories
    2 kinds, 32 and 64
  Transfers happen in bursts
    256 DWORDS for 32 bit memories
    512 DWORDS for 64 bit memories
    (it’s all transparent to the user though)
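Since the bursts are transparent, the only reason to count them is to estimate transfer overhead. A minimal sketch, assuming a partial final burst still costs one whole burst (the slide does not say); the names are made up for this example.

```c
#include <stddef.h>

/* Burst sizes from the slide, in DWORDs. */
enum { BURST_DWORDS_32BIT = 256, BURST_DWORDS_64BIT = 512 };

/* Number of bursts needed to move `dwords` DWORDs, rounding up. */
size_t bursts_needed(size_t dwords, size_t burst_dwords)
{
    return (dwords + burst_dwords - 1) / burst_dwords;
}
```

For example, 262144 DWORDs (1 MB) through a 32 bit bridge works out to 1024 bursts.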
Memory-Mux
  Provides multiple clients with access to the memories
  Arbitrates between clients
    Priority
      Number of the client decides priority
      Maximum utilization
      Might starve some clients
    Fair
      Round Robin
      Wastes some cycles
      Each client gets 1/n
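The two arbitration policies can be modeled in C to see the trade-off. This is a behavioral sketch, not the Annapolis implementation: it assumes the lowest client number has the highest priority, and it reads “Fair” as a fixed rotation where each cycle belongs to one client and is wasted if that client is idle. Requests are passed as a bitmask, bit i set meaning client i wants the memory.

```c
/* Priority: the lowest-numbered requesting client wins (assumed ordering).
   Returns the granted client, or -1 if nobody is requesting. */
int arbitrate_priority(unsigned req_mask, int n_clients)
{
    for (int i = 0; i < n_clients; ++i) {
        if (req_mask & (1u << i))
            return i;
    }
    return -1;
}

/* Fair: cycle k belongs to client k mod n. If the owner is not requesting,
   the cycle is wasted (-1), so each client gets exactly 1/n of the cycles. */
int arbitrate_fair(unsigned req_mask, int n_clients, unsigned long cycle)
{
    int owner = (int)(cycle % (unsigned long)n_clients);
    return (req_mask & (1u << owner)) ? owner : -1;
}
```

With clients 0 and 2 requesting (mask 0x5) out of three, the priority version grants client 0 every cycle and client 2 starves, while the fair version grants 0, nobody, 2 in turn.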
Memory Access
  Address: of DWORD or QWORD
  Data_Out: To Memory
  Data_In: From Memory
  Write: Direction of Request
  Request: “I want memory”
  Acknowledge: “Okay!”
  Data_Valid: 4/5 Cycle Delayed Ack (See Bugs Later)
    32 bit Memories Only
  Low/High Enable: “This half is useful”
    64 bit Memories Only
  High/Low_Data_Valid: 4/5 Cycle Delayed (Ack & Low/High Enable)
    64 bit Memories Only
  [Timing diagrams: 32-bit Memory Read, 64-bit Memory Read, 32-bit Memory Write, 64-bit Memory Write]
Others - Useful
  RAM Blocks
    Host and client access to on-chip memories
    256 32-bit words
  Interrupts to host
  Systolic Buses
    2 36-bit busses between PE1 and PE2, top and bottom
    Bi-directional
    Tri-state
  PE0 Standard Buses
    2 2-bit busses between PE0 and PEx
    Bi-directional
    Tri-state
Others – Useless
  LED (there are 2 LEDs per chip)
    Red and Green
    Can’t see them…
  IO Card
    114 bit IO
    We don’t have one
  Test Pins
    18 bits
    No testing our board, please! =)
Software API
  Annapolis Supplied
    Driver Functions
      Open, Close, Set Clocks, DMA, Read, Write, Download Configurations, Interrupt, Readback, etc.
    Convenience Functions
      Interface code to the “LAD to Memory Bridges”
Open/Close
  Grabs the board exclusively
    Uses kernel mutex
    CAN do it in shared mode, but DON’T
  Can set LAD speed as well
    See “Bugs” later
Chip Configuration
  Programs a PE from a memory array containing the bitstream
    .x86 files
  Can de-program as well
    Why bother?
      As long as everyone “plays nice”
  BE CAREFUL WHAT YOU PROGRAM!
    If you program a PE with a bitstream that is corrupted, or not for the correct chip, or mangled in some way, you can release the magic smoke from the chips!
    $40,000 board!
Set Clock Speeds
  UClock speed
  MClock speed and PClock divisor
Register IO
  Reads/Writes to the LAD address space to communicate with anything plugged into a LAD MUX
    Reset
    Register Files
    Etc.
Memory IO
  For LAD to MEM Bridges
  Abstracts the IO
    Bursts, addressing, etc.
  Create Memory Objects
    Read/Write/Copy/Set
    Release
Others You Won’t Need
  Display (4 char LCD on the board)
  Interrupts
  Temperature / Power
  Readback / Singleshot
  DMA
  Versions / Hardware Config
  Etc.
Tools
  You write Host code (in C)
    Compile with gcc, etc.
    Link in the libraries and such
  You write Chip code (in VHDL)
    Simulate and verify with ModelSim
    Synthesize with Synplify
      Linux / Solaris / WinNT
    Place and Route with Xilinx Foundation tools
      WinNT / Linux (with wine)
ModelSim
  VHDL simulation tool
  Annapolis provides
    Host simulation components
    VHDL description of the WHOLE board
      LAD
      Memories (Local & Mezzanine)
      Busses
      Etc.
  You provide
    VHDL to run inside the chip
      (May contain Annapolis components as well)
  Talk to me if you want to use ModelSim to debug!
Synplify
  Synplicity Inc.
  Converts VHDL (or Verilog) into an EDIF
    EDIF = description of your program in terms of Virtex parts (4-input LUTs, flip-flops, RAM blocks, etc.)
  Fast
    1-30 minutes
Place and Route
  Maps to lower level components
  Lays them out
  Routes between them
  Slow
    10 minutes – 2 days
  Provides a bitstream (.bit file)
    Directly converted to .x86 for config
Paths & Environment
  Need environment variables and path additions
    Add this to the end of your .cshrc:
      source ~cs670/WildExamples/cshrc_additions
  If you use bash, sh, zsh, etc.
    You’re on your own!
    Look at the file, figure it out!
    OR
    Use csh or tcsh!
Examples
  ~cs670/WildExamples/csu_example
    Basic CSU-made example using only PE1
    Copies 1Mb from Left to Right Local Mem
  ~cs670/WildExamples/annap_example
    All the Annapolis supplied examples
    May need path adjusting, etc.
    Not meant to work as is
    Useful to get a feel for other stuff
Hints
  Timing
    Count MClock, and put it in a RegFile
    Cycles / Freq = Time
    Host timing is too coarse
  “Start / Stop” and “Working / Done”
    Use a RegFile – easier than Interrupts
      (Haven’t gotten them to work with LAD Mux)
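The timing hint is just Cycles / Freq = Time, applied to a cycle count read back from a register. A minimal host-side sketch, assuming the chip exposes a free-running 32-bit MClock counter through a register file (the function names are invented here):

```c
#include <stdint.h>

/* Cycles elapsed between two reads of a free-running 32-bit counter.
   Unsigned subtraction stays correct across a single wrap-around. */
uint32_t cycles_between(uint32_t start, uint32_t stop)
{
    return (uint32_t)(stop - start);
}

/* Cycles / Freq = Time, with the MClock frequency given in MHz. */
double cycles_to_seconds(uint32_t cycles, double mclock_mhz)
{
    return (double)cycles / (mclock_mhz * 1.0e6);
}
```

One thing to keep in mind: a 32-bit counter at 100MHz wraps after roughly 43 seconds, so longer runs need a wider counter or a second register.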
Manuals
  Ask Sanjay! =)
  1 copy of our HUGE Starfire / Wildstar manuals
    I have the original…
    You may use it near my desk…
    If it wanders from my cube: Broken Legs
HELP! Bugs? - “99% correct is 100% Wrong”
  1 – Reread your VHDL and host code
    Silly bugs are easy to make, and spot
  2 – Simulate it
    You can see the signals.
    It almost always agrees with the actual hardware
  3 – Simulate again
    No Really… Simulate it!
  4 – Look in the manuals
    Helpful sometimes…
  5 – rossc@cs.colostate.edu
BUGS!!!!
  Querying the LAD bus speed in host code will return 66MHz if the LAD Bus was *EVER* at 66MHz since last reboot… even if it is *CURRENTLY* at 33MHz!
    DON’T USE IT, EVER!
  The Data_Valid signals are WRONG!
    They appear to be delayed 5 cycles instead of 4 in the real code.
    They are correct in simulation.
    Use a 4 cycle delay on (Req and Ack) instead!
    Use the simulation to ensure your delayed signal matches
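The workaround is a 4-stage shift register clocked by MClock. On the chip this would be written in VHDL; the C below is only a behavioral model of that delay line, with made-up names, that could be used to sanity-check expected timing against a simulation trace.

```c
#include <stdbool.h>
#include <stdint.h>

enum { VALID_DELAY = 4 };  /* cycles between (Req and Ack) and valid data */

/* `pipe` holds the last VALID_DELAY samples of (req && ack); bit 0 is the
   newest. Call once per MClock cycle and keep the returned value. */
uint8_t valid_pipe_push(uint8_t pipe, bool req, bool ack)
{
    unsigned sample = (req && ack) ? 1u : 0u;
    return (uint8_t)(((unsigned)pipe << 1 | sample) & ((1u << VALID_DELAY) - 1u));
}

/* The sample pushed VALID_DELAY cycles ago. Read this at the start of a
   cycle, before pushing that cycle's own sample. */
bool valid_pipe_out(uint8_t pipe)
{
    return ((pipe >> (VALID_DELAY - 1)) & 1u) != 0;
}
```

Per cycle the order is: read `valid_pipe_out(pipe)`, then `pipe = valid_pipe_push(pipe, req, ack)`. A request acknowledged in cycle 0 then shows up as valid in cycle 4.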
Let’s Look at it!
  Lemme open emacs…
    VHDL
    Host Code
    Execution
    Simulation
      Little Wiggly Green Wires!