The Stratix™ 10 Highly Pipelined FPGA Architecture
David Lewis, Gordon Chiu, Jeffrey Chromczak, David Galloway, Ben Gamsa,
Valavan Manohararajah, Ian Milton, Tim Vanderhoek, John Van Dyken
Altera Corporation, 150 Bloor St. W., Suite 400, Toronto, Ont., Canada M5S 2X9
Abstract
This paper describes architectural enhancements in the Altera
Stratix™ 10 HyperFlex™ FPGA architecture, fabricated in the
Intel 14nm FinFET process. Stratix 10 includes ubiquitous flip-flops
in the routing to enable a high degree of pipelining. In
contrast to the earlier architectural exploration of pipelining in
pass-transistor based architectures, the direct drive routing fabric
in Stratix-style FPGAs enables an extremely low-cost pipeline
register. The presence of ubiquitous flip-flops simplifies circuit
retiming and improves performance. The availability of
predictable retiming affects all stages of the cluster, place and
route flow. Ubiquitous flip-flops require a low-cost clock
network with sufficient flexibility to enable pipelining of dozens
of clock domains. Different cost/performance tradeoffs in a
pipelined fabric, combined with the use of a 14nm process, lead to other
modifications to the routing fabric and the logic element. User
modification of the design enables even higher performance,
averaging 2.3X faster in a small set of designs.
Keywords
FPGA, logic module, routing
1. INTRODUCTION
This paper describes core logic architecture enhancements in the
Stratix™ 10 HyperFlex™ FPGA architecture. This device is
manufactured in a 14nm FinFET CMOS process [16], and offers
logic capacity of up to 5M equivalent 4-input LUTs. While Moore’s
law continues to offer density increases of approximately 2X per
generation, it also introduces new challenges for FPGA
architecture. Although RC delay per logical distance changes
slowly with process, the RC delay per physical distance
increases with process shrink, making it necessary for users to
pipeline designs that span increasing logical area. This increases
the demand for registers, as well as the importance of providing
high-speed, long-distance routing.
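The effect of pipelining a long route can be illustrated with a small calculation. The following is a minimal sketch, not taken from the paper: it assumes a route made of identical wire hops, evenly spaced pipeline registers, and no register setup or clock-to-output overhead. The function names and the delay values are hypothetical.

```python
import math


def max_stage_delay_ps(num_hops: int, hop_delay_ps: float, num_registers: int) -> float:
    """Worst-case stage delay when registers split a route into near-equal stages.

    Assumption: every hop has the same delay and register overhead is ignored.
    """
    stages = num_registers + 1
    hops_per_stage = math.ceil(num_hops / stages)
    return hops_per_stage * hop_delay_ps


def fmax_mhz(stage_delay_ps: float) -> float:
    """Clock frequency limit implied by the slowest stage (1 MHz period = 1e6 ps)."""
    return 1e6 / stage_delay_ps


# Illustrative numbers only: an 8-hop route with 250 ps per hop.
for regs in (0, 1, 3):
    delay = max_stage_delay_ps(num_hops=8, hop_delay_ps=250.0, num_registers=regs)
    print(f"{regs} registers: stage delay {delay} ps, fmax {fmax_mhz(delay):.0f} MHz")
```

Under these assumptions, each doubling of the stage count halves the worst stage delay, which is why a route that spans more logical area needs more registers to hold a given clock rate.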
Stratix 10 introduces a highly pipelined logic and routing fabric
to address these problems. The key innovation compared to
previous work on pipelined FPGAs is the introduction of a
pulse-latch-based register in every routing multiplexer. The shift from
pass-gate to direct-drive routing enables a low-cost flip-flop
embedded in the routing fabric, while introducing minimal
delay when it is not used. While the cost of providing a flip-flop
in every routing multiplexer is minor, other support not
considered in previous papers increases the cost and has
consequences for the rest of the fabric. Going beyond
previous pipelined fabrics, the Stratix 10 architecture
development included the exploration of clocking structures to
handle the dozens of clocks present in real customer designs.
The highly pipelined routing fabric also motivates changes to
the logic element structure. Flip-flop control signals such as
clear and clock enable affect retiming, and must be optimized
differently. The routing fabric is also affected by the different
properties of the 14nm process and the needs of a pipelined
architecture, and is modified for better performance.
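The idea of a register that is present in every routing multiplexer but can be bypassed may be easier to see in a behavioral model. The sketch below is an illustration under stated assumptions, not the Stratix 10 circuit: the class `RoutingMux`, its fields, and its methods are hypothetical names, and the pulse latch is abstracted as an ideal edge-triggered register with no timing.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RoutingMux:
    """Behavioral model of a routing multiplexer with an optional register."""

    select: int         # configuration: index of the input that drives the wire
    use_register: bool  # configuration: True = pipelined, False = bypassed
    _state: int = 0     # value currently held by the register (assumed reset to 0)

    def output(self, inputs: Sequence[int]) -> int:
        """Value driven onto the routing wire before the next clock edge."""
        if self.use_register:
            return self._state
        return inputs[self.select]

    def clock(self, inputs: Sequence[int]) -> None:
        """Capture the selected input on a clock edge; no effect when bypassed."""
        if self.use_register:
            self._state = inputs[self.select]


inputs = [0, 1, 0]
bypassed = RoutingMux(select=1, use_register=False)
pipelined = RoutingMux(select=1, use_register=True)

print(bypassed.output(inputs))   # combinational: follows the selected input
print(pipelined.output(inputs))  # registered: still holds the reset value
pipelined.clock(inputs)
print(pipelined.output(inputs))  # registered: new value visible after the edge
```

In this model the bypassed multiplexer behaves exactly like a plain multiplexer, which mirrors the text's point that the register should add minimal delay when it is not used.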
The remainder of the paper is organized as follows. First, in
Section 2, we give a brief overview of Stratix-style
architectures. Section 3 provides an overview of some previous
work in pipelined architectures, while Section 4 describes the
Stratix 10 pipelined fabric. Section 5 details the CAD flow used
for architecture exploration and in production tools. Section 6
describes pipelining experiments, and Section 7 describes modifications to
the logic element and routing. Section 8 provides a few
examples of designs modified to target Stratix 10 and discusses
the production tools that provide design modification advice.
Section 9 concludes the paper.
2. STRATIX™ ARCHITECTURE
To help understand the remainder of the paper, we provide a
background on Altera architectures. Stratix architectures use
logic elements (LEs) of different types arranged into logic array
blocks (LABs). Each LAB contains some number of LEs, which
in the case of Stratix II and later are adaptive logic modules
(ALMs). The term LAB in this paper also refers to the
programmable routing fabric associated with each group of LEs,
so throughout this paper, a Stratix LAB means 10 ALMs and
associated inter-LAB and intra-LAB routing.
Figure 1 shows LABs and other embedded blocks such as
memories and DSP blocks arranged in a row-column fashion,
conceptually between the horizontal and vertical routing wires;
in reality the wires pass over the LABs. Each block in the array
can communicate with the inter-LAB routing on three of the
four logical sides of the block. Each block output can drive onto
either of the adjacent vertical routing channels and one
horizontal routing channel, and the block inputs can receive
signals from any of those three channels [11]. Figure 2
illustrates that routing wires are driven by multiplexers called
driver input muxes (DIMs). DIMs select signals from other
routing wires, to implement stitching and corner turning, as well
as from the outputs of the LAB. Inputs to the LAB are provided
by LAB input muxes (LIMs), which select from the nearby
routing wires and LAB outputs, and drive the LAB lines. Inputs
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. FPGA'16, February 21-23, 2016, Monterey, CA, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-3856-1/16/02…$15.00 DOI: http://dx.doi.org/10.1145/2847263.2847267