Thesis title: A novel dataflow programming model bridging network and computation in FPGA-based accelerators
One of the main obstacles to the delivery of exascale (and beyond) computing systems is the dominant computing model of today, i.e. the one defined by von Neumann in the late 1940s, in which a program is stored in a memory and a clock sequences the instructions that move data elements in and out of memory and devices. Processors have advanced in terms of the number of operations that can be issued per clock cycle and the order in which they are issued, but fundamentally the model has not changed since it was first conceived.
With respect to power consumption, however, the flexibility of this model creates a significant overhead for each operation, due to the cost of instruction processing and the need to store intermediate values in a memory hierarchy between operations. Holding intermediate values in memory adds roughly a ten-fold overhead to the energy cost of an operation, on top of the cost of its instruction-based scheduling. Moreover, moving intermediate values through the memory hierarchy, even when cached, turns data movement into a bottleneck, and since HPC applications can easily exceed the capacity of a cache, an additional hundred-fold energy overhead per operation is paid whenever the resulting value has to travel through off-chip memory.
The dataflow model of computation differs from the von Neumann machine in that the elements of data flow through a sequence of operations, with intermediate values able to move directly between them. Today, at the meta-level, dataflow applications are defined in terms of task-level operators, with the individual tasks implemented using the traditional model. The FPGA provides a processing fabric on which dataflow applications can stream data at the level of single DSP operations, with intermediate values either moving directly through a pipeline or buffered, typically in FIFOs, in the efficient on-chip memories. Since FPGAs are also reconfigurable, a level of flexibility can be brought back into the application, so that different sequences of operations can be defined. Moreover, in addition to pipelining a single sequence of operators, multiple flows can be executed in parallel. Hardware description languages are well suited to specifying such applications; their complexity, however, is beyond the reach of today's HPC application software programmers, not to mention scientific HPC users. My aim is to prototype a High Level Synthesis (HLS) based paradigm enabling the design and deployment of multi-FPGA applications by defining a set of intercommunicating processing tasks in the C/C++ languages.
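As a minimal sketch of what such a paradigm looks like from the programmer's point of view (assuming a Vivado/Vitis HLS-style toolchain; the task names and the trivial computation are purely illustrative), two tasks connected by an on-chip FIFO can be expressed as plain C++ functions exchanging data through hls::stream objects:

\begin{lstlisting}[language=C++]
#include "hls_stream.h"

// Producer task: scales each input sample and forwards it downstream.
static void scale(hls::stream<int> &in, hls::stream<int> &mid, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        mid.write(in.read() * 2);
    }
}

// Consumer task: folds the stream into a running sum.
static void accumulate(hls::stream<int> &mid, hls::stream<int> &out, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        acc += mid.read();
    }
    out.write(acc);
}

// Top-level kernel: the DATAFLOW pragma lets the two tasks run
// concurrently, with intermediate values held in an on-chip FIFO
// instead of travelling through the external memory hierarchy.
void kernel(hls::stream<int> &in, hls::stream<int> &out, int n) {
#pragma HLS DATAFLOW
    hls::stream<int> mid("mid");
#pragma HLS STREAM variable=mid depth=16
    scale(in, mid, n);
    accumulate(mid, out, n);
}
\end{lstlisting}

The same source also doubles as a functional C++ model that can be compiled and simulated in software, which is precisely what makes this style of design accessible to software developers.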
The use of FPGAs as accelerators is becoming so widespread that even the big cloud providers now install reconfigurable devices in their instances (e.g. on Microsoft Azure and Amazon EC2). The interaction of hundreds to thousands of FPGAs requires a scalable approach to hold them together, allowing low-latency connections among them; a definitive approach has yet to be found, however, that lets users make the most of their flexibility while at the same time easing their use for software developers.
As an example, the latest version of the Microsoft Catapult fabric puts a Stratix 10 device between the NIC of each x86 server and the top-of-rack (ToR) switch, enabling a fast path over which accelerators can communicate with each other with a latency of a few microseconds. The Brainwave project leverages this architecture to provide a deep learning platform for real-time AI inference in the cloud. While this framework offers a very friendly interface for users to deploy their models on top of this architecture, it gives up flexibility by delivering the cores as black boxes, providing implementations of only a few pre-trained models.
Our approach, on the other hand, gives users full control of the platform, allowing the implementation of custom acceleration cores while maintaining ease of use through a set of interfaces that integrate with the HLS tools developed in the project. This will allow HPC application developers to define a scalable application according to a streaming programming model (the Kahn Process Network, KPN) that can be efficiently deployed on a multi-FPGA system.
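To recall the semantics this model prescribes, a KPN is a set of deterministic sequential processes connected by conceptually unbounded FIFO channels, where reads block until a token is available and writes never block. The sketch below (an ordinary host-side C++ emulation for illustration, not the FPGA implementation) shows why the result of such a network does not depend on how its processes are scheduled:

\begin{lstlisting}[language=C++]
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// FIFO channel with KPN semantics: write() never blocks (the queue is
// conceptually unbounded), read() blocks until a token is available.
template <typename T>
class Channel {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void write(T v) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    T read() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty(); });
        T v = std::move(q.front());
        q.pop();
        return v;
    }
};

int main() {
    Channel<int> a, b;
    const int n = 10;
    // Each process is sequential and communicates only via channels.
    std::thread source([&] { for (int i = 0; i < n; ++i) a.write(i); });
    std::thread square([&] {
        for (int i = 0; i < n; ++i) { int x = a.read(); b.write(x * x); }
    });
    std::thread sink([&] {
        for (int i = 0; i < n; ++i) std::cout << b.read() << '\n';
    });
    source.join(); square.join(); sink.join();
    return 0;
}
\end{lstlisting}

Because a process can only block on a read from a specific channel, the sequence of tokens on every channel, and hence the output, is fully determined by the program regardless of the relative speed of the processes; this determinism is what makes a KPN straightforward to map onto tasks spread across multiple FPGAs.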
An important goal of this work is the development of a communication IP and its software stack, implementing a direct network that allows low-latency communication between accelerated tasks deployed on different FPGAs, possibly hosted in different nodes. The communication IP will be based on the ExaNet IPs (switch, router, high-speed channels) developed in the ExaNeSt H2020 project. Direct communication between FPGAs avoids involving the CPUs and the system bus resources in the data transfers, improving the overall energy efficiency of the platform.
The availability of an FPGA-based network stack with low and deterministic latency is also a fundamental requirement for the class of real-time stream processing applications that characterize High Energy Physics (HEP) experiments. In that context, Trigger and Data Acquisition (TDAQ) systems must cope with many raw data streams coming from the detectors while offering a reliable, real-time response, leveraging the inherent determinism provided by FPGA-implemented data transport and processing tasks.
An example of this approach is given by the INFN NaNet project, in which an FPGA device gathers and processes multiple data streams arriving from the detectors of the NA62 HEP experiment on one of its 10GbE ports, and forwards them through the PCI Express system bus to a many-core GPU system for further processing.
During my PhD, most of the work was done in the context of the \exanest and \nanet projects, whose needs heavily influenced and inspired the streaming paradigm presented in this dissertation. For this reason, the first two chapters are dedicated to the description of these projects, to which I contributed during these years, and of the innovative ideas they carry. This will serve to depict the experimental setting and to establish the reference parameters used to compare and evaluate the results.
My research activity was carried out at the APE Lab of the Istituto Nazionale di Fisica Nucleare (INFN), which has long been active in the development of custom parallel machines for computational physics simulations (APE, APE100, APEmille and apeNEXT). Leveraging the know-how acquired with torus interconnects, the group has since shifted its activities to custom FPGA-based Network Interface Cards (NICs) and to the design of heterogeneous architectures.