PAOLO CRETARO

PhD Graduate

PhD program: XXXII



Thesis title: A novel dataflow programming model bridging network and computation in FPGA-based accelerators

One of the main obstacles to delivering exascale (and beyond) computing systems is today's dominant computing model, the one defined by von Neumann in the late 1940s: a program is stored in memory, and a clock sequences instructions that move data elements in and out of memory and devices. Processors have advanced in the number of operations that can be issued per clock cycle and in the order in which they are issued, but the model has fundamentally not changed since it was first conceived. With respect to power consumption, however, the flexibility of this model creates a significant overhead for each operation, due to the cost of instruction processing and the need to store intermediate values in a memory hierarchy between operations. The need to hold intermediate values in memory generates a 10-fold overhead in the energy cost of an operation compared with the cost of supporting the instruction-based scheduling of the operator. Similarly, the model's need to move intermediate values through a memory hierarchy, even if cached, introduces a bottleneck on memory movement; since HPC applications can easily exceed the capacity of a cache, an additional 100-fold energy overhead per operator is incurred if the resulting value needs to move through off-chip memory. The dataflow model of computation differs from the von Neumann machine in that data elements flow through a sequence of operations, with intermediate values able to move directly between them. Today, at the meta-level, dataflow applications are defined with task-level operators, with the tasks themselves implemented using the traditional model. The FPGA provides a processing fabric on which dataflow applications can flow data at the DSP operation level, with intermediate values either moving directly through a pipeline or stored, typically as a FIFO, in the efficient on-chip memories.
Since FPGAs are also reconfigurable, a level of flexibility can be brought back into the application, so that different sequences of operations can be defined. In addition to directly pipelining a single sequence of operators, multiple flows can be executed in parallel. Hardware description languages are ideal for specifying such applications; however, their complexity is beyond the reach of today's HPC application programmers, not to mention scientific HPC users. My aim is to prototype a High-Level Synthesis (HLS) based paradigm enabling the design and deployment of multi-FPGA applications by defining a set of intercommunicating processing tasks in C/C++. The use of FPGAs as accelerators is becoming so widespread that even large cloud providers now install reconfigurable devices in their instances (e.g. on Microsoft Azure and Amazon EC2). The interaction of hundreds to thousands of FPGAs requires a scalable approach to hold them together with low-latency connections among them, but a definitive approach has yet to be found that lets users make the most of their flexibility while easing their use for software developers. As an example, the latest version of the Microsoft Catapult fabric puts a Stratix 10 device between each NIC on the x86 servers and the ToR switch, enabling a fast path for accelerators to communicate among themselves with a latency of a few microseconds. The Brainwave project leverages this architecture to provide a deep learning platform for real-time AI inference in the cloud. While this framework offers a very friendly interface for users to deploy their models on top of this architecture, it loses flexibility by delivering the cores as black boxes and providing an implementation of only a few pre-trained models.
Our approach, on the other hand, lets users have full control of the platform, allowing the implementation of custom acceleration cores while maintaining ease of use by supplying a set of interfaces to integrate with the HLS tools developed in the project. This will allow HPC application developers to define a scalable application according to a streaming programming model (Kahn Process Network, KPN) that can be efficiently deployed on a multi-FPGA system. An important goal of this work is the development of a communication IP and its software stack, providing the implementation of a direct network that will allow low-latency communication between FPGA-accelerated tasks deployed on different FPGAs, possibly hosted in different nodes. The communication IP will be based on the ExaNet IPs (switch, router, high-speed channels) developed in the ExaNeSt H2020 project. Direct communication between FPGAs avoids involving the CPUs and system bus resources in the data transfers, improving the overall energy efficiency of the platform. The availability of a low and deterministic latency FPGA-based network stack is also a fundamental requirement for the class of real-time stream processing applications that characterize High Energy Physics (HEP) experiments. In that context, Trigger and Data Acquisition (DAQ) systems must cope with many raw data streams coming from the detectors while offering a reliable, real-time response, leveraging the inherent determinism provided by FPGA-implemented data transport and processing tasks. An example of this approach is given by the INFN NaNet project, where an FPGA device gathers and processes multiple data streams arriving from the detectors of the NA62 HEP experiment at one of its 10GbE ports, and forwards them to a many-core GPU system for further processing through the PCI Express system bus.
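The streaming model mentioned above (a Kahn Process Network) can be illustrated with a small software-only sketch in C++. It assumes three processes connected by bounded FIFOs and a simple round-robin scheduler; the names `Fifo`, `Process` and `run_pipeline` are hypothetical and chosen for illustration only. They are not the interfaces or HLS tools developed in the thesis, and a real deployment would map each process to FPGA logic rather than to a step function.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Bounded FIFO channel between two processes.
// Hypothetical helper for illustration; not an HLS stream type.
template <typename T>
class Fifo {
public:
    explicit Fifo(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }
    bool can_read() const { return !q_.empty(); }
    bool can_write() const { return q_.size() < capacity_; }
    T read() {
        assert(can_read());
        T v = q_.front();
        q_.pop_front();
        return v;
    }
    void write(const T& v) {
        assert(can_write());
        q_.push_back(v);
    }

private:
    std::size_t capacity_;
    std::deque<T> q_;
};

// A process is modelled as a step function: it fires once if its input
// holds a token and its output has room, and reports whether it fired.
using Process = std::function<bool()>;

// Builds a source -> scaler -> sink network and runs it to completion.
std::vector<int> run_pipeline(int n) {
    Fifo<int> a(2), b(2);  // small capacities, like on-chip FIFOs
    std::vector<int> out;
    int next = 1;

    std::vector<Process> network = {
        // Source: emits the tokens 1..n.
        [&] {
            if (next > n || !a.can_write()) return false;
            a.write(next++);
            return true;
        },
        // Scaler: doubles each token it receives.
        [&] {
            if (!a.can_read() || !b.can_write()) return false;
            b.write(a.read() * 2);
            return true;
        },
        // Sink: collects the results in arrival order.
        [&] {
            if (!b.can_read()) return false;
            out.push_back(b.read());
            return true;
        },
    };

    // Round-robin scheduler: keep sweeping until no process can fire.
    bool progress = true;
    while (progress) {
        progress = false;
        for (auto& p : network) progress = p() || progress;
    }
    return out;
}
```

Calling `run_pipeline(3)` yields the tokens 2, 4, 6 in order. Because each process only communicates through its FIFOs, the result does not depend on the order in which the scheduler visits the processes, which is the property that makes this model attractive for distributing tasks across several devices.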
During my PhD, most of the work was done in the context of the ExaNeSt and NaNet projects, whose needs heavily influenced and inspired the streaming paradigm that will be presented in this dissertation. For this reason, the first two chapters will be dedicated to describing these projects, to which I contributed during these years, and the innovative ideas they carry. This will serve to depict the experimental setting and to state reference parameters for comparing and evaluating the results. My research activity was carried out at the APE Lab of Istituto Nazionale di Fisica Nucleare (INFN), which has been active in the development of custom parallel machines for computational physics simulations (APE, APE100, APEmille and apeNEXT). Leveraging the know-how acquired with torus interconnects, the group has since shifted its activities to custom FPGA-based Network Interface Cards (NICs) and to the design of heterogeneous architectures.


© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma