Training large deep learning models is challenging because of the high communication overheads that distributed training entails. Embracing the recent development of programmable network devices, this talk describes our efforts to rein in distributed deep learning's communication bottlenecks and offers an agenda for future work in this area. We demonstrate that an in-network aggregation primitive can accelerate distributed DL workloads and can be implemented using modern programmable network devices. We discuss various designs for streaming aggregation and in-network data processing that lower memory requirements and exploit sparsity to maximize effective bandwidth use. We also touch on gradient compression methods, which reduce communication volume and adapt to dynamic network conditions. Lastly, we consider how to continue our research in light of the enormous cost of training large models at scale, which makes it hard for researchers to tackle this problem area. We will describe our ongoing work on a new approach that emulates DL workloads using a fraction of the resources they would otherwise require.
24/06/2024
When: June 24th 2024, 11:00
Where: Aula 1, via del Castro Laurenziano 7a
Bio: Marco does not know what the next big thing will be. He asked ChatGPT, though the answer was underwhelming. But he's sure that our future next-gen computing and networking infrastructure must be a viable platform for it. Marco's research spans a number of areas in computer systems, including distributed systems, large-scale/cloud computing, and computer networking with an emphasis on programmable networks. His current focus is on designing better systems support for AI/ML and providing practical implementations deployable in the real world. Marco is an Associate Professor of Computer Science at KAUST. He obtained his Ph.D. in computer science and engineering from the University of Genoa in 2009, after spending the last year as a visiting student at the University of Cambridge. He was a postdoctoral researcher at EPFL and a senior research scientist at Deutsche Telekom Innovation Labs & TU Berlin. Before joining KAUST, he was an assistant professor at UCLouvain. He has also held positions at Intel, Microsoft and Google.