SILVIO SEVERINO

PhD Graduate

PhD program: XXXVII


Co-supervisor: Prof. Emanuele Rodolà

Thesis title: Breaking Sequential Barriers: Parallel Decoding Methods for Efficient Transformers

Transformer-based language models have achieved remarkable capabilities in Natural Language Processing, yet their inference remains constrained by autoregressive decoding, where each token must be generated sequentially. This sequential dependency creates a fundamental bottleneck that scales linearly with output length and underutilizes parallel hardware, resulting in high latency and increased operational costs. While existing approaches such as Non-Autoregressive Translation models offer parallelization, they typically require extensive retraining and architectural modifications, and often sacrifice output quality.

This thesis addresses the inference-efficiency challenge from a novel perspective: greedy autoregressive decoding is reframed as a triangular system of nonlinear equations solvable via parallel fixed-point iteration. We introduce a family of training-free parallel decoding algorithms (Parallel Jacobi, Parallel Gauss-Seidel-Jacobi, and Hybrid Gauss-Seidel-Jacobi) that update multiple tokens in parallel while mathematically guaranteeing convergence to the exact greedy output. These algorithms are model-agnostic and require no modifications to pretrained transformers, making them immediately deployable in production systems. Our investigation reveals that initialization is the principal lever controlling convergence speed in parallel fixed-point decoding. We therefore develop Kickstart Decoding, a framework that seeds the solver with informative draft translations from lightweight sources, including quantized models, word-by-word translation, and student-teacher architectures. Initialization-aware execution strategies further reduce computational overhead by freezing stable positions and concentrating updates on uncertain tokens.

Evaluated extensively on Machine Translation benchmarks spanning high- and low-resource language pairs, our methods achieve consistent speedups of two to three times over greedy autoregressive decoding on standard CPU hardware, while maintaining identical translation quality across BLEU, ChrF, and COMET metrics. Under quantized-model initialization, speedups reach 2.4-2.8x on WMT14 and WMT16, scaling to 3x with increased parallel resources. Analysis shows that initialization accuracy directly determines convergence speed, with the iteration count decreasing superlinearly once the correct prefix covers more than 80% of the sequence. These contributions establish parallel fixed-point decoding as a practical, quality-preserving alternative to sequential generation that requires neither retraining nor architectural changes, offering immediate efficiency gains for researchers and practitioners with limited computational budgets while reducing the environmental footprint of language technology deployment.
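To make the reformulation concrete: for a source x and target length m, greedy decoding solves the triangular system y_i = argmax_w p(w | y_1, ..., y_{i-1}, x) for i = 1, ..., m, and a Jacobi iteration updates all m positions in parallel from the previous iterate until a fixed point is reached. Below is a minimal sketch of such a Jacobi-style update, assuming a Hugging Face-style encoder-decoder translation model; the function name jacobi_decode, the pad-token initialization, and the fixed target length are illustrative assumptions, not the thesis's exact implementation (the Kickstart variants would replace the pad initialization with a draft translation, and EOS stopping is omitted).

    import torch

    @torch.no_grad()
    def jacobi_decode(model, input_ids, max_len, start_id, pad_id):
        # Encode the source once; only the (parallel) decoder passes are repeated.
        enc = model.get_encoder()(input_ids=input_ids)
        # Initialize the whole target block; a more informative draft converges faster.
        y = torch.full((input_ids.size(0), max_len), pad_id,
                       dtype=torch.long, device=input_ids.device)
        y[:, 0] = start_id  # the decoder start token stays fixed
        for _ in range(max_len):  # worst case: one token resolved per iteration (= greedy cost)
            logits = model(encoder_outputs=enc, decoder_input_ids=y).logits
            proposal = logits.argmax(dim=-1)                # greedy update of all positions at once
            y_new = torch.roll(proposal, shifts=1, dims=1)  # position i is predicted from prefix < i
            y_new[:, 0] = start_id                          # re-pin the start token
            if torch.equal(y_new, y):                       # fixed point: identical to greedy output
                return y
            y = y_new
        return y

Because the system is triangular, each iteration fixes at least the first still-incorrect position, so the loop provably terminates within max_len iterations and returns exactly the greedy sequence; in practice many positions stabilize per iteration, which is where the measured speedups come from.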

Research products

11573/1690633 - 2024 - Sparse Vicious Attacks on Graph Neural Networks
Trappolini, Giovanni; Maiorca, Valentino; Severino, Silvio; Rodolà, Emanuele; Silvestri, Fabrizio; Tolomei, Gabriele - 01a Journal article
journal: IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE (Piscataway, NJ: IEEE), pp. 2293-2303 - issn: 2691-4581 - wos: (0) - scopus: 2-s2.0-85173066368 (5)

11573/1706544 - 2023 - Accelerating Transformer Inference for Translation via Parallel Decoding
Santilli, Andrea; Severino, Silvio; Postolache, Emilian; Maiorca, Valentino; Mancusi, Michele; Marin, Riccardo; Rodolà, Emanuele - 04b Conference paper in proceedings
conference: The 61st Annual Meeting of the Association for Computational Linguistics (Toronto, Canada)
book: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
