Thesis title: From detailed acoustic analysis to AI: designing and developing advanced speech analysis tools
The modernization of the xkl software, originally developed by Dennis Klatt at MIT in the 1980s, was a major goal of this research work. The introduction of a new Graphical User Interface (GUI), built with the GTK libraries, simplified the installation process and, most importantly, made the software accessible and user-friendly on various platforms, including Windows, Linux, and macOS. The xkl refurbishment also added the so-called reassigned spectrogram to the spectrum processing tools, allowing a more detailed examination of speech spectra. In the current xkl version, formant values are automatically saved in a text file, which facilitates large-scale analysis, especially for vowel studies. The development of a modern xkl speech analysis tool
was part of the LaMIT project [1], which aims to apply Stevens' lexical access model [2] to the Italian language. One major innovation introduced by Stevens is the concept of the landmark: a privileged region in the time domain at which a primary phase of the perceptual process is assumed to take place. In this work, an automatic vowel landmark detector was developed and implemented as a Convolutional Neural Network combined with a Recurrent Neural Network, i.e. a CNN-RNN hybrid model. The CNN-RNN recognizer used a set of parameters that combined energy measurements and Mel-spectrum descriptors. The recognizer
was tested on sentences of the LaMIT database [3], a corpus of 800 spoken utterances (4 native Italian speakers) that were manually analyzed by examining the corresponding speech waveforms and, most importantly, by using the xkl tool, which provided invaluable information on general and time-varying spectral properties. Thanks to this analysis, the corpus was manually labeled and now contains information about landmark presence, landmark type, and landmark position in time. The recognizer output, an estimate of the detected vowel landmarks, was compared against the manually labeled vowel landmarks. The overall recognition rate was 74.98%. For individual speakers the recognition rate ranged
from about 72% to about 77%. Artificial intelligence methods were also applied to automatic foreign accent identification [4]. A Multi-Kernel Extreme Learning Machine (MKELM) model, along with a weighted scheme, was proposed for the recognition of 5 different accents (Arabic, Chinese, Korean, Spanish, French) in American English. The recognition was based on Mel-frequency cepstral coefficients (MFCC) and prosodic features (pitch, energy). The proposed model achieved an accuracy of 84.72% using a paired weighting scheme. In contrast, the accuracy dropped to 66.5% when employing the traditional non-weighted multi-classification scheme. A comparison against other state-of-the-art classification methods showed significant
advantages of the proposed model.
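
As an illustration of how detected vowel landmarks might be scored against manual labels, the following is a minimal sketch and not the evaluation procedure used in the thesis. The function name `recognition_rate`, the 50 ms tolerance window, and the greedy one-to-one matching are all assumptions introduced here for illustration; the abstract does not state how a detection was counted as correct.

```python
def recognition_rate(detected, reference, tolerance=0.05):
    """Return the fraction of reference landmarks matched by a detection.

    detected, reference: landmark times in seconds.
    tolerance: maximum time difference (seconds) for a match.
    The 0.05 s default is an assumed value, not taken from the thesis.
    """
    if not reference:
        return 0.0
    unused = sorted(detected)
    matched = 0
    for ref_time in sorted(reference):
        # Find the closest still-unused detection within the tolerance.
        best_index = None
        for i, det_time in enumerate(unused):
            distance = abs(det_time - ref_time)
            if distance > tolerance:
                continue
            if best_index is None or distance < abs(unused[best_index] - ref_time):
                best_index = i
        if best_index is not None:
            matched += 1
            # Each detection may match at most one reference landmark.
            unused.pop(best_index)
    return matched / len(reference)


# Hypothetical landmark times for one utterance (seconds).
manual_labels = [0.10, 0.50, 0.90, 1.30]
detector_output = [0.12, 0.52, 1.00, 1.31]
rate = recognition_rate(detector_output, manual_labels)
```

Under these assumptions three of the four manual landmarks are matched, so `rate` is 0.75. Greedy matching is the simplest choice; a stricter evaluation could also penalize spurious detections.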