ELEONORA AMMATURO

PhD graduate

cycle: XXXVII


supervisor: Domenico Vitulano
co-supervisor: Vittoria Bruni

Thesis title: SAPIENT: Semantic and Automatic Processing of Information about Environment

The aim of the project, carried out in collaboration with the company Expert.ai, is to implement a new system called Sapient, which stands for ‘Semantic and Automatic Processing of Information about Environment’. The project grew out of the need to process large numbers of complex documents, i.e. documents whose information is spread across multiple kinds of objects such as graphs and tables. To this end, complex documents must be segmented into homogeneous areas, i.e. the different parts that make up a document must be identified together with their locations. This system is part of a broader one that recognises the characters of a document, known as Optical Character Recognition (OCR). Analysing the structure of a document by classifying it into its components (title, figures, tables, main text, etc.) is of great importance and is the main objective of the project. In the literature this topic is known as Document Layout Analysis (DLA). The project belongs to the area of computer vision, and specifically pattern recognition, inasmuch as documents are generally in PDF format and are thus closer to images than to text documents. For the purposes of the system, the objective is not only to classify and locate the components of a document, but also to segment each component so that it can be extracted in an orderly manner. Semantic segmentation therefore appears to be the most suitable approach. The problem is not just one of object detection, i.e. the mere identification and localisation of the document components within an image, but also requires classifying the image pixel by pixel. The classification pipeline is initially divided into two consecutive steps: layout analysis and text-only analysis.
For the first step, an end-to-end Convolutional Neural Network (CNN) implementing dilated convolution is used, while for the second step an end-to-end multi-scale CNN is used; a heuristic within the framework of mathematical morphology is also defined for the same purpose. Finally, the simultaneous segmentation of all classes is achieved by means of another end-to-end CNN model. The final classification allows both the text and the non-text parts to be segmented, giving a final breakdown of the document into: all text parts, tables and images for the non-text components; and title, authors, abstract, paragraphs and their titles, header, footer, notes, captions and lists for the segmentation of the text alone. The same classes are found in the simultaneous segmentation of text and non-text components. A comparison with the extensive literature available shows how this system provides an alternative overall model for DLA.
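
To make the dilated convolution mentioned above concrete, the following is a minimal pure-Python sketch of the operation itself. It is not the network used in Sapient, whose architecture is not described here. The function names `dilated_conv2d` and `receptive_field` are hypothetical, and the sketch assumes a single channel, no padding and stride 1. As in most CNN libraries, the kernel is applied as a cross-correlation (it is not flipped).

```python
from typing import List

Matrix = List[List[float]]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Side length covered by a kernel once gaps are inserted between its taps."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv2d(image: Matrix, kernel: Matrix, dilation: int = 1) -> Matrix:
    """Single-channel 2-D dilated convolution (no padding, stride 1).

    Illustrative assumption: the kernel is applied without flipping,
    i.e. as a cross-correlation.
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    k_h, k_w = len(kernel), len(kernel[0])
    span_h = receptive_field(k_h, dilation)
    span_w = receptive_field(k_w, dilation)
    out_h = len(image) - span_h + 1
    out_w = len(image[0]) - span_w + 1
    if out_h < 1 or out_w < 1:
        raise ValueError("image is smaller than the dilated kernel")

    output = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0.0
            for i in range(k_h):
                for j in range(k_w):
                    # Kernel taps are spaced `dilation` pixels apart.
                    acc += kernel[i][j] * image[y + i * dilation][x + j * dilation]
            output[y][x] = acc
    return output


# Toy input: a 7x7 "page" of ones and a 3x3 kernel of ones.
page = [[1.0] * 7 for _ in range(7)]
box = [[1.0] * 3 for _ in range(3)]
dense = dilated_conv2d(page, box, dilation=1)   # 3x3 window per output pixel
sparse = dilated_conv2d(page, box, dilation=2)  # 5x5 window, still 9 weights
```

With `dilation=2` the same nine weights cover a 5x5 window instead of a 3x3 one, so each output pixel sees more context without extra parameters. This enlarged receptive field is a common reason for using dilated convolutions in pixel-wise segmentation.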

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma