Thesis title: SAPIENT: Semantic and Automatic Processing of Information about Environment
The aim of the project, carried out in collaboration with the company Expert.ai, is to implement a
new system called Sapient, which stands for ‘Semantic and Automatic Processing of Information about
Environment’.
The project grew out of the need to process a large number of complex documents, i.e. those
containing information in multiple kinds of objects such as graphs and tables. For this purpose, it
is necessary to segment complex documents into homogeneous areas, i.e. to identify the different
parts that make up a document together with their locations.
This system is part of a broader one that recognises the characters of a document, a task known
as Optical Character Recognition (OCR). Analysing the structure of a document by classifying its
components, such as title, figures, tables and main text, is of great importance and is the main
objective of the project. In the literature this topic is known as Document Layout Analysis
(DLA).
This project operates in the area of computer vision, and specifically of pattern recognition,
inasmuch as documents are generally in PDF format and are thus closer to an image than to a text
document. For the purposes of the system, the objective is not only to classify and locate the
components of a document, but also to segment each component so that it can be extracted in an
orderly manner. Semantic segmentation therefore appears to be the most suitable approach. The task
is not just an object detection problem, i.e. the mere identification and localisation of the
document components within an image, but also requires classifying the image pixel by pixel.
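The difference between localising a component and classifying every pixel can be illustrated with a minimal Python sketch. The class ids, the toy page and the helper names below are assumptions made for illustration only; they are not taken from the Sapient system.

```python
from typing import Dict, List, Tuple

# Hypothetical class ids; the actual label encoding is not specified in the text.
BACKGROUND, TEXT, TABLE = 0, 1, 2

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def bounding_boxes(label_map: List[List[int]]) -> Dict[int, Box]:
    """Collapse a per-pixel label map into one bounding box per class."""
    boxes: Dict[int, Box] = {}
    for r, row in enumerate(label_map):
        for c, label in enumerate(row):
            if label == BACKGROUND:
                continue
            if label not in boxes:
                boxes[label] = (r, c, r, c)
            else:
                top, left, bottom, right = boxes[label]
                boxes[label] = (min(top, r), min(left, c),
                                max(bottom, r), max(right, c))
    return boxes


def pixel_count(label_map: List[List[int]], label: int) -> int:
    """Number of pixels assigned to a class in the label map."""
    return sum(row.count(label) for row in label_map)


def box_area(box: Box) -> int:
    """Area in pixels of an inclusive bounding box."""
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)


# Toy 4x6 "page": the text wraps around a table in an L shape.
page = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 2, 2, 2, 0],
    [1, 1, 2, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
]
boxes = bounding_boxes(page)
```

In this toy page the text box covers 18 pixels and fully contains the table, while the label map assigns only 10 pixels to text. The pixel-wise map is what allows each component to be extracted without including its neighbours.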
The classification pipeline is initially divided into two consecutive steps: layout analysis and
text-only analysis. For the first step, an end-to-end Convolutional Neural Network (CNN)
implementing dilated convolution is used, while for the second step an end-to-end multi-scale CNN
is used; a heuristic within the framework of mathematical morphology is also defined for the same
purpose. Finally, the segmentation of all classes simultaneously is achieved by means of another
end-to-end CNN model.
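For readers unfamiliar with dilated convolution, the operation can be sketched in plain Python as follows. This is a generic illustration of the technique on a single channel, without padding or stride; it is not the network used in the project, and the function name and toy input are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def dilated_conv2d(image: Matrix, kernel: Matrix, dilation: int = 1) -> Matrix:
    """Single-channel 2D dilated convolution (cross-correlation, as in CNNs).

    Kernel taps are spaced `dilation` pixels apart, so the receptive field
    grows without adding weights. No padding is applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    # Extent covered by the kernel once gaps are inserted between its taps.
    eff_h = (kh - 1) * dilation + 1
    eff_w = (kw - 1) * dilation + 1
    out_h = len(image) - eff_h + 1
    out_w = len(image[0]) - eff_w + 1
    if out_h <= 0 or out_w <= 0:
        raise ValueError("image is smaller than the dilated kernel")

    out: Matrix = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    acc += kernel[u][v] * image[i + u * dilation][j + v * dilation]
            row.append(acc)
        out.append(row)
    return out


# Toy 5x5 input whose pixel value is row * 5 + column.
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
ones = [[1.0] * 3 for _ in range(3)]

dense = dilated_conv2d(image, ones, dilation=1)   # 3x3 output, 3x3 receptive field
sparse = dilated_conv2d(image, ones, dilation=2)  # 1x1 output, 5x5 receptive field
```

With the same nine weights, a dilation of 2 covers a 5x5 window instead of a 3x3 one. This is why dilated convolutions are commonly used to capture page-level context in segmentation networks.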
The final classification allows for the segmentation of both the text and the non-text parts. The
document is thus broken down into all text parts, tables and images when separating text from
non-text components, and into title, authors, abstract, paragraphs and their titles, header,
footer, notes, captions and lists for the segmentation of the text alone. The same classes are
found in the simultaneous segmentation of text and non-text components. A comparison with the
extensive literature available shows how this system provides an alternative overall model for DLA.