Thesis title: Explaining Datasets in Ontology-based Data Management
We deal with explaining datasets in the context of Ontology-based Data Management (OBDM). This context consists of a three-layered architecture in which an ontology layer provides a high-level, logic-based specification of a domain of interest; a data source layer stores the actual information; and a mapping layer semantically links the data sources to the specification of the domain. We study two scenarios in this context, each corresponding to a different facet of the problem of explaining datasets. The two scenarios are distinguished by the way in which the dataset is assumed to be provided. When the dataset is provided as the result of evaluating a query over the OBDM system, we assume the goal of an explanation is to show evidence that the provided dataset represents the answers to the query with respect to the OBDM system. For this reason, we refer to this problem as explaining query answers in OBDM. Alternatively, we consider the case in which the dataset is provided as such, and we assume the goal of an explanation is to provide a semantic characterization of the content of this dataset, using the knowledge of the domain of interest represented by the ontology of the OBDM system. We refer to this problem as explaining the content of datasets. We provide several contributions for both scenarios.
For the scenario of explaining query answers, we consider ontologies expressed in the popular DL-Lite family of Description Logics, and we address the problem of computing explanations for answers to queries in an OBDM system where queries are either positive, in particular conjunctive queries, or negative, i.e., negations of conjunctive queries. We provide the following contributions: (i) we propose a formal, comprehensive framework for explaining query answers in OBDM systems based on DL-Lite; (ii) we present an algorithm that, given a tuple returned as an answer to a positive query and a weighting function, examines all the explanations of the answer and chooses the best explanation according to that function; (iii) we do the same for the answers to negative queries. Notably, on the way to the latter result, we present what appears to be the first algorithm that computes the answers to negative queries in DL-Lite.
For the scenario of explaining the content of datasets, we study two variations of the problem. The first variation, which we call query characterization, aims at finding a semantic characterization of a single dataset by means of a logical expression that, when evaluated as a query over the ontology, returns exactly the dataset. The second variation, which we call query separation, generalizes the first: it aims at finding a semantic characterization of two datasets, one representing a set of positive examples and the other a set of negative examples. Such a characterization is sought by means of a logical expression that, when evaluated as a query over the ontology, returns all the positive examples and none of the negative ones. For both variations, since an expression that properly characterizes an input dataset (resp. separates two input datasets) does not always exist, our first contribution is to propose (best) approximations of the proper characterization and separation, called (minimally) complete and (maximally) sound characterizations and separations. We do this by presenting a general framework for the query characterization and separation problems in OBDM. Then, in a setting that uses the most popular languages for the OBDM paradigm, our second contribution is a comprehensive study of three natural computational problems associated with the framework: Verification, i.e., checking whether a given expression is a proper, complete, or sound characterization (resp. separation) of a given dataset (resp. of two given datasets); Existence, i.e., checking whether a proper or best approximated characterization (resp. separation) of a given dataset (resp. of two given datasets) exists at all; and Computation, i.e., computing any proper or any best approximated characterization (resp. separation) of a given dataset (resp. of two given datasets).
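To make the Verification problem concrete, the following minimal Python sketch checks a candidate expression once its answers over the OBDM system have already been computed. It is an illustration under stated assumptions, not the procedure studied in the thesis: the function names are hypothetical, the answers are passed in as a plain set of tuples, and the reading of "complete" as "the answers include the whole dataset" and of "sound" as "the answers are included in the dataset" is assumed here, since the abstract does not spell out these definitions. The proper cases follow the text above: a proper characterization returns exactly the dataset, and a proper separation returns all positive examples and none of the negative ones.

```python
from typing import Dict, Set, Tuple

Row = Tuple[str, ...]  # one tuple of constants, e.g. ("alice",)


def check_characterization(answers: Set[Row], dataset: Set[Row]) -> Dict[str, bool]:
    """Classify a candidate expression by comparing its (precomputed) answers
    with the dataset to be characterized.

    ASSUMPTION: 'complete' and 'sound' are read as set inclusions; the abstract
    does not define them formally.
    """
    complete = dataset <= answers   # every tuple of the dataset is returned
    sound = answers <= dataset      # nothing outside the dataset is returned
    return {"complete": complete, "sound": sound, "proper": complete and sound}


def is_proper_separation(answers: Set[Row],
                         positives: Set[Row],
                         negatives: Set[Row]) -> bool:
    """A proper separation returns all positive examples and no negative one."""
    return positives <= answers and answers.isdisjoint(negatives)


if __name__ == "__main__":
    # Hypothetical example data, not taken from the thesis.
    dataset = {("alice",), ("bob",)}
    answers = {("alice",), ("bob",), ("carol",)}
    print(check_characterization(answers, dataset))
    print(is_proper_separation(answers, dataset, {("dave",)}))
```

In this sketch the hard part, computing the answers of the expression with respect to the ontology and mappings, is deliberately left outside the functions; only the final comparison step is shown.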
Finally, we discuss possible customization strategies for the problem of explaining datasets in OBDM. These strategies apply to both scenarios we deal with, and they aim at solutions that best fit a set of user-defined criteria describing how comprehensible an explanation is for that specific user.