Thesis title: Improving Exploration in Sparse-Reward Environments through Reachability Bonuses
Sparse-reward environments are famously challenging for deep reinforcement learning (DRL) algorithms. Several general-purpose DRL methods that perform outstandingly otherwise fail utterly when the rewards they receive are sparse. Nevertheless, the prospect of solving intrinsically sparse-reward tasks in an end-to-end fashion, without any expert intervention, is highly appealing: it would simplify and possibly even speed up the overall process of obtaining a solution for a given problem, and it would circumvent the assimilation of detrimental human biases into that solution. This aspiration has recently led to the development of numerous DRL algorithms able to handle sparse-reward environments automatically to some extent, though none of them constitutes a definitive solution yet. Ultimately, it is the realization that end-to-end learning with sparse rewards is a purposeful and still open research problem that motivates this thesis to look deeper into the topic. The focus of the present work is twofold: first, to deepen the understanding of which scenarios, and which of their features, remain out of reach for present-day techniques designed to combat reward sparsity; and second, to engineer a novel algorithm that pushes the boundary of what can be solved relative to its immediate predecessors.

To fulfill these objectives, this doctoral effort is structured into three stages. The first stage is the composition of an insightful overview that, drawing on numerous articles proposing new methods for tackling problems with sparse rewards, identifies several environmental features that decidedly impact, positively or negatively, the performance of such techniques. One of the many merits of this compilation is that it pinpoints quite a few gaps in the existing literature.
Namely, these are environments where better algorithms are needed, as well as environments where, even more critically, further experimentation is needed, since their difficulties have been theorized but not yet thoroughly tested. In its second stage, addressing this latter research gap, the present thesis conducts four benchmarking studies. Each of them focuses on a particular environmental property deemed a priori difficult to handle in the absence of reinforcement and that, importantly, had not been systematically assessed until now. In practice, every benchmark evaluates a handful of state-of-the-art methods in one or more novel sparse-reward domains, created specifically within this work so that each embodies exactly one of the aforementioned features. Across the board, these studies demonstrate that current algorithms are not yet suitable for dealing with any of the four examined properties. This is not all bad news, however: from another perspective, these studies also point to four open problems whose solution demands fresh ideas and may lead to exciting new research directions, some of which are already discussed in this document.

The last stage involves the improvement of a contemporary technique that, during the execution of the prior benchmarking efforts, was found to have a blatant weakness in the presence of a particular environmental feature. To overcome this issue, this thesis reformulates the exploration apparatus of said algorithm, and in particular its bonus scheme. Finally, experiments carried out in various custom-made domains demonstrate the success of these modifications: not only is the improved method capable of efficiently solving tasks containing the reported property, but it also retains the benefits already provided by its precursor.
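To make the notion of a reachability-style exploration bonus concrete, the following is a minimal sketch of the general idea, not the thesis's actual algorithm: the agent keeps an episodic memory of state embeddings and grants an intrinsic bonus whenever the current state lies farther than a threshold from everything stored so far, i.e. when it appears hard to reach from previously visited states. The class name, the Euclidean distance in embedding space, and the fixed threshold are all illustrative assumptions.

```python
import math


class ReachabilityBonus:
    """Illustrative episodic-memory bonus: reward states that appear
    hard to reach from any state stored in the current episode.
    A hypothetical sketch, not the method developed in the thesis."""

    def __init__(self, threshold=1.0, bonus_scale=1.0):
        self.threshold = threshold      # distance beyond which a state counts as novel
        self.bonus_scale = bonus_scale  # magnitude of the intrinsic reward
        self.memory = []                # state embeddings seen this episode

    def reset(self):
        """Clear the episodic memory at the start of each episode."""
        self.memory.clear()

    def compute(self, embedding):
        """Return the intrinsic bonus for `embedding` and store it in memory."""
        emb = tuple(embedding)
        if not self.memory:
            self.memory.append(emb)
            return self.bonus_scale     # the first state is always novel
        # Novelty proxy: Euclidean distance to the closest remembered state.
        novelty = min(math.dist(emb, m) for m in self.memory)
        bonus = self.bonus_scale if novelty > self.threshold else 0.0
        self.memory.append(emb)
        return bonus
```

In a training loop, this bonus would simply be added to the (mostly zero) environment reward, so that the agent receives a learning signal even before stumbling upon the sparse extrinsic one.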