Thesis title: Markov Representations: Learning in MDP Abstractions and Non-Markovian Environments
One of the main capabilities we expect AI agents to possess is autonomous decision-making in complex environments. Reinforcement Learning (RL) is a very general formulation of this learning problem, because it trains agents through repeated attempts and numerical feedback. Thanks to the little prior knowledge it requires, RL already has a significant record of successes in many fields, including robotics, strategy games, finance, advertising, and, more recently, the fine-tuning of machine learning models such as Large Language Models.
Despite many efforts, improving the efficiency and generality of RL algorithms remains a very relevant research topic to this day. While efficiency is a widely shared objective among RL researchers, the development of general RL algorithms remains much less explored in comparison. This should not be attributed to a lack of interest from the community; rather, it is mainly due to the intrinsic complexity of learning in non-Markovian environments. However, both of these important research directions share one common need: the selection of appropriate, Markovian representations of the environment state. In Markov Decision Processes (MDPs), which are Markovian by definition, such a selection is often the intended result of the abstraction process, a central concept in Hierarchical Reinforcement Learning (HRL). In non-Markovian environments, on the other hand, a Markov state is not available from the start and must be constructed.
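For reference, the Markovian representations mentioned above are those satisfying the standard Markov property: the current state and action make the past irrelevant for predicting the next state. Using the usual (here assumed) notation $s_t$ and $a_t$ for the state and action at time $t$, this reads
\[
\Pr(s_{t+1} \mid s_t, a_t) \;=\; \Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0).
\]
In non-Markovian environments this condition fails for the raw observations, which is why a sufficient state must be constructed from the interaction history.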
This thesis addresses both of these complementary directions. In the first part of this work, I explore the concept of MDP abstractions in the context of HRL. Specifically, in two dedicated chapters:
(i) this thesis proposes an approach for exploiting MDP abstractions, with the objective of improving learning efficiency;
(ii) this work gives a clear formalization of how accurate and compositional MDP abstractions should be defined, helping to turn the common intuitions behind HRL into precise and applicable notions.
Then, in the second part of this work, I discuss how RL algorithms can also be applied in the presence of partial observations or complex non-Markovian dependencies. Specifically,
(i) I analyze the expressive power of a recently introduced model, the Regular Decision Process (RDP), and how it relates to the well-known Partially Observable Markov Decision Process (POMDP);
(ii) finally, the last chapter proposes an offline RL algorithm for learning near-optimal policies in RDPs, together with the associated sample-efficiency guarantees.
Both of the parts above, and this thesis as a whole, aim to contribute to the joint research effort of identifying the necessary and sufficient information for effective decision-making in RL. Selecting appropriate state representations is essential for HRL, which focuses on efficiency and abstract reasoning, as well as for RL in non-Markovian environments, where the state should preserve all relevant past events and discard those that are irrelevant for future decisions.