Conference paper, 2024

Finding good policies in average-reward Markov Decision Processes without prior knowledge

Adrienne Tuynman
  • Role: Author
  • PersonId: 1451812
Rémy Degenne
  • Role: Author
  • PersonId: 748911
  • IdHAL: remydegenne

Abstract

We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDP). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H\leq D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$, where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^2$ has been proposed, but it requires the knowledge of $H$. We first show that the sample complexity required to estimate $H$ is not bounded by any function of $S$, $A$ and $H$, ruling out the possibility of easily making the previous algorithm agnostic to $H$. By relying instead on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification that does not need any form of prior knowledge on the MDP. Its sample complexity scales as $SAD/\varepsilon^2$ in the regime of small $\varepsilon$, which is near-optimal. In the online setting, our first contribution is a lower bound which implies that a sample complexity polynomial in $H$ cannot be achieved in this setting. Then, we propose an online algorithm with a sample complexity of order $SAD^2/\varepsilon^2$, as well as a novel approach based on a data-dependent stopping rule that we believe is promising to further reduce this bound.
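
For context, the two complexity measures can be recalled as follows. These are the standard definitions from the average-reward MDP literature; the notation $r$, $P$, $g^\star$, $h^\star$ and $\tau_{s \to s'}$ is assumed here and may differ slightly from the paper's:
\[
D \;=\; \max_{s \neq s'} \, \min_{\pi} \, \mathbb{E}^{\pi}\!\left[\tau_{s \to s'}\right],
\qquad
H \;=\; \operatorname{sp}(h^\star) \;=\; \max_{s} h^\star(s) \;-\; \min_{s} h^\star(s),
\]
where $\tau_{s \to s'}$ is the first hitting time of state $s'$ starting from $s$, and $h^\star$ is the bias function of a gain-optimal policy, satisfying the Bellman optimality equation
\[
g^\star + h^\star(s) \;=\; \max_{a}\Big\{ r(s,a) + \sum_{s'} P(s' \mid s,a)\, h^\star(s') \Big\} \quad \text{for all } s.
\]
Under the usual assumption of rewards in $[0,1]$, one has $H \leq D$, which is why a sample complexity scaling with $H$ is preferable to one scaling with $D$.
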
Main file
2405.17108v1.pdf (374.3 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04809429, version 1 (28-11-2024)

Cite

Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann. Finding good policies in average-reward Markov Decision Processes without prior knowledge. NeurIPS, Dec 2024, Vancouver, Canada. ⟨hal-04809429⟩