Finding good policies in average-reward Markov Decision Processes without prior knowledge
Abstract
We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H\leq D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this setting, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity of outputting an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$, where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^2$ has been proposed, but it requires knowledge of $H$. We first show that the sample complexity required to estimate $H$ is not bounded by any function of $S$, $A$ and $H$, ruling out the possibility of easily making the previous algorithm agnostic to $H$. Relying instead on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification that does not require any form of prior knowledge on the MDP. Its sample complexity scales as $SAD/\varepsilon^2$ in the regime of small $\varepsilon$, which is near-optimal. In the online setting, our first contribution is a lower bound implying that a sample complexity polynomial in $H$ cannot be achieved. We then propose an online algorithm with a sample complexity of $SAD^2/\varepsilon^2$, as well as a novel approach based on a data-dependent stopping rule that we believe is a promising avenue for further reducing this bound.
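For reference, the two complexity measures above are usually defined as follows in the average-reward literature; these are the standard definitions, not formulas quoted from this notice, and rewards are assumed to lie in $[0,1]$, under which $\mathrm{sp}(h^*) \leq D$:

$$
D \;=\; \max_{s \neq s'} \; \min_{\pi} \; \mathbb{E}_{\pi}\big[\tau(s \to s')\big],
\qquad\qquad
H \;=\; \mathrm{sp}(h^*) \;=\; \max_{s} h^*(s) \;-\; \min_{s} h^*(s),
$$

$$
\text{where } g^* + h^*(s) \;=\; \max_{a \in \mathcal{A}} \Big\{ r(s,a) + \sum_{s'} P(s' \mid s,a)\, h^*(s') \Big\} \quad \text{for all } s.
$$

Here $\tau(s \to s')$ denotes the hitting time of $s'$ when starting from $s$ under policy $\pi$, $g^*$ is the optimal gain, and $h^*$ is an associated optimal bias function, defined up to an additive constant so that its span $H$ is well defined.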
Domains
Computer Science [cs]

Origin | Files produced by the author(s)
---|---