{"id":7662,"date":"2026-06-24T14:13:20","date_gmt":"2026-06-24T12:13:20","guid":{"rendered":"https:\/\/samovar.telecom-sudparis.eu\/?p=7662"},"modified":"2026-06-24T14:13:20","modified_gmt":"2026-06-24T12:13:20","slug":"avis-de-soutenance-de-monsieur-xinchen-han","status":"publish","type":"post","link":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/2026\/06\/24\/avis-de-soutenance-de-monsieur-xinchen-han\/","title":{"rendered":"THESIS DEFENSE NOTICE: Mr. XINCHEN HAN"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">The Doctoral School: Ecole Doctorale de l&rsquo;Institut Polytechnique de Paris<br><br>and the SAMOVAR research laboratory &#8211; Services r\u00e9partis, Architectures, Mod\u00e9lisation, Validation, Administration des R\u00e9seaux<\/h2>\n\n\n\n<p>present<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">the THESIS DEFENSE NOTICE of Mr. XINCHEN HAN<\/h2>\n\n\n\n<p>Authorized to present his work for the award of the Doctorate of the Institut Polytechnique de Paris, prepared at Institut Polytechnique de Paris T\u00e9l\u00e9com SudParis in:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Computer Science<\/h2>\n\n\n\n<h1 class=\"wp-block-heading\">\u201cA Systematic Study of Offline Reinforcement Learning: Methodological Evolution and Algorithmic Design\u201d<\/h1>\n\n\n\n<p>on FRIDAY, 26 JUNE 2026 at 14:00<\/p>\n\n\n\n<p>at<\/p>\n\n\n\n<p>Amphith\u00e9\u00e2tre 3<br>T\u00e9l\u00e9com SudParis, 19 place Marguerite Perey, 91120 Palaiseau<\/p>\n\n\n\n<p><strong>Jury members:<\/strong><\/p>\n\n\n\n<p><strong>Mr. Hossam&nbsp;AFIFI<\/strong>, Full Professor, Institut Polytechnique de Paris T\u00e9l\u00e9com SudParis, FRANCE &#8211; Thesis supervisor<br><strong>Mr. 
Michel&nbsp;MAROT<\/strong>, Full Professor, Institut Polytechnique de Paris T\u00e9l\u00e9com SudParis, FRANCE &#8211; Thesis co-supervisor<br><strong>Ms. Salima&nbsp;BENBERNOU<\/strong>, Full Professor, Universit\u00e9 Paris Descartes, FRANCE &#8211; Examiner<br><strong>Mr. Nadjib&nbsp;AITSAADI<\/strong>, Full Professor, Paris-Saclay University, FRANCE &#8211; Examiner<br><strong>Mr. Pascal&nbsp;LORENZ<\/strong>, Full Professor, University of Haute-Alsace, FRANCE &#8211; Examiner<br><strong>Mr. Lyes&nbsp;KHOUKHI<\/strong>, Full Professor, Sorbonne University, FRANCE &#8211; Reviewer<br><strong>Mr. Fabrice&nbsp;MOURLIN<\/strong>, Associate Professor, Paris-Est University, FRANCE &#8211; Reviewer<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">\u201cA Systematic Study of Offline Reinforcement Learning: Methodological Evolution and Algorithmic Design\u201d<\/h2>\n\n\n\n<h2 class=\"wp-block-heading\">presented by Mr. XINCHEN HAN<\/h2>\n\n\n\n<p><strong>Abstract:<\/strong><\/p>\n\n\n\n<p>Offline reinforcement learning (offline RL) seeks to learn decision-making policies from fixed datasets, without further interaction with the environment. It is especially appealing in domains where online data collection is costly, risky, or infeasible. However, offline RL faces fundamental challenges, including distribution shift, out-of-distribution (OOD) actions, and extrapolation error, which make reliable policy evaluation and improvement difficult. This thesis studies model-free offline RL with two goals: to understand its methodological evolution and to develop more effective offline RL algorithms. First, the thesis revisits offline RL through the lens of the deadly triad (function approximation, bootstrapping, and off-policy learning) and analyzes its central challenges. Building on this perspective, it develops an algorithmic taxonomy of offline RL and traces the evolution of its design principles. Rather than treating existing methods as isolated categories, the thesis shows how the major algorithmic families emerge and evolve in response to challenges such as value overestimation and distribution shift. Second, to resolve the tension between in-sample learning and multi-step learning, the thesis proposes Projective Implicit Q-Learning (PIQL). PIQL replaces the fixed expectile of Implicit Q-Learning (IQL) with a projection-based adaptive parameter, enabling a multi-step interpretation while preserving the in-sample learning property. In addition, PIQL adopts a support-constrained policy-improvement objective that is better aligned with the policy-evaluation stage. Theoretical analysis establishes monotonic policy improvement and a progressively stricter criterion for advantageous actions, while experiments on standard benchmarks demonstrate strong performance. 
Third, the thesis introduces the Continuous Constraint Interpolation (CCI) framework and the Automatic Constraint Policy Optimization (ACPO) algorithm to unify and adapt policy constraints. CCI provides a unified optimization view in which weighted behavior cloning (wBC), Kullback-Leibler (KL) divergence-based density regularization, and support constraints arise as special cases along a continuous constraint spectrum. Building on CCI, ACPO casts adaptive constraint learning as a primal-dual optimization problem and tunes the interpolation parameter automatically. The thesis also derives a maximum-entropy performance difference lemma, together with performance lower bounds for both the optimal policy and its parametric projection. Experiments show that ACPO achieves strong and consistent results across diverse offline RL benchmarks. Fourth, the thesis studies failure modes of offline RL methods based on conditional variational autoencoders (CVAEs) and identifies latent action projection space collapse, a phenomenon in which posterior collapse weakens downstream policy optimization. Based on this analysis, the thesis proposes Expand Latent Action Projection SpacE (ELAPSE), a simple yet effective method that enlarges the projection space. ELAPSE strengthens the coupling between generative behavior modeling and policy optimization, and empirical results show that it substantially improves the performance of representative CVAE-based offline RL methods. Overall, the thesis examines model-free offline RL from both methodological and algorithmic perspectives. At the analytical level, it clarifies the key challenges and the methodological evolution of offline RL. At the algorithmic level, it develops new methods for policy evaluation, policy improvement, and generative behavior modeling. 
Together, these contributions advance offline RL toward more reliable, adaptive, and theoretically grounded decision-making from static datasets.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Doctoral School: Ecole Doctorale de l&rsquo;Institut Polytechnique de Paris and the SAMOVAR research laboratory &#8211; Services r\u00e9partis, Architectures, Mod\u00e9lisation, Validation, Administration des R\u00e9seaux present the THESIS DEFENSE NOTICE of Mr. XINCHEN HAN Authorized to present his work for the award of the Doctorate of the Institut Polytechnique de Paris, prepared at Institut Polytechnique de Paris [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ocean_post_layout":"","ocean_both_sidebars_style":"","ocean_both_sidebars_content_width":0,"ocean_both_sidebars_sidebars_width":0,"ocean_sidebar":"","ocean_second_sidebar":"","ocean_disable_margins":"enable","ocean_add_body_class":"","ocean_shortcode_before_top_bar":"","ocean_shortcode_after_top_bar":"","ocean_shortcode_before_header":"","ocean_shortcode_after_header":"","ocean_has_shortcode":"","ocean_shortcode_after_title":"","ocean_shortcode_before_footer_widgets":"","ocean_shortcode_after_footer_widgets":"","ocean_shortcode_before_footer_bottom":"","ocean_shortcode_after_footer_bottom":"","ocean_display_top_bar":"default","ocean_display_header":"default","ocean_header_style":"","ocean_center_header_left_menu":"","ocean_custom_header_template":"","ocean_custom_logo":0,"ocean_custom_retina_logo":0,"ocean_custom_logo_max_width":0,"ocean_custom_logo_tablet_max_width":0,"ocean_custom_logo_mobile_max_width":0,"ocean_custom_logo_max_height":0,"ocean_custom_logo_tablet_max_height":0,"ocean_custom_logo_mobile_max_height":0,"ocean_header_custom_menu":"","ocean_menu_typo_font_family":"",
"ocean_menu_typo_font_size":0,"ocean_menu_typo_font_size_tablet":0,"ocean_menu_typo_font_size_mobile":0,"ocean_menu_typo_font_size_unit":"px","ocean_menu_typo_font_weight":"","ocean_menu_typo_font_weight_tablet":"","ocean_menu_typo_font_weight_mobile":"","ocean_menu_typo_transform":"","ocean_menu_typo_transform_tablet":"","ocean_menu_typo_transform_mobile":"","ocean_menu_typo_line_height":0,"ocean_menu_typo_line_height_tablet":0,"ocean_menu_typo_line_height_mobile":0,"ocean_menu_typo_line_height_unit":"","ocean_menu_typo_spacing":0,"ocean_menu_typo_spacing_tablet":0,"ocean_menu_typo_spacing_mobile":0,"ocean_menu_typo_spacing_unit":"","ocean_menu_link_color":"","ocean_menu_link_color_hover":"","ocean_menu_link_color_active":"","ocean_menu_link_background":"","ocean_menu_link_hover_background":"","ocean_menu_link_active_background":"","ocean_menu_social_links_bg":"","ocean_menu_social_hover_links_bg":"","ocean_menu_social_links_color":"","ocean_menu_social_hover_links_color":"","ocean_disable_title":"default","ocean_disable_heading":"default","ocean_post_title":"","ocean_post_subheading":"","ocean_post_title_style":"","ocean_post_title_background_color":"","ocean_post_title_background":0,"ocean_post_title_bg_image_position":"","ocean_post_title_bg_image_attachment":"","ocean_post_title_bg_image_repeat":"","ocean_post_title_bg_image_size":"","ocean_post_title_height":0,"ocean_post_title_bg_overlay":0.5,"ocean_post_title_bg_overlay_color":"","ocean_disable_breadcrumbs":"default","ocean_breadcrumbs_color":"","ocean_breadcrumbs_separator_color":"","ocean_breadcrumbs_links_color":"","ocean_breadcrumbs_links_hover_color":"","ocean_display_footer_widgets":"default","ocean_display_footer_bottom":"default","ocean_custom_footer_template":"","ocean_post_oembed":"","ocean_post_self_hosted_media":"","ocean_post_video_embed":"","ocean_link_format":"","ocean_link_format_target":"self","ocean_quote_format":"","ocean_quote_format_link":"post","ocean_gallery_link_images":"on",
"ocean_gallery_id":[],"footnotes":""},"categories":[286,402],"tags":[],"class_list":["post-7662","post","type-post","status-publish","format-standard","hentry","category-fractualites-ennews-fr","category-seminaires-ness-2013-fr","entry"],"_links":{"self":[{"href":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/wp-json\/wp\/v2\/posts\/7662","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/wp-json\/wp\/v2\/comments?post=7662"}],"version-history":[{"count":1,"href":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/wp-json\/wp\/v2\/posts\/7662\/revisions"}],"predecessor-version":[{"id":7663,"href":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/wp-json\/wp\/v2\/posts\/7662\/revisions\/7663"}],"wp:attachment":[{"href":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/wp-json\/wp\/v2\/media?parent=7662"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/wp-json\/wp\/v2\/categories?post=7662"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/wp-json\/wp\/v2\/tags?post=7662"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}