{"id":6426,"date":"2023-12-08T16:10:48","date_gmt":"2023-12-08T15:10:48","guid":{"rendered":"https:\/\/samovar.telecom-sudparis.eu\/?p=6426"},"modified":"2023-12-08T16:10:50","modified_gmt":"2023-12-08T15:10:50","slug":"seminaire-equipe-armedia-mardi-12-12-a-15h-salle-f212","status":"publish","type":"post","link":"https:\/\/samovar.telecom-sudparis.eu\/index.php\/2023\/12\/08\/seminaire-equipe-armedia-mardi-12-12-a-15h-salle-f212\/","title":{"rendered":"S\u00e9minaire \u00e9quipe ARMEDIA mardi 12\/12 \u00e0 15h, salle F212"},"content":{"rendered":"\n<p>L&rsquo;\u00e9quipe ARMEDIA vous convie \u00e0 son s\u00e9minaire de recherche sur le th\u00e8me :&nbsp;<\/p>\n\n\n\n<p><strong>\u00ab\u00a0Analyse multimodale par apprentissage profond pour la production audiovisuelle\u00a0\u00bb<\/strong><\/p>\n\n\n\n<p>pr\u00e9sent\u00e9 par Madame <strong>Kaouther OUENNICHE<\/strong>, doctorante de l&rsquo;\u00e9quipe,&nbsp;<\/p>\n\n\n\n<p><strong>le MARDI 12 D\u00c9CEMBRE 2023 \u00e0 15h00<\/strong><\/p>\n\n\n\n<p><strong>en salle F212<\/strong> \u00e0 T\u00e9l\u00e9com SudParis 9 Rue Charles Fourier &#8211; 91000 Evry-Courcouronnes<\/p>\n\n\n\n<p><strong>ou en visio<\/strong> sur le lien :<\/p>\n\n\n\n<figure class=\"wp-block-embed\"><div class=\"wp-block-embed__wrapper\">\nhttps:\/\/webconf.imt.fr\/frontend\/cat-clh-6al\n<\/div><\/figure>\n\n\n\n<p><strong>R\u00e9sum\u00e9 (English version below) :<\/strong><\/p>\n\n\n\n<p>Dans le contexte en constante \u00e9volution du contenu audiovisuel, la n\u00e9cessit\u00e9 cruciale d&rsquo;automatiser l&rsquo;indexation et l&rsquo;organisation des archives s&rsquo;est impos\u00e9e comme un objectif primordial. En r\u00e9ponse, cette recherche explore l&rsquo;utilisation de techniques d&rsquo;apprentissage profond pour automatiser l&rsquo;extraction de m\u00e9tadonn\u00e9es diverses dans les archives, am\u00e9liorant ainsi leur accessibilit\u00e9 et leur r\u00e9utilisation. La premi\u00e8re contribution de cette recherche concerne la classification des mouvements de cam\u00e9ra. Il s&rsquo;agit d&rsquo;un aspect crucial de l&rsquo;indexation du contenu, car il permet une cat\u00e9gorisation efficace et une r\u00e9cup\u00e9ration du contenu vid\u00e9o en fonction de la dynamique visuelle qu&rsquo;il pr\u00e9sente. L&rsquo;approche propos\u00e9e utilise des r\u00e9seaux neuronaux convolutionnels 3D avec des blocs r\u00e9siduels. Une approche semi-automatique pour la construction d&rsquo;un ensemble de donn\u00e9es fiable sur les mouvements de cam\u00e9ra \u00e0 partir de vid\u00e9os disponibles au public est \u00e9galement pr\u00e9sent\u00e9e, r\u00e9duisant au minimum le besoin d&rsquo;intervention manuelle. De plus, la cr\u00e9ation d&rsquo;un ensemble de donn\u00e9es d&rsquo;\u00e9valuation exigeant, comprenant des vid\u00e9os de la vie r\u00e9elle tourn\u00e9es avec des cam\u00e9ras professionnelles \u00e0 diff\u00e9rentes r\u00e9solutions, met en \u00e9vidence la robustesse et la capacit\u00e9 de g\u00e9n\u00e9ralisation de la technique propos\u00e9e, atteignant un taux de pr\u00e9cision moyen de 94 %. La deuxi\u00e8me contribution se concentre sur la t\u00e2che de Video Question Answering. Dans ce contexte, notre framework int\u00e8gre un transformers l\u00e9ger et un module de cross-modalit\u00e9. Ce module utilise une corr\u00e9lation crois\u00e9e pour permettre un apprentissage r\u00e9ciproque entre les caract\u00e9ristiques visuelles conditionn\u00e9es par le texte et les caract\u00e9ristiques textuelles conditionn\u00e9es par la vid\u00e9o. 
De plus, un sc\u00e9nario de test adversarial avec des questions reformul\u00e9es met en \u00e9vidence la robustesse du mod\u00e8le et son applicabilit\u00e9 dans le monde r\u00e9el. Les r\u00e9sultats exp\u00e9rimentaux sur MSVD-QA et MSRVTT-QA, valident la m\u00e9thodologie propos\u00e9e, avec une pr\u00e9cision moyenne de 45 % et 42 % respectivement. La troisi\u00e8me contribution de cette recherche aborde le probl\u00e8me de video captioning. Le travail introduit int\u00e8gre un module de modality attention qui capture les relations complexes entre les donn\u00e9es visuelles et textuelles \u00e0 l&rsquo;aide d&rsquo;une corr\u00e9lation crois\u00e9e. De plus, l&rsquo;int\u00e9gration de l&rsquo;attention temporelle am\u00e9liore la capacit\u00e9 du mod\u00e8le \u00e0 produire des l\u00e9gendes significatives en tenant compte de la dynamique temporelle du contenu vid\u00e9o. Notre travail int\u00e8gre \u00e9galement une t\u00e2che auxiliaire utilisant une fonction de perte contrastive, ce qui favorise la g\u00e9n\u00e9ralisation du mod\u00e8le et une compr\u00e9hension plus approfondie des relations intermodales et des s\u00e9mantiques sous-jacentes. L&rsquo;utilisation d&rsquo;une architecture de transformer pour l&rsquo;encodage et le d\u00e9codage am\u00e9liore consid\u00e9rablement la capacit\u00e9 du mod\u00e8le \u00e0 capturer les interd\u00e9pendances entre les donn\u00e9es textuelles et vid\u00e9o. La recherche valide la m\u00e9thodologie propos\u00e9e par une \u00e9valuation rigoureuse sur MSRVTT, atteignant des scores BLEU4, ROUGE et METEOR de <a href=\"callto:0,4408, 0,6291\">0,4408, 0,6291<\/a> et 0,3082 respectivement. Notre approche surpasse les m\u00e9thodes de l&rsquo;\u00e9tat de l&rsquo;art, avec des gains de performance allant de 1,21 % \u00e0 1,52 % pour les trois m\u00e9triques consid\u00e9r\u00e9es. En conclusion, ce manuscrit offre une exploration holistique des techniques bas\u00e9es sur l&rsquo;apprentissage profond pour automatiser l&rsquo;indexation du contenu t\u00e9l\u00e9visuel, en abordant la nature laborieuse et chronophage de l&rsquo;indexation manuelle. Les contributions englobent la classification des types de mouvements de cam\u00e9ra, la video question answering et la video captioning, faisant avancer collectivement l&rsquo;\u00e9tat de l&rsquo;art et fournissant des informations pr\u00e9cieuses pour les chercheurs dans le domaine. Ces d\u00e9couvertes ont non seulement des applications pratiques pour la recherche et l&rsquo;indexation de contenu, mais contribuent \u00e9galement \u00e0 l&rsquo;avancement plus large des m\u00e9thodologies d&rsquo;apprentissage profond dans le contexte multimodal.<\/p>\n\n\n\n<p><strong>Abstract :<\/strong><\/p>\n\n\n\n<p>Within the dynamic landscape of television content, the critical need to automate the indexing and organization of archives has emerged as a paramount objective. In response, this research explores the use of deep learning techniques to automate the extraction of diverse metadata from television archives, improving their accessibility and reuse. The first contribution of this research revolves around the classification of camera motion types. This is a crucial aspect of content indexing as it allows for efficient categorization and retrieval of video content based on the visual dynamics it exhibits. The novel approach proposed employs 3D convolutional neural networks with residual blocks, a technique inspired by action recognition methods. 
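For readers unfamiliar with this kind of architecture, the sketch below shows what a generic 3D residual block and a tiny clip classifier might look like in PyTorch. The layer sizes, kernel shapes, and the set of motion classes are illustrative assumptions, not the configuration actually used in the thesis.

```python
# Illustrative sketch only: a generic 3D residual block and a small classifier head.
# Hyperparameters and class labels are placeholders, not the reported architecture.
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """A basic 3D conv block with an identity shortcut (ResNet-style)."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_channels)
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Project the shortcut when the spatial/temporal or channel shape changes.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv3d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm3d(out_channels),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

class CameraMotionClassifier(nn.Module):
    """Tiny 3D-CNN classifier over a clip of shape (B, C, T, H, W)."""
    def __init__(self, num_classes: int = 4):  # e.g. pan/tilt/zoom/static (assumed labels)
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True))
        self.layer1 = ResidualBlock3D(32, 64, stride=2)
        self.layer2 = ResidualBlock3D(64, 128, stride=2)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, clip):
        x = self.layer2(self.layer1(self.stem(clip)))
        return self.fc(self.pool(x).flatten(1))

# Example: a batch of two 16-frame RGB clips at 112x112 resolution.
logits = CameraMotionClassifier()(torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 4])
```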
A semi-automatic approach for constructing a reliable camera motion dataset from publicly available videos is also presented, minimizing the need for manual intervention. Additionally, the creation of a challenging evaluation dataset, comprising real-life videos shot with professional cameras at varying resolutions, underlines the robustness and generalization power of the proposed technique, which achieves an average accuracy of 94%.

The second contribution centers on the demanding task of Video Question Answering (VideoQA). In this context, we explore the effectiveness of attention-based transformers for facilitating grounded multimodal learning. The challenge lies in bridging the gap between the visual and textual modalities while mitigating the quadratic complexity of transformer models. To address these issues, a novel framework is introduced that incorporates a lightweight transformer and a cross-modality module. This module leverages cross-correlation to enable reciprocal learning between text-conditioned visual features and video-conditioned textual features. Furthermore, an adversarial testing scenario with rephrased questions highlights the model's robustness and real-world applicability. Experimental results on benchmark datasets such as MSVD-QA and MSRVTT-QA validate the proposed methodology, with average accuracies of 45% and 42%, respectively, representing notable improvements over existing approaches.

The third contribution of this research addresses the multimodal video captioning problem, another critical aspect of content indexing. The introduced framework incorporates a modality-attention module that captures the intricate relationships between visual and textual data using cross-correlation. Moreover, the integration of temporal attention enhances the model's ability to produce meaningful captions that take the temporal dynamics of video content into account. The work also incorporates an auxiliary task employing a contrastive loss function, which promotes model generalization and a deeper understanding of inter-modal relationships and underlying semantics. The use of a transformer architecture for encoding and decoding significantly enhances the model's capacity to capture interdependencies between text and video data. The methodology is validated through rigorous evaluation on the MSRVTT benchmark, achieving BLEU-4, ROUGE, and METEOR scores of 0.4408, 0.6291, and 0.3082, respectively. This approach consistently outperforms state-of-the-art methods, with performance gains ranging from 1.21% to 1.52% across the three metrics considered.
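Both the VideoQA and captioning contributions rely on cross-correlation between visual and textual token features. One plausible minimal reading of this "reciprocal learning" idea is a co-attention-style affinity matrix, sketched below in PyTorch; the function name and dimensions are hypothetical, and the exact module described in the thesis may differ.

```python
# Minimal co-attention-style sketch of cross-correlation fusion, assuming both modalities
# have already been projected to a common dimension d. An interpretation, not the thesis module.
import torch
import torch.nn.functional as F

def cross_correlate(video_feats: torch.Tensor, text_feats: torch.Tensor):
    """
    video_feats: (B, Tv, d) frame/clip token features
    text_feats:  (B, Tt, d) word token features
    Returns text-conditioned visual features and video-conditioned textual features.
    """
    d = video_feats.size(-1)
    # Cross-correlation (affinity) matrix between every video token and every text token.
    affinity = torch.bmm(video_feats, text_feats.transpose(1, 2)) / d ** 0.5   # (B, Tv, Tt)
    # Each video token attends over the words, and each word attends over the video tokens.
    video_given_text = torch.bmm(F.softmax(affinity, dim=-1), text_feats)      # (B, Tv, d)
    text_given_video = torch.bmm(F.softmax(affinity.transpose(1, 2), dim=-1),
                                 video_feats)                                   # (B, Tt, d)
    return video_given_text, text_given_video

# Example with random features: 8 video tokens, 12 text tokens, d = 256.
v, t = torch.randn(2, 8, 256), torch.randn(2, 12, 256)
v_cond, t_cond = cross_correlate(v, t)
print(v_cond.shape, t_cond.shape)  # torch.Size([2, 8, 256]) torch.Size([2, 12, 256])
```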
In conclusion, this manuscript offers a holistic exploration of deep-learning-based techniques for automating television content indexing, addressing the labor-intensive and time-consuming nature of manual indexing. The contributions encompass camera motion type classification, VideoQA, and multimodal video captioning, collectively advancing the state of the art and providing valuable insights for researchers in the field. These findings not only have practical applications for content retrieval and indexing but also contribute to the broader advancement of deep learning methodologies in the multimodal context.
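As a final illustration, the auxiliary contrastive objective mentioned for the captioning contribution could, assuming a standard InfoNCE-style formulation with in-batch negatives (the abstract does not specify the exact form), look roughly like the following sketch.

```python
# Rough sketch of an auxiliary contrastive objective over pooled video and caption
# embeddings. The actual loss used in the thesis is not specified in the abstract,
# so the InfoNCE-style formulation below is only an assumption.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (B, d) pooled embeddings of matching video/caption pairs."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature              # (B, B) similarities of all pairs
    targets = torch.arange(v.size(0))           # the i-th video matches the i-th caption
    # Symmetric cross-entropy over the video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = contrastive_loss(torch.randn(4, 256), torch.randn(4, 256))
print(float(loss))
```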