Overview
The SESAME Research Day complements the monthly SESAME seminars and is supported by the national IN-OVIVE network. This first edition aims to give young researchers the opportunity to present the progress of their work in AI (machine learning, reasoning and knowledge representation, natural language processing, knowledge engineering, etc.) in person to the entire Montpellier research community. Presentations of research work by PhD students, postdocs, and engineers are welcome.
The SESAME Research Day will take place on 20 October 2025 on the La Gaillarde campus: Amphithéâtre 2, Bâtiment 2 bis, 2 Place Pierre Viala, 34000 Montpellier.
Program
9:15-9:45: Welcome coffee, offered by les Halles de l'IA
9:45-11:00: Invited speaker
Claudia D'Amato, University of Bari, Italy (1 hour, including 10 minutes for questions)
- Title: "Link Prediction and Explanations on Knowledge Graphs: Issues and Perspectives"
- Abstract: Knowledge Graphs (KGs) are receiving increasing attention from both academia and industry, as they represent a source of graph-based structured knowledge of unprecedented dimension, to be exploited in a multitude of application domains and research fields. Despite their wide usage, it is well known that KGs suffer from incompleteness and noise, being the result of a complex building process. Significant research efforts are currently devoted to improving the coverage and quality of existing KGs, particularly focusing on the link prediction task, by adopting numeric Machine Learning (ML) solutions that have proved to scale to very large KGs. These solutions are usually grounded on the graph structure and generally consist of series of numbers without any obvious human interpretation, thus possibly affecting the interpretability, the explainability and sometimes the trustworthiness of the results. Nevertheless, KGs may rely on expressive representation languages, e.g. RDFS and OWL (ultimately grounded on Description Logics), that are also endowed with deductive reasoning capabilities, but both expressiveness and reasoning are most of the time disregarded by the majority of the numeric methods developed so far. A similar issue applies when trying to explain link predictions on KGs. This talk will argue for the role and added value that semantics may have for link prediction solutions, and will address the importance of taking semantics into account when computing explanations for link predictions. Research directions on empowering link prediction and explanation solutions by injecting background knowledge will then be presented. Finally, the urgent problem of evaluating explanations for link prediction results will be analyzed, and a unified protocol for evaluating link prediction explanations will be presented.
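The numeric, embedding-based link prediction methods the abstract refers to typically score candidate triples geometrically. As an illustration only (not the speaker's method), here is a minimal TransE-style scoring sketch; all entities, relations, and vector values are hypothetical:

```python
import numpy as np

# Toy embeddings for entities and relations (hypothetical values).
entity = {
    "Montpellier": np.array([0.9, 0.1]),
    "France":      np.array([1.0, 0.9]),
    "Germany":     np.array([0.1, 0.8]),
}
relation = {"locatedIn": np.array([0.1, 0.8])}

def transe_score(h, r, t):
    """TransE intuition: a triple (h, r, t) is plausible when h + r is close
    to t, so a smaller distance means a more plausible link."""
    return float(np.linalg.norm(entity[h] + relation[r] - entity[t]))

# Rank candidate tails for the incomplete triple (Montpellier, locatedIn, ?).
candidates = ["France", "Germany"]
ranked = sorted(candidates, key=lambda t: transe_score("Montpellier", "locatedIn", t))
print(ranked[0])  # lowest-distance, i.e. most plausible, candidate
```

Such scores are exactly the "series of numbers without any obvious human interpretation" the talk discusses: the ranking is usable, but nothing in the vectors explains why one candidate wins.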
11:00-11:15: Break
11:15-12:30: Postdoc session
(max 20 minutes: 15-minute presentation and 5 minutes for questions)
- Felipe VARGAS ROJAS (postdoc, IRD)
- Title: "OneForestKB: A Knowledge Base for Global Forest Data Curation and Exploitation Using Geospatial Services"
- Abstract: Global forests are critical for carbon sequestration and for facing climate change challenges. Forests are distributed across the planet, and both local and international organizations have led efforts to develop information systems to store and manage forestry data. Such data include spatio-temporal observations and metadata about tree species and their traits, for example the trunk diameter at breast height. Forestry data are by nature heterogeneous, multi-scale, and multi-source. In practice, however, forest data often remain stored in local installations and are rarely shared as open data, thereby neglecting the Findable, Accessible, Interoperable, and Reusable (FAIR) principles. The Semantic Web community has contributed several standards in related areas, such as phenotypic descriptions of species (OBO/PATO), spatial objects (GeoSPARQL), observations and measurements (SOSA), and units of measurement (QUDT), among others. However, to our knowledge, there is a lack of comprehensive studies demonstrating how these standards can be arranged and combined to facilitate the curation, reuse, and exploitation of forestry datasets. To bridge this gap, we propose the OneForest Knowledge Base (OneForestKB), which is based on a semantic profile and novel methods to validate and enrich forestry datasets. We follow a strategy of reusing existing ontologies, as the definition of a semantic profile suggests. The relevance of this strategy is demonstrated through two use cases in the Amazonian forest of French Guiana, showing how to improve data quality through spatial validation rules based on the W3C constraint language SHACL combined with the OGC standard GeoSPARQL; the SHACL rules are generalised and shared so as to be applicable to any system using the GeoSPARQL model. In addition, this work provides a set of data enrichment rules relevant to forestry studies, which make it possible to enrich geographic regions with the calculation of ecological indexes as proxies of biodiversity. OneForestKB is designed to be extensible, allowing new validations and inferences to be added based on specific use cases.
- Reference: https://www.semantic-web-journal.net/content/oneforestkb-knowledge-base-global-forest-data-curation-and-exploitation-using-geospatial
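The work above expresses spatial validation rules in SHACL combined with GeoSPARQL. As a language-neutral illustration of what such a rule checks, here is a minimal stdlib point-in-polygon test flagging tree records whose coordinates fall outside their declared plot; the coordinates and identifiers are hypothetical, and this sketch stands in for, not reproduces, the actual SHACL/GeoSPARQL machinery:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: the point is inside when a horizontal ray
    crosses the polygon boundary an odd number of times."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's latitude
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical plot boundary (lon, lat) and two tree records.
plot = [(-52.9, 4.8), (-52.8, 4.8), (-52.8, 4.9), (-52.9, 4.9)]
trees = {"tree_001": (-52.85, 4.85), "tree_002": (-51.0, 4.85)}

# A spatial validation rule: every tree must lie within its declared plot.
violations = [tid for tid, (lon, lat) in trees.items()
              if not point_in_polygon(lon, lat, plot)]
print(violations)
```

In OneForestKB the analogous check would be a SHACL shape calling GeoSPARQL spatial functions, so it applies declaratively to any GeoSPARQL-modelled dataset.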
- Filipi SOARES (postdoc, MISTEA INRAE)
- Title: "Exploring a Large Language Model for Transforming Taxonomic Data into OWL: Lessons Learned and Implications for Ontology Development"
- Abstract: Managing scientific names in ontologies that represent species taxonomies is challenging due to the ever-evolving nature of these taxonomies. Manually maintaining these names becomes increasingly difficult when dealing with thousands of scientific names. To address this issue, this paper investigates the use of ChatGPT-4 to automate the development of the Organism module in the Agricultural Product Types Ontology (APTO) for species classification. Our methodology involved leveraging ChatGPT-4 to extract data from the GBIF Backbone API and generate OWL files for further integration into APTO. Two alternative approaches were explored: (1) issuing a series of prompts for ChatGPT-4 to execute tasks via the BrowserOP plugin and (2) directing ChatGPT-4 to design a Python algorithm to perform analogous tasks. Both approaches rely on a prompting method where we provide instructions, context, input data, and an output indicator. The first approach showed scalability limitations, while the second used the Python algorithm to overcome these challenges but struggled with typographical errors in data handling. This study highlights the potential of large language models like ChatGPT-4 to streamline the management of species names in ontologies. Despite certain limitations, these tools offer promising advancements in automating taxonomy-related tasks and improving the efficiency of ontology development.
- Reference: https://doi.org/10.3724/2096-7004.di.2025.0020
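The pipeline described above turns GBIF Backbone taxonomy records into OWL. A minimal sketch of that conversion step, assuming a record already fetched from the GBIF Backbone API; the record values, base IRI, and rank list below are illustrative stand-ins, not APTO's actual schema:

```python
# Hypothetical record in the shape returned by the GBIF Backbone
# species-match endpoint; the values here are illustrative.
record = {
    "scientificName": "Malus domestica",
    "kingdom": "Plantae",
    "family": "Rosaceae",
    "genus": "Malus",
}

# Ranks ordered from most general to most specific.
RANKS = ["kingdom", "family", "genus", "scientificName"]

def to_owl(rec, base="http://example.org/apto#"):
    """Emit Turtle declaring each taxon as an OWL class and chaining
    rdfs:subClassOf along the taxonomic ranks."""
    lines = ["@prefix : <%s> ." % base,
             "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
             "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> ."]
    parent = None
    for rank in RANKS:
        name = rec[rank].replace(" ", "_")
        lines.append(":%s a owl:Class ." % name)
        if parent:
            lines.append(":%s rdfs:subClassOf :%s ." % (name, parent))
        parent = name
    return "\n".join(lines)

turtle = to_owl(record)
print(turtle)
```

The paper's point is that this deterministic step is where LLM-written code struggled with typographical errors in the raw data, e.g. misspelled scientific names producing spurious classes.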
- Davide DI PIERRO (postdoc, MISTEA INRAE)
- Title: "OntoPFAS: an Ontology for PFAS Representation"
- Abstract: Referred to as 'forever chemicals', polyfluoroalkyl substances (PFAS) represent one of the major environmental concerns today. They are a family of chemical compounds built around a chain of fluorinated carbon atoms. For decades they have been widely used in various industrial domains, including clothing, cosmetics, and other everyday products. However, it has recently been discovered that their persistence and toxicity negatively impact the environment and the ecosystem. Human exposure peaks through water, given that 'lightweight' PFAS tend to spread through liquids. In this context, the newspaper Le Monde has published a map showing the areas and quantities of PFAS spread across Europe.
The goal of this research is to construct and publish a formal conceptualization capable of representing the chemical properties of PFAS, their spread across the environment, their sources (companies, product uses), and human exposure. The ongoing integration of this conceptualization with the data published in the PFAS Data Hub aims to provide a shared building block for all the communities willing to work in this domain. The objective is also to exploit the capability of this knowledge to be used with both Machine Learning and formal reasoning solutions. This project raises the challenge of constructing an ontology in a field of interest to many domains (biology, chemistry, geography, etc.). The Linked Open Terms methodology has been explored and tailored for this task, with an ongoing ontology that already covers most of the main aspects of the domain.
12:30-14:00: Lunch
14:00-15:45: Postdocs and final-year PhD students session
(max 20 minutes: 15-minute presentation and 5 minutes for questions)
- Luc POMME CASSIEROU (postdoc, ADVANSE LIRMM)
- Title: "Visualization for the explanation of deep neural networks"
- Abstract: Understanding the world around us involves collecting and analyzing complex data, often too voluminous to be interpreted by humans alone. This is where artificial intelligence, in particular machine and deep learning, comes in. Although very effective at extracting knowledge from data, these algorithms raise ethical questions, because their predictions are often opaque. The field of explainable AI (XAI) seeks to make their workings more transparent. In this presentation, we focus on a data-centric visual approach to analyzing the predictions proposed by a neural network. These tools aim to help understand and compare models, even for non-experts.
- Théophile MANDON (3rd-year PhD student, ADVANSE LIRMM)
- Title: "HALIFacts: Evaluating Large Language Models for Domain-Specific Fact-Checking and Their Carbon Impact"
- Abstract: Fact-checking is a problem of growing importance, notably with the spread of false information about elections or the COVID-19 pandemic. As false information becomes more and more common, it becomes relevant to find technological means of helping experts verify information. Large language models (LLMs), such as the one behind ChatGPT, can be used to generate text, and therefore disinformation, but can they also help fight this disinformation?
In this presentation, we show that using LLMs yields results as good as or better than the state of the art, at a reduced training cost in CO2. We present HALIFacts, a processing pipeline for training LLMs that achieves better results than smaller models while using less data, for shorter and more efficient training.
- Matthieu DE CASTELBAJAC (3rd-year PhD student, ADVANSE LIRMM)
- Title: "Deep Learning for Species Recognition under High Uncertainty: Application to jellyfish images"
- Abstract: Citizen science records are a valuable source of biodiversity data, and even more essential to help track mobile marine species like jellyfish. However, these records can be highly uncertain, containing many potential errors and biases. They are typically validated by experts, which is impractical at scale. Although deep learning methods for automatic validation have shown promising results, they fail to account for the uncertainty present in both the input data and their predictions. Here, we present a semi-automated method to support record validation at scale while providing strong statistical guarantees, even with highly uncertain citizen science records.
- Juliette DEFOSSEZ (engineer, IATE, INRAE, Montpellier) (15 min)
- Title: "Linking agri-food databases through fuzzy matching"
- Abstract: Research on food today requires taking into account simultaneously the multiple dimensions of food products. Such work draws on databases covering French household consumption, in-store labeling, and agricultural production. These databases have heterogeneous granularities and vocabularies. Moreover, correspondence keys allowing one to move from one database to another rarely exist, which makes matching long and tedious given the amount of data to link. The work we will present is a response to this problem: it uses NLP tools (BERT, LLMs, etc.) to analyze the textual descriptions of the items in each database and determine when two products from two different agri-food databases are close enough to be paired.
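The talk relies on NLP tools (BERT, LLMs) to compare textual product descriptions. As a minimal stand-in, the same pairing logic can be sketched with plain string similarity; the product names and the 0.5 threshold are hypothetical, and embedding-based similarity would replace this in the actual approach:

```python
import difflib

# Two hypothetical agri-food databases with heterogeneous vocabularies.
household_db = ["yaourt nature", "pomme golden", "pain de mie complet"]
retail_db    = ["Pomme Golden 1kg", "Yaourt nature x4", "Baguette tradition"]

def best_match(desc, candidates, threshold=0.5):
    """Return the closest candidate by normalized string similarity,
    or None when no candidate is similar enough to pair."""
    scored = [(difflib.SequenceMatcher(None, desc.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

# Pair each household product with its closest retail counterpart.
pairs = {d: best_match(d, retail_db) for d in household_db}
print(pairs)
```

Swapping the similarity function for cosine similarity over BERT or LLM embeddings keeps the same structure while handling synonyms and paraphrases that pure string matching misses.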
- Oussama MECHHOUR (3rd-year PhD student, TETIS, AIDA, MISTEA INRAE)
- Title: "Adapting BERT and AgriBERT for Agroecology: A Small-Corpus Pretraining Approach"
- Abstract: Source variables, or observable properties, used to describe agroecological experiments are often heterogeneous, non-standardized, and multilingual, making them challenging to understand, explain, and utilize in cropping system modeling and multicriteria evaluations of agroecological system performance. A potential solution is data annotation via a controlled vocabulary, known as candidate variables, from the Agroecological Global Information System (AEGIS). However, matching source and candidate variables via their textual descriptions remains a challenging task in agroecology. Domain-general language models, such as BERT, often struggle with domain-specific tasks due to their general-purpose training data. In the literature, these models are adapted to specialized domains through further pretraining, pretraining from scratch, and/or fine-tuning on downstream tasks. However, pretraining a domain-general model on a domain-specific corpus is resource-intensive, requiring substantial time, energy, and computational resources. To the best of our knowledge, no study has further pretrained a domain-general model on a small corpus (less than 100 MB) to adapt it to a domain-specific task and evaluated it on downstream tasks without fine-tuning. To address these shortcomings, this paper proposes further pretraining BERT and AgriBERT on a small agroecology-related corpus. This approach is designed to be both time- and resource-efficient while enhancing domain adaptation. We evaluate the pretrained models on the task of matching source and candidate variable descriptions without fine-tuning. Our results show that our further pretrained AgriBERT (+ Experts + Core) model outperforms all others by more than 8% from P@1 to P@10. These findings show that small-scale pretraining can significantly improve performance on domain-specific tasks without requiring fine-tuning.
- Reference: MAEVa: A hybrid approach for matching agroecological experiment variables: https://doi.org/10.1016/j.nlp.2025.100180
- Reference: Adapting BERT and AgriBERT for agroecology: A small-corpus pretraining approach: https://hal.science/hal-05246352
- Slides: https://www.canva.com/design/DAGzbqJyF44/QZ4a9rXy7SAcXWvpSRhARQ/edit?utm_content=DAGzbqJyF44&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton
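The models above are compared via precision at k (P@1 to P@10) on the variable-matching task. A minimal sketch of that metric for a single source variable; the candidate names and relevance judgments below are made up for illustration:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked candidates that are relevant matches."""
    top_k = ranked[:k]
    return sum(1 for c in top_k if c in relevant) / k

# Hypothetical ranking of AEGIS candidate variables for one source variable,
# ordered by descending description similarity.
ranked = ["leaf_area_index", "canopy_cover", "plant_height", "grain_yield"]
relevant = {"leaf_area_index", "plant_height"}

print(precision_at_k(ranked, relevant, 1))
print(precision_at_k(ranked, relevant, 4))
```

In the paper's setting, these per-variable scores would be averaged over all source variables to obtain the reported P@1 through P@10.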
15:45-16:00: Break
16:00-16:45: 1st- and 2nd-year PhD students session
(max 15 minutes: 10-minute presentation and 5 minutes for questions)
- Célia BLONDIN (2nd-year PhD student, IRD)
- Title: "Hierarchical Classification for Automated Image Annotation of Coral Reef Benthic Structures"
- Abstract: Automated benthic image annotation is crucial to efficiently monitor and protect coral reefs against climate change. Current machine learning approaches fail to capture the hierarchical nature of benthic organisms covering reef substrata, i.e., coral taxonomic levels and health condition. To address this limitation, we propose to annotate benthic images using hierarchical classification. Experiments on a custom dataset from a Northeast Brazilian coral reef show that our approach outperforms flat classifiers, improving both F1 and hierarchical F1 scores by approximately 2% across varying amounts of training data. In addition, this hierarchical method aligns more closely with ecological objectives.
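The hierarchical F1 score used in this evaluation rewards predictions that are wrong at a fine level but right higher up the taxonomy. A minimal sketch under the common ancestor-overlap definition; the benthic hierarchy below is illustrative, not the authors' dataset:

```python
# Hypothetical benthic label hierarchy: child -> parent.
PARENT = {
    "Acropora_bleached": "Acropora",
    "Acropora_healthy": "Acropora",
    "Acropora": "Hard_coral",
    "Hard_coral": "Benthos",
}

def ancestors(label):
    """A label together with all of its ancestors in the hierarchy."""
    out = {label}
    while label in PARENT:
        label = PARENT[label]
        out.add(label)
    return out

def hierarchical_f1(true, pred):
    """Hierarchical F1: precision/recall over ancestor-augmented label sets,
    so a near-miss within the right genus still earns partial credit."""
    t, p = ancestors(true), ancestors(pred)
    inter = len(t & p)
    precision, recall = inter / len(p), inter / len(t)
    return 2 * precision * recall / (precision + recall)

# Wrong at the health-condition level, but right at the genus and above:
print(hierarchical_f1("Acropora_bleached", "Acropora_healthy"))
```

A flat F1 would count this prediction as simply wrong; the hierarchical score reflects that the genus, coral type, and substrate class were all correctly identified.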
- Kawther DJOUAD (1st-year PhD student, MISTEA INRAE)
- Title: "Modeling, integration, and semantic exploitation of scientific variables for multidisciplinary domains using an RDF knowledge graph"
- Abstract: Environmental and agricultural sciences (such as agroecology, plant phenotyping, etc.) generate large volumes of experimental data. These data are often expressed through measurements of scientific variables (or observable properties) coming from varied sources and domains, measuring a trait or characteristic at different scales, with a specific method and unit. When structured, these scientific variables are represented according to diverse schemas and formalisms such as OBOE, I-ADOPT, SSN/SOSA, or the Crop Ontology model. The divergences in the formalisms and ontologies used to represent these variables limit interoperability, data reuse, and integration into semantic analysis systems. Hence the question: how can we design an RDF knowledge graph that unifies these representations and exploits these variables while respecting the specificities of each domain and ontology?
- Houdem ASSOUDI (1st-year PhD student, IRD)
- Title: "Combining Knowledge Graph and GenAI to explain the impact of severe climate events on the health of the Maasai herders: Focusing on User Profiling"
- Abstract: The interdisciplinary research conducted within the framework of the MOSAIC project generates multisectoral and heterogeneous data and knowledge. This diversity highlights the need to optimize the dissemination of findings through various methods and tools. This PhD focuses on leveraging Generative AI technologies, supported by both structured data (knowledge graphs) and unstructured data (raw information), to build an integrated information ecosystem. The goal is to enable different user groups, particularly local communities, to efficiently access information on demand, in formats that adapt to their specific needs and preferences, in order to stay informed and adapt to severe climate events. So far, the research has centered on studying user profiling mechanisms across different information systems, with a special focus on their application in Retrieval-Augmented Generation (RAG) systems. This includes defining the key components of a user profile, exploring how such profiles can be integrated into the information retrieval process, and examining their influence on the final generated output.
Registration
If you wish to present your work, please send your presentation proposals to danai.symeonidou [at] inrae.fr and catherine.roussey [at] inrae.fr before 30 September 2025.
If you wish to attend this day, please register as soon as possible, so that meals and breaks can be arranged, using the following link: https://evento.renater.fr/survey/inscription-a-la-journee-sesame-d-octobre-2025-bllt0005
Les Halles de l'IA embody the essence of trustworthy, ethical, sovereign, and responsible Artificial Intelligence, at the heart of the projects carried out in Montpellier by public and private stakeholders. Initiated by the Université de Montpellier in line with its "Nourrir, soigner, protéger" ("Feed, heal, protect") axes and aligned with the Medvallée dynamic of Montpellier Méditerranée Métropole, les Halles de l'IA are at once venues, a program, and a label. The research day is labeled by les Halles de l'IA.