Complexity as a regression task - IRISA_UBS
Conference paper, Year: 2024

Complexity as a regression task

La complexité vue comme une tâche de régression

Abstract

Our work aims to identify the most discriminating linguistic descriptors in order to build an efficient regression model of text complexity in French. Text complexity has not yet been studied extensively in Natural Language Processing (NLP). Existing studies address age recommendation and text readability using classification (Mesgar and Strube, 2018; Balyan et al., 2020) and using regression (Bayot and Gonçalves, 2017; Chen et al., 2019). The regression approach shows promise over the classification approach because its predictions are more fine-grained. Our work is part of the ANR project TextToKids, a multidisciplinary project bringing together experts in linguistics, psycholinguistics, and NLP. This project tackles the problem of how children aged 7 to 12 access the informational content of texts from diverse genres (journalistic, fictional, encyclopedic). It is thus directly concerned with the question of how to evaluate the complexity of texts for this population. To answer this question, the project focuses on how to objectively decompose complexity into elementary descriptors. The project has notably produced two automatic tools: one for extracting linguistic descriptors (Battistelli et al., 2022) and one for predicting the recommended minimum age range of a text's readers (Rahman et al., 2020, 2023), based on a corpus annotated with the age ranges proposed by publishers. The goal of this communication is to present the combined application of these two tools to a dataset of text pairs: texts manually simplified by experts, together with their original versions.
This dataset is the Alector corpus (Gala et al., 2020). We aim to show that the combination of our two automatic tools is able (1) to detect which text in a pair is the simplified one, and (2) to identify which descriptors have the greatest impact in representing the difference in complexity between the original texts and their simplified versions.
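The pairwise evaluation described above can be illustrated with a minimal sketch. It is not the project's implementation: `toy_descriptors` and `toy_age_score` are hypothetical stand-ins for the descriptor extractor and the age regressor, the three descriptors are simple surface measures chosen for illustration, and the weights are arbitrary rather than learned. The idea is only that the text with the lower predicted score is taken to be the simplified one, and that descriptors are ranked by how much they differ within the pair.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Descriptors = Dict[str, float]


def toy_descriptors(text: str) -> Descriptors:
    """Hypothetical stand-in for a linguistic descriptor extractor."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    n_words = max(len(words), 1)
    return {
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "chars_per_word": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
    }


def toy_age_score(d: Descriptors) -> float:
    """Hypothetical stand-in for an age regressor (illustrative weights, not learned)."""
    return 0.5 * d["words_per_sentence"] + 1.0 * d["chars_per_word"]


def compare_pair(
    text_a: str,
    text_b: str,
    extract: Callable[[str], Descriptors] = toy_descriptors,
    score: Callable[[Descriptors], float] = toy_age_score,
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Return which text ("a" or "b") scores as simpler, plus ranked descriptors.

    The first element is None when both texts receive the same score.
    Descriptors are ranked by relative absolute difference, largest first.
    """
    d_a, d_b = extract(text_a), extract(text_b)
    s_a, s_b = score(d_a), score(d_b)
    simplified = None if s_a == s_b else ("a" if s_a < s_b else "b")

    ranked = sorted(
        (
            (name, abs(d_a[name] - d_b[name]) / max(abs(d_a[name]), abs(d_b[name]), 1e-9))
            for name in d_a
        ),
        key=lambda item: item[1],
        reverse=True,
    )
    return simplified, ranked


if __name__ == "__main__":
    original = (
        "Le vétérinaire expérimenté examina minutieusement l'animal blessé, "
        "puis il recommanda immédiatement une intervention chirurgicale."
    )
    simpler = "Le docteur regarde le chat. Le chat a mal. Il faut le soigner."
    which, ranking = compare_pair(original, simpler)
    print("simplified text:", which)
    for name, diff in ranking:
        print(f"{name}: {diff:.2f}")
```

In a real setting, `extract` and `score` would be replaced by the actual tools, and the descriptor ranking would be aggregated over all pairs of the corpus rather than read from a single pair.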
Main file
NgoBechetBattistelli Complexity 2024.pdf (343.67 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04774618, version 1 (08-11-2024)

Cite

Trung Hieu Ngo, Nicolas Béchet, Delphine Battistelli. Complexity as a regression task. La complexité dans les sciences du langage, Dec 2024, Paris, France. ⟨10.48550/arXiv.2308.10586⟩. ⟨hal-04774618⟩