Complexity as a regression task

Trung Hieu Ngo; Nicolas Béchet; Delphine Battistelli

doi:10.48550/arXiv.2308.10586

Conference paper, Year: 2024

French title: La complexité vue comme une tâche de régression

Trung Hieu Ngo

Role: Author

Laboratoire des Sciences du Numérique de Nantes

Nicolas Béchet

Role: Author
PersonId : 181774
IdHAL : nicolas-bechet
ORCID : 0000-0001-9425-5570
IdRef : 142928879

Institut de Recherche en Informatique et Systèmes Aléatoires

Delphine Battistelli

Role: Author
PersonId : 960768

Modèles, Dynamiques, Corpus

Abstract

Our work aims to identify the most discriminating linguistic descriptors in order to build an efficient regression model of text complexity in French. Text complexity has not yet been studied extensively in Natural Language Processing (NLP). Existing studies address age recommendation and text readability using classification (Mesgar and Strube, 2018; Balyan et al., 2020) or regression (Bayot and Gonçalves, 2017; Chen et al., 2019). The regression approach shows promise over classification because its predictions are more fine-grained. Our work is part of the ANR project TextToKids, a multidisciplinary project bringing together experts in linguistics, psycholinguistics, and NLP. The project addresses the problem of how children aged 7 to 12 access the informational content of texts from diverse genres (journalistic, fictional, encyclopedic). It is therefore directly concerned with how to evaluate the complexity of texts for this population. To answer this question, the project focuses on describing complexity objectively in terms of elementary descriptors. Among its main achievements are two automatic tools: one for extracting linguistic descriptors (Battistelli et al., 2022) and one for predicting the recommended minimum age range of a text's readers (Rahman et al., 2020, 2023), built from a corpus annotated with the age ranges proposed by publishers. The goal of this presentation is to apply the combination of these two tools to a dataset of text pairs: texts manually simplified by experts together with their original versions. This dataset is the Alector corpus (Gala et al., 2020). We aim to show that the combination of our two tools can (1) detect which text in a pair is the simplified one, and (2) identify which descriptors contribute most to the difference in complexity between the original texts and their simplified versions.
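
The two evaluation goals stated in the abstract can be illustrated with a minimal sketch. It assumes that a regression model has already produced one complexity score per text and that each text has already been converted into a dictionary of numeric descriptor values. The function names, the descriptor names, and the ranking criterion (mean difference between original and simplified versions) are hypothetical and chosen for illustration; they are not taken from the TextToKids tools or from the paper.

```python
from statistics import mean


def detect_simplified(score_a, score_b):
    """Guess which text of a pair is the simplified one.

    Assumption: a lower predicted complexity score means a simpler text.
    Returns "a", "b", or None when the two scores are equal.
    """
    if score_a == score_b:
        return None
    return "a" if score_a < score_b else "b"


def rank_descriptors(pairs):
    """Rank descriptors by how much they change under simplification.

    `pairs` is a list of (original, simplified) tuples, where each element
    is a dict mapping a descriptor name to a numeric value.
    Returns (name, mean_difference) tuples sorted by absolute mean
    difference, largest first. Raw differences depend on each descriptor's
    scale, so real data would likely need standardisation first.
    """
    # Only compare descriptors that are present in every text of every pair.
    names = set.intersection(
        *(set(orig) & set(simp) for orig, simp in pairs)
    )
    diffs = {
        name: mean(orig[name] - simp[name] for orig, simp in pairs)
        for name in names
    }
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)
```

With this sketch, goal (1) amounts to checking how often `detect_simplified` picks the expert-simplified text in each pair, and goal (2) to reading the top of the list returned by `rank_descriptors`.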

Keywords

complexity; regression; descriptors; Natural Language Processing

Domains

Computer Science [cs]; Humanities and Social Sciences

Main file

NgoBechetBattistelli Complexity 2024.pdf (343.67 KB)

Origin: Files produced by the author(s)

Contributor: Delphine Battistelli

https://hal.science/hal-04774618

Submitted on: Friday, 8 November 2024, 18:19:00

Last modified on: Tuesday, 19 November 2024, 16:35:08

Dates and versions

hal-04774618 , version 1 (08-11-2024)

Identifiers

HAL Id : hal-04774618 , version 1
DOI : 10.48550/arXiv.2308.10586

Cite

Trung Hieu Ngo, Nicolas Béchet, Delphine Battistelli. Complexity as a regression task. La complexité dans les sciences du langage, Dec 2024, Paris, France. ⟨10.48550/arXiv.2308.10586⟩. ⟨hal-04774618⟩

Collections

UNIV-RENNES1 CNRS INRIA INSA-RENNES EC-NANTES IRISA MODYCO UNAM CENTRALESUPELEC UR1-MATH-STIC LS2N LS2N-TALN UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE UNIV-PARIS-LUMIERES ANR UR1-MATH-NUM UNIV-PARIS-NANTERRE NANTES-UNIVERSITE
