Segmentation in macrosyntactic units across different interaction types. A quantitative study

This contribution is part of the French-German project SegCor (Segmentation of Oral Corpora, ANR-15-FRAL-0004), which focuses on the segmentation of oral corpora. The general aim is to develop a segmentation method for oral corpora that is suited to the analysis of interactional data at different levels and by various research communities. The French and German datasets each consist of ten excerpts of ten minutes, which represent the overall diversity of the data in terms of situation types. The following recorded interactions have been studied: radio talks, meal preparations, reading activities with a child, service encounters, telephone calls, table talks, social meetings, school lessons and panel discussions. In our paper, we address the relationship between these interaction types and segmentation into maximal units. More specifically, we focus on the composition of such units in the French corpus. Several models have been proposed in previous research and discussed within the SegCor project: part-of-speech tagging and chunking via automatic annotation (Eshkol-Taravella et al. 2014); syntactic annotation relying on a dependency parser (Kahane et al. 2017); macrosyntactic segmentation into illocutionary units (Benzitoun et al. 2010; Lacheret et al. 2014); the annotation of prosodic prominences and disfluencies, leading to segmentation into intonational periods (Lacheret et al. 2014); and the annotation of Turn-Constructional Units (TCUs), i.e. the minimal, emergent and negotiable units through which participants build turns at talk in interaction (Sacks et al. 1974; Ochs et al. 1996; Traverso 2016). In this paper, we focus on the segmentation into broad units, which is grounded in the macrosyntactic model (Blanche-Benveniste et al. 1990; Blanche-Benveniste 2010a, 2010b; Lacheret et al. 2014).
We rely on the following maximal macrosyntactic units: Simple units, composed of one nucleus, which is defined as a minimal macrosyntactic component corresponding to an autonomous utterance, according to Blanche-Benveniste et al. (1990: 114); Complex units, composed of more than one nucleus (including pre-nuclei, post-nuclei and in-nuclei, i.e. sequences beyond government); Abandoned units, i.e. syntactically unfinished units. The segmentation was carried out on tokenized transcripts with the EXMARaLDA Partitur Editor. Our main aim is to assess the relevance of the number of tokens per maximal unit in our representative corpora. We therefore propose a quantitative study of the token count per maximal unit in each situation type. For example, a preliminary investigation has shown a higher rate of abandoned units in conflictual interactions (e.g. panel discussion and radio talk), due to the specificities of turn-taking. Conversely, in expert talk, i.e. a lecture delivered by a single speaker, abandoned units are very rare because of the planned character of the talk. Relying on the composition of maximal segmentation units, our contribution discusses evidence from corpus segmentation and investigates variation across different interaction types. Our approach does not conflict with previous research in corpus linguistics, such as Biber's multi-dimensional analyses of written and oral genres (Biber 1988) and of conversational text types (Biber 2004) in English, which are based on a variety of linguistic features. Rather, this contribution offers complementary dimensions for a classification of interaction types, from a quantitative perspective. We will then explore the other segmentation levels annotated in the SegCor project (syntax, prosody and interaction) to examine whether unit characterization depends on the type of interaction and whether similar trends can be observed. Statistical analyses and graphing are performed using the R software platform.
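
The two measures discussed above, token count per maximal unit and the rate of abandoned units per situation type, can be illustrated with a minimal sketch. It is written in Python for illustration only, whereas the analyses reported here are performed in R. The `MaximalUnit` record, the `summarize` function and the toy values are all hypothetical: they are not part of the SegCor tooling, and the numbers are invented, not results from the corpus.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class MaximalUnit:
    """Hypothetical record for one annotated maximal unit."""
    interaction: str  # situation type, e.g. "panel discussion"
    kind: str         # "simple", "complex" or "abandoned"
    n_tokens: int     # number of tokens in the unit


def summarize(units):
    """Per interaction type: mean tokens per unit kind and abandoned-unit rate."""
    by_interaction = defaultdict(list)
    for unit in units:
        by_interaction[unit.interaction].append(unit)

    summary = {}
    for interaction, group in by_interaction.items():
        tokens_by_kind = defaultdict(list)
        for unit in group:
            tokens_by_kind[unit.kind].append(unit.n_tokens)
        n_abandoned = len(tokens_by_kind.get("abandoned", []))
        summary[interaction] = {
            "mean_tokens": {k: mean(v) for k, v in tokens_by_kind.items()},
            "abandoned_rate": n_abandoned / len(group),
        }
    return summary


# Invented toy data for illustration; NOT SegCor results.
toy_units = [
    MaximalUnit("panel discussion", "abandoned", 2),
    MaximalUnit("panel discussion", "simple", 4),
    MaximalUnit("panel discussion", "complex", 11),
    MaximalUnit("panel discussion", "abandoned", 3),
    MaximalUnit("expert talk", "simple", 7),
    MaximalUnit("expert talk", "complex", 15),
]

for interaction, stats in summarize(toy_units).items():
    print(interaction, stats)
```

Grouping first by interaction type and then by unit kind keeps both measures on the same denominator, the number of maximal units in that situation type, so that they can be compared across interaction types.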

Keywords

Macrosyntax Spoken language corpus Multi-level annotation Spoken French Interactional linguistics

Domains

Linguistics

Contributor: Biagio Ursi

https://hal.science/hal-01927595

Submitted on: Tuesday, 20 November 2018, 00:06:06

Last modified on: Friday, 16 February 2024, 18:30:04

Dates and versions

hal-01927595, version 1 (20-11-2018)

Identifiers

HAL Id: hal-01927595, version 1

Cite

Biagio Ursi, Carole Etienne, Iris Eshkol-Taravella, Nathalie Rossi-Gensane, Luisa Acosta Córdoba, et al. Segmentation in macrosyntactic units across different interaction types. A quantitative study. 50 years of corpus linguistics on oral corpora. Its contribution to the study of variation, Nov 2018, Orléans, France. ⟨hal-01927595⟩

Collections

ENS-LYON BNF UNIV-TOURS CNRS UNIV-LYON2 UNIV-ORLEANS MODYCO ICAR LLL UNIV-PARIS-LUMIERES UDL ANR UNIV-PARIS-NANTERRE
