Multi-latency look-ahead for streaming speaker segmentation - Structuration, Analyse et Modélisation de documents Vidéo et Audio
Conference Papers, Year: 2024

Multi-latency look-ahead for streaming speaker segmentation

Abstract

We address the task of streaming speaker diarization and propose several contributions to achieve a better trade-off between latency and accuracy. First, computational latency is reduced to its bare minimum by switching to a causal frame-wise speaker segmentation architecture. Then, a multi-latency look-ahead mechanism is used during training to support adaptive latency during inference at no additional computational cost. Finally, we detail the method used during inference to achieve the final frame-wise segmentation. We evaluate the impact of these contributions on the AMI meeting dataset with a focus on the speaker segmentation step, seen through the prism of voice activity detection, overlapped speech detection and speaker change detection.
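As an illustration of how a frame-wise speaker segmentation can be viewed through the three tasks named in the abstract, the sketch below derives voice activity, overlapped speech and speaker change labels from a binary frame-by-speaker activity matrix. It is a minimal sketch and not the authors' evaluation code: the function name `derive_tasks`, the toy input, and the choice to count any change in the set of active speakers (including speech onsets and offsets) as a speaker change are all assumptions made for this example, and the paper's exact definitions may differ.

```python
from typing import List, Tuple


def derive_tasks(activity: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Derive frame-wise VAD, OSD and SCD labels (hypothetical helper).

    activity[t][k] is 1 if speaker k is active at frame t, else 0.
    """
    vad, osd, scd = [], [], []
    previous = None  # set of active speakers at the previous frame
    for frame in activity:
        active = {k for k, is_active in enumerate(frame) if is_active}
        # Voice activity: at least one speaker is active.
        vad.append(int(len(active) >= 1))
        # Overlapped speech: at least two speakers are active.
        osd.append(int(len(active) >= 2))
        # Speaker change (assumed definition): the set of active speakers
        # differs from the previous frame. The first frame is never a change.
        scd.append(int(previous is not None and active != previous))
        previous = active
    return vad, osd, scd


# Toy segmentation: 5 frames, 2 speakers.
frames = [
    [0, 0],  # silence
    [1, 0],  # speaker 0 only
    [1, 1],  # both speakers (overlap)
    [0, 1],  # speaker 1 only
    [0, 1],  # speaker 1 only
]
vad, osd, scd = derive_tasks(frames)
print(vad, osd, scd)
```

Reducing the segmentation to these three binary sequences makes it possible to score each aspect separately, which is why one model output can be assessed under all three tasks.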
Main file
rahou24_interspeech.pdf (402.31 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-04734819, version 1 (14-10-2024)

Identifiers

HAL Id: hal-04734819
DOI: 10.21437/Interspeech.2024-923

Cite

Bilal Rahou, Hervé Bredin. Multi-latency look-ahead for streaming speaker segmentation. Interspeech 2024, Sep 2024, Kos, Greece. pp.1610-1614, ⟨10.21437/Interspeech.2024-923⟩. ⟨hal-04734819⟩