End-to-end speech recognition and disfluency removal

Paria Jamshid Lou, Mark Johnson

Research output: Chapter in Book/Report/Conference proceedingConference proceeding contributionpeer-review

19 Citations (Scopus)
162 Downloads (Pure)

Abstract

Disfluency detection is usually an intermediate step between an automatic speech recognition (ASR) system and a downstream task. By contrast, this paper aims to investigate the task of end-to-end speech recognition and disfluency removal. We specifically explore whether it is possible to train an ASR model to directly map disfluent speech into fluent transcripts, without relying on a separate disfluency detection model. We show that end-to-end models do learn to directly generate fluent transcripts; however, their performance is slightly worse than a baseline pipeline approach consisting of an ASR system and a specialized disfluency detection model. We also propose two new metrics for evaluating integrated ASR and disfluency removal models. The findings of this paper can serve as a benchmark for further research on the task of end-to-end speech recognition and disfluency removal in the future.

Original languageEnglish
Title of host publicationFindings of the Association for Computational Linguistics
Subtitle of host publicationFindings of ACL EMNLP 2020
Place of PublicationStroudsburg, PA
PublisherAssociation for Computational Linguistics (ACL)
Pages2051-2061
Number of pages11
ISBN (Electronic)9781952148903
Publication statusPublished - 2020
EventFindings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020 - Virtual, Online
Duration: 16 Nov 202020 Nov 2020

Publication series

NameFindings of the Association for Computational Linguistics Findings of ACL: EMNLP 2020

Conference

ConferenceFindings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020
CityVirtual, Online
Period16/11/2020/11/20

Bibliographical note

Copyright the Publisher 2021. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Fingerprint

Dive into the research topics of 'End-to-end speech recognition and disfluency removal'. Together they form a unique fingerprint.

Cite this