Automatic domain adaptation for parsing

David McClosky*, Eugene Charniak, Mark Johnson

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference proceeding contribution

81 Citations (Scopus)

Abstract

Current statistical parsers tend to perform well only on their training domain and nearby genres. While strong performance on a few related domains is sufficient for many situations, it is advantageous for parsers to be able to generalize to a wide variety of domains. When parsing document collections involving heterogeneous domains (e.g. the web), the optimal parsing model for each document is typically not obvious. We study this problem as a new task - multiple source parser adaptation. Our system trains on corpora from many different domains. It learns not only statistics of those domains but quantitative measures of domain differences and how those differences affect parsing accuracy. Given a specific target text, the resulting system proposes linear combinations of parsing models trained on the source corpora. Tested across six domains, our system outperforms all non-oracle baselines including the best domain-independent parsing model. Thus, we are able to demonstrate the value of customizing parsing models to specific domains.

Original languageEnglish
Title of host publicationNAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference
Place of PublicationStroudsburg, PA
PublisherAssociation for Computational Linguistics (ACL)
Pages28-36
Number of pages9
ISBN (Print)1932432655, 9781932432657
Publication statusPublished - 2010
Event2010 Human Language Technologies Conference ofthe North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010 - Los Angeles, CA, United States
Duration: 2 Jun 20104 Jun 2010

Other

Other2010 Human Language Technologies Conference ofthe North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010
CountryUnited States
CityLos Angeles, CA
Period2/06/104/06/10

Fingerprint Dive into the research topics of 'Automatic domain adaptation for parsing'. Together they form a unique fingerprint.

Cite this