Automatic domain adaptation for parsing

David McClosky, Eugene Charniak, Mark Johnson

Research output: Chapter in Book/Report/Conference proceeding > Conference proceeding contribution > Research > peer-review

Abstract

Current statistical parsers tend to perform well only on their training domain and nearby genres. While strong performance on a few related domains is sufficient for many situations, it is advantageous for parsers to be able to generalize to a wide variety of domains. When parsing document collections involving heterogeneous domains (e.g., the web), the optimal parsing model for each document is typically not obvious. We study this problem as a new task: multiple source parser adaptation. Our system trains on corpora from many different domains. It learns not only statistics of those domains but also quantitative measures of domain differences and how those differences affect parsing accuracy. Given a specific target text, the resulting system proposes linear combinations of parsing models trained on the source corpora. Tested across six domains, our system outperforms all non-oracle baselines, including the best domain-independent parsing model. Thus, we are able to demonstrate the value of customizing parsing models to specific domains.
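
The idea of proposing a linear combination of source models for a given target text can be illustrated with a small sketch. This is not the authors' implementation: the unigram distributions, the cosine similarity used as the domain-difference measure, the grid of candidate weights, and the `intercept` and `slope` values of the accuracy predictor are all assumptions chosen for illustration. In the system described above, the mapping from domain differences to accuracy is learned from data rather than fixed by hand.

```python
from collections import Counter
from itertools import product
import math


def unigram_dist(text):
    """Relative word frequencies of a text (a crude stand-in for domain statistics)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def mix(dists, weights):
    """Linear combination of several word distributions."""
    mixed = Counter()
    for dist, weight in zip(dists, weights):
        for word, prob in dist.items():
            mixed[word] += weight * prob
    return dict(mixed)


def cosine(p, q):
    """Cosine similarity between two sparse distributions (assumed similarity measure)."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def predicted_accuracy(similarity, intercept=0.70, slope=0.20):
    """Hypothetical linear predictor; the coefficients here are made up."""
    return intercept + slope * similarity


def candidate_weights(n_sources, steps=4):
    """All weight vectors on a coarse grid that sum to 1."""
    for combo in product(range(steps + 1), repeat=n_sources):
        if sum(combo) == steps:
            yield tuple(c / steps for c in combo)


def propose_mixture(source_texts, target_text):
    """Pick the source-domain weights with the highest predicted accuracy."""
    names = list(source_texts)
    dists = [unigram_dist(source_texts[name]) for name in names]
    target = unigram_dist(target_text)
    best = max(
        candidate_weights(len(names)),
        key=lambda w: predicted_accuracy(cosine(mix(dists, w), target)),
    )
    return dict(zip(names, best))
```

Calling `propose_mixture` with a dictionary of source corpora (keyed by a domain name) and a target text returns one weight per source domain. A real system would use richer features than unigram overlap and would fit the predictor on held-out parsing results.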

Language: English
Title of host publication: NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference
Place of Publication: Stroudsburg, PA
Publisher: Association for Computational Linguistics (ACL)
Pages: 28-36
Number of pages: 9
ISBN (Print): 1932432655, 9781932432657
Publication status: Published - 2010
Event: 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010 - Los Angeles, CA, United States
Duration: 2 Jun 2010 to 4 Jun 2010

Cite this

McClosky, D., Charniak, E., & Johnson, M. (2010). Automatic domain adaptation for parsing. In NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference (pp. 28-36). Stroudsburg, PA: Association for Computational Linguistics (ACL).

Scopus record: http://www.scopus.com/inward/record.url?scp=84863395582&partnerID=8YFLogxK