Abstract
Current statistical parsers tend to perform well only on their training domain and nearby genres. While strong performance on a few related domains is sufficient for many situations, it is advantageous for parsers to be able to generalize to a wide variety of domains. When parsing document collections involving heterogeneous domains (e.g. the web), the optimal parsing model for each document is typically not obvious. We study this problem as a new task - multiple source parser adaptation. Our system trains on corpora from many different domains. It learns not only statistics of those domains but quantitative measures of domain differences and how those differences affect parsing accuracy. Given a specific target text, the resulting system proposes linear combinations of parsing models trained on the source corpora. Tested across six domains, our system outperforms all non-oracle baselines including the best domain-independent parsing model. Thus, we are able to demonstrate the value of customizing parsing models to specific domains.
Original language | English |
---|---|
Title of host publication | NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference |
Place of Publication | Stroudsburg, PA |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 28-36 |
Number of pages | 9 |
ISBN (Print) | 1932432655, 9781932432657 |
Publication status | Published - 2010 |
Event | 2010 Human Language Technologies Conference ofthe North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010 - Los Angeles, CA, United States Duration: 2 Jun 2010 → 4 Jun 2010 |
Other
Other | 2010 Human Language Technologies Conference ofthe North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010 |
---|---|
Country/Territory | United States |
City | Los Angeles, CA |
Period | 2/06/10 → 4/06/10 |