Mining Wikipedia Revision Histories for Improving Sentence Compression

Wednesday, January 19, 2011

 

Elif Yamangil and Rani Nelken. Mining Wikipedia Revision Histories for Improving Sentence Compression. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Columbus, Ohio, June 15-20, 2008.



A well-recognized limitation of research on supervised sentence compression is the dearth of available training data. We propose a new and bountiful resource for such training data, which we obtain by mining the revision history of Wikipedia for sentence compressions and expansions. Using only a fraction of the available Wikipedia data, we have collected a training corpus of over 380,000 sentence pairs, two orders of magnitude larger than the standardly used Ziff-Davis corpus. Using this newfound data, we propose a novel lexicalized noisy channel model for sentence compression, achieving improved results in grammaticality and compression rate criteria with a slight decrease in importance.
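As a rough illustration of what mining a revision history for compressions and expansions could look like, the sketch below labels a pair of sentence versions by checking whether one is obtained from the other purely by deleting words. The word-subsequence criterion, the whitespace tokenization, and the function names are assumptions made for this example; the abstract does not state the exact filtering rules used to build the corpus.

```python
def is_word_subsequence(shorter, longer):
    """Return True if every word of `shorter` occurs in `longer` in the same order."""
    remaining = iter(longer)
    # `word in remaining` advances the iterator until it finds `word`,
    # so later words can only match at later positions.
    return all(word in remaining for word in shorter)


def classify_revision_pair(old_sentence, new_sentence):
    """Label an (old, new) sentence pair from two revisions.

    Returns "compression" if the new sentence only drops words,
    "expansion" if it only adds words, and None otherwise.
    Whitespace tokenization is a simplifying assumption.
    """
    old_words = old_sentence.split()
    new_words = new_sentence.split()
    if len(new_words) < len(old_words) and is_word_subsequence(new_words, old_words):
        return "compression"
    if len(new_words) > len(old_words) and is_word_subsequence(old_words, new_words):
        return "expansion"
    return None


if __name__ == "__main__":
    old = "The very old bridge was closed yesterday ."
    new = "The bridge was closed ."
    print(classify_revision_pair(old, new))
    print(classify_revision_pair(new, old))
```

An expansion read in reverse is also a usable compression example, which is one reason both directions of edit are worth collecting.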

 
 
 
