Scalable Lexical Correction from Wikipedia Edits Using Perceptron Reranking
Wednesday, January 19, 2011
Elif Yamangil and Rani Nelken. Scalable Lexical Correction from Wikipedia Edits Using Perceptron Reranking. Coursework for CS 287: Natural Language Processing, Spring 2008 (unpublished).
We propose a novel model of large-scale lexical correction of all document words, covering both context-sensitive spelling correction and stylistic lexical modifications, trained on Wikipedia’s edit revisions. In this task, we aim to correct all possible errors rather than focus on a predetermined set of target words, which makes the learning problem much more difficult. Our contribution is twofold. First, we identify a new source of training data for text correction by mining Wikipedia’s edit history. Because Wikipedia articles are edited collaboratively, errors introduced by one writer are likely to be corrected subsequently by others. We mine a set of 1.5 million such correction training samples. Second, we use the Wikipedia data to train a novel text correction model that combines a generative HMM with a reranking perceptron, forming a highly effective model of correction. We evaluate our method against context-sensitive spelling correction, obtaining state-of-the-art accuracy in a more general setting.
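To make the data-mining idea concrete, here is a minimal sketch of how word-level correction pairs might be pulled from two consecutive revisions of the same sentence. It is an illustration only and not the authors' pipeline: the function name `mine_corrections`, the whitespace tokenization, and the restriction to one-for-one word replacements are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def mine_corrections(old_revision: str, new_revision: str) -> list[tuple[str, str]]:
    """Return (old_word, new_word) pairs for words replaced between two revisions.

    Hypothetical helper: whitespace tokenization and the one-for-one
    replacement filter are simplifying assumptions, not details from the paper.
    """
    old_tokens = old_revision.split()
    new_tokens = new_revision.split()

    # Align the two token sequences; autojunk is disabled so that frequent
    # words are not discarded by difflib's popularity heuristic.
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)

    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only spans where the same number of words was swapped,
        # so each old word can be paired with exactly one new word.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(old_tokens[i1:i2], new_tokens[j1:j2]))
    return pairs


if __name__ == "__main__":
    before = "I would like a peace of cake"
    after = "I would like a piece of cake"
    print(mine_corrections(before, after))  # [('peace', 'piece')]
```

Pairs collected this way would still need filtering (for example, to separate genuine corrections from content changes or vandalism) before they could serve as training samples; that step is left out of the sketch.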