Grammar Error Correction in Morphologically Rich Languages: The Case of Russian

Alla Rozovskaya, Dan Roth; Grammar Error Correction in Morphologically Rich Languages: The Case of Russian. Transactions of the Association for Computational Linguistics 2019; 7: 1–17. doi:

Abstract

Until now, most research in grammar error correction has focused on English, and the problem has hardly been explored for other languages. We address the task of correcting writing mistakes in morphologically rich languages, with a focus on Russian. We present a corrected and error-tagged corpus of Russian learner writing and develop models that make use of existing state-of-the-art methods that have been well studied for English. Although impressive results have recently been achieved for grammar error correction of non-native English writing, these results are limited to domains where abundant training data are available. Because annotation is extremely costly, these approaches are not suitable for the majority of domains and languages. We therefore focus on methods that use “minimal supervision”; that is, those that do not rely on large amounts of annotated training data, and show how existing minimal-supervision approaches extend to a highly inflectional language such as Russian. The results show that these methods are particularly useful for correcting mistakes in grammatical phenomena that involve rich morphology.

1 Introduction

This paper addresses the task of correcting errors in text. Most of the research in the area of grammar error correction (GEC) has focused on correcting mistakes made by English language learners. One standard approach to dealing with these errors, which proved highly successful in text correction competitions (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013, 2014; Rozovskaya et al., 2017), makes use of a machine-learning classifier paradigm and is based on the methodology for correcting context-sensitive spelling mistakes (Golding and Roth, 1996, 1999; Banko and Brill, 2001). In this approach, classifiers are trained for a particular mistake type: for example, preposition, article, or noun number (Tetreault et al., 2010; Gamon, 2010; Rozovskaya and Roth, 2010c, b; Dahlmeier and Ng, 2012). Originally, classifiers were trained on native English data. As several annotated learner datasets became available, models were also trained on annotated learner data.
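To make the classifier paradigm concrete, here is a minimal sketch of a confusion-set classifier for prepositions, trained only on well-formed text. It is an illustration under stated assumptions, not the authors' system: the three-word confusion set, the toy training sentences, the two-word context window, and the function names (`context_features`, `train`, `predict`, `correct`) are all hypothetical, and naive Bayes with add-one smoothing is a stand-in chosen for brevity.

```python
from collections import Counter, defaultdict
import math

# Hypothetical confusion set: the candidates the classifier chooses among.
CONFUSION_SET = {"in", "on", "at"}

def context_features(tokens, i, window=2):
    """Return position-tagged neighbor words around index i."""
    feats = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        word = tokens[j] if 0 <= j < len(tokens) else "<pad>"
        feats.append(f"{offset}:{word}")
    return feats

def train(sentences):
    """Count context features at every confusion-set position in the text."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in CONFUSION_SET:
                label_counts[tok] += 1
                for f in context_features(tokens, i):
                    feat_counts[tok][f] += 1
                    vocab.add(f)
    return label_counts, feat_counts, vocab

def predict(model, tokens, i):
    """Pick the most likely candidate via naive Bayes with add-one smoothing."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    feats = context_features(tokens, i)
    best, best_score = None, -math.inf
    for label in sorted(CONFUSION_SET):
        score = math.log((label_counts[label] + 1) / (total + len(CONFUSION_SET)))
        denom = sum(feat_counts[label].values()) + len(vocab) + 1
        for f in feats:
            score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

def correct(model, sentence):
    """Replace each confusion-set word with the classifier's prediction."""
    tokens = sentence.lower().split()
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in CONFUSION_SET:
            out[i] = predict(model, tokens, i)
    return " ".join(out)

# Toy "native" training data (invented for this example).
TRAIN = [
    "we met on monday morning",
    "she arrived on monday evening",
    "he lives in london now",
    "they work in london today",
    "we met at noon today",
    "she left at noon yesterday",
]
model = train(TRAIN)
print(correct(model, "she arrived in monday evening"))
```

On this toy data the sketch rewrites the wrong preposition to "on". A system trained on annotated learner data could additionally use the writer's original word as a feature, which a model trained only on well-formed text cannot do.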

More recently, statistical machine translation (MT) methods, including neural MT, have gained considerable popularity thanks to the availability of large annotated corpora of learner writing (e.g., Yuan and Briscoe, 2016; Chollampatt and Ng, 2018). Classification methods work well on well-defined types of errors, whereas MT is good at correcting interacting and complex types of mistakes, which makes these approaches complementary in some respects (Rozovskaya and Roth, 2016).

Thanks to the availability of large (in-domain) datasets, substantial gains in performance have been made in English grammar correction. Unfortunately, research on other languages has been scarce. Previous work includes efforts to create annotated learner corpora for Arabic (Zaghouani et al., 2014), Japanese (Mizumoto et al., 2011), and Chinese (Yu et al., 2014), and shared tasks on Arabic (Mohit et al., 2014; Rozovskaya et al., 2015) and Chinese error detection (Lee et al., 2016; Rao et al., 2017). However, building robust models in other languages has been a challenge, because an approach that relies on heavy supervision is not viable across languages, genres, and learner backgrounds. Moreover, for languages that are morphologically complex, we may need more data to address the lexical sparsity.

This work focuses on Russian, a highly inflectional language from the Slavic group. Russian has over 260M speakers, for 47% of whom Russian is not their native language. We corrected and error-tagged over 200K words of non-native Russian texts. We use this dataset to build several grammar correction systems that draw on and extend the methods that showed state-of-the-art performance on English grammar correction. Because the size of our annotation is limited, compared with what is used for English, one of the goals of our work is to quantify the effect of having limited annotation on existing approaches. We evaluate both the MT paradigm, which requires large amounts of annotated learner data, and the classification approaches that can work with any amount of supervision.
