Czech Grammar Error Correction with a Large and Diverse Corpus

Jakub Náplava, Milan Straka, Jana Straková
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Czech Republic
{naplava,straka,strakova}@ufal.mff.cuni.cz

Alexandr Rosen
Charles University, Faculty of Arts, Institute of Theoretical and Computational Linguistics, Czech Republic
alexandr.rosen@ff.cuni.cz

Transactions of the Association for Computational Linguistics, vol. 10, pp. 452–467, 2022. https://doi.org/10.1162/tacl_a_00470
Action Editor: Alice Oh. Submission batch: 6/2021; Revision batch: 11/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline to future research. Finally, we meta-evaluate common GEC metrics against human judgments on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at http://hdl.handle.net/11234/1-4639.

1 Introduction

Representative data both in terms of size and domain coverage are vital for NLP systems development. However, in the field of grammar error correction (GEC), most GEC corpora are limited to corrections of mistakes made by foreign or second language learners even in the case of English (Tajiri et al., 2012; Dahlmeier et al., 2013; Yannakoudakis et al., 2011, 2018; Ng et al., 2014; Napoles et al., 2017). At the same time, as recently pointed out by Flachs et al. (2020), learner corpora are only a part of the full spectrum of GEC applications. To alleviate the skewed perspective, the authors released a corpus of website texts.

Despite recent efforts aimed to mitigate the notorious shortage of national GEC-annotated corpora (Boyd, 2018; Rozovskaya and Roth, 2019; Davidson et al., 2020; Syvokon and Nahorna, 2021; Cotet et al., 2020; Náplava and Straka, 2019), the lack of adequate data is even more acute in languages other than English. We aim to address both the issue of scarcity of non-English data and the ubiquitous need for broad domain coverage by presenting a new, large and diverse Czech corpus, expertly annotated for GEC.

Grammar Error Correction Corpus for Czech (GECCC) includes texts from multiple domains in a total of 83058 sentences, being, to our knowledge, the largest non-English GEC corpus, as well as being one of the largest GEC corpora overall.

In order to represent a diversity of writing styles and origins, besides essays of both native and non-native speakers from Czech learner corpora, we also scraped website texts to complement the learner domain with supposedly lower error density texts, encompassing a representation of the following four domains:

• Native Formal – essays written by native students of elementary and secondary schools
• Native Web Informal – informal website discussions
• Romani – essays written by children and teenagers of the Romani ethnic minority
• Second Learners – essays written by non-native learners

Using the presented data, we compare several state-of-the-art Czech GEC systems, including some Transformer-based.

Finally, we conduct a meta-evaluation of GEC metrics against human judgments to select the most appropriate metric for evaluating corrections on the new dataset. The analysis is performed across domains, in line with Napoles et al. (2019).

Our contributions include (i) a large and diverse Czech GEC corpus, covering learner corpora and website texts, with unified and, in some domains, completely new GEC annotations, (ii) a comparison of Czech GEC systems, and (iii) a meta-evaluation of common GEC metrics against human judgment on the released corpus.

2 Related Work

2.1 Grammar Error Correction Corpora

Until recently, attention has been focused mostly on English, while GEC data resources for other languages were in short supply. Here we list a few examples of English GEC corpora, collected mostly within an English-as-a-second-language (ESL) paradigm. For a comparison of their relevant statistics see Table 1.

Language  | Corpus          | Sentences | Err. r. | Domain                                      | # Refs.
English   | Lang-8          | 1147451   | 14.1%   | SL                                          | 1
English   | NUCLE           | 57151     | 6.6%    | SL                                          | 1
English   | FCE             | 33236     | 11.5%   | SL                                          | 1
English   | W&I+LOCNESS     | 43169     | 11.8%   | SL, native students                         | 5
English   | CoNLL-2014 test | 1312      | 8.2%    | SL                                          | 2,10,8
English   | JFLEG           | 1511      | -       | SL                                          | 4
English   | GMEG            | 6000      | -       | web, formal articles, SL                    | 4
English   | AESW            | over 1M   | -       | scientific writing                          | 1
English   | CWEB            | 13574     | ∼2%     | web                                         | 2
Czech     | AKCES-GEC       | 47371     | 21.4%   | SL essays, Romani ethnolect of Czech        | 2
German    | Falko-MERLIN    | 24077     | 16.8%   | SL essays                                   | 1
Russian   | RULEC-GEC       | 12480     | 6.4%    | SL, heritage speakers                       | 1
Spanish   | COWS-L2H        | 12336     | -       | SL, heritage speakers                       | 2
Ukrainian | UA-GEC          | 20715     | 7.1%    | natives/SL, translations and personal texts | 2
Romanian  | RONACC          | 10119     | -       | native speakers transcriptions              | 1

Table 1: Comparison of GEC corpora in size, token error rate, domain, and number of reference annotations in the test portion. SL = second language learners.

Lang-8 Corpus of Learner English (Tajiri et al., 2012) is a corpus of English language learner texts from the Lang-8 social networking system.

NUCLE (Dahlmeier et al., 2013) consists of essays written by undergraduate students of the National University of Singapore.

FCE (Yannakoudakis et al., 2011) includes short essays written by non-native learners for the Cambridge ESOL First Certificate in English.

W&I+LOCNESS is a union of two datasets, the W&I (Write & Improve) dataset (Yannakoudakis et al., 2018) of non-native learners' essays, complemented by the LOCNESS corpus (Granger, 1998), a collection of essays written by native English students.

The GEC error annotations for the learner corpora above were distributed with the BEA-2019 Shared Task on Grammatical Error Correction (Bryant et al., 2019).

The CoNLL-2014 shared task test set (Ng et al., 2014) is often used for GEC systems evaluation. This small corpus consists of 50 essays written by 25 South-East Asian undergraduates.

JFLEG (Napoles et al., 2017) is another frequently used GEC corpus with fluency edits in addition to usual grammatical edits.

To broaden the restricted variety of domains, focused primarily on learner essays, a CWEB collection (Flachs et al., 2020) of website texts was recently released, aiming at contributing lower error density data.

AESW (Daudaravicius et al., 2016) is a large corpus of scientific writing (over 1M sentences), edited by professional editors.

Finally, Napoles et al. (2019) recently released GMEG, a corpus for the evaluation of GEC metrics across domains.

Grammatical error correction corpora for languages other than English are less common and, if available, usually limited in size and domain: German Falko-MERLIN (Boyd, 2018), Russian RULEC-GEC (Rozovskaya and Roth, 2019), Spanish COWS-L2H (Davidson et al., 2020), Ukrainian UA-GEC (Syvokon and Nahorna, 2021), and Romanian RONACC (Cotet et al., 2020).

To better account for multiple correction options, datasets often contain several reference sentences for each original noisy sentence in the test set, proposed by multiple annotators. As we can see in Table 1, the number of annotations typically ranges between 1 and 5 with an exception of the CoNLL14 test set, which, on top of the official 2 reference corrections, later received 10 annotations from Bryant and Ng (2015) and 8 alternative annotations from Sakaguchi et al. (2016).

2.2 Czech Learner Corpora

By the early 2010s, Czech was one of a few languages other than English to boast a series of learner corpora, compiled under the umbrella project AKCES, evoking the concept of acquisition corpora (Šebesta, 2010).

The native section includes transcripts of hand-written essays (SKRIPT 2012) and classroom conversation (SCHOLA 2010) from elementary and secondary schools. Both have their counterparts documenting the Roma ethnolect of Czech:[1] essays (ROMi 2013) and recordings and transcripts of dialogues (ROMi 1.0).[2]

[1] The Romani ethnolect of Czech is the result of contact with Romani as the linguistic substrate. To a lesser (and weakening) extent the ethnolect shows some influence of Slovak or even Hungarian, because most of its speakers have roots in Slovakia. The ethnolect can exhibit various specifics across all linguistic levels. However, nearly all of them are complementary with their colloquial or standard Czech counterparts. A short written text, devoid of phonological properties, may be hard to distinguish from texts written by learners without the Romani background. The only striking exception are misspellings in contexts where the latter benefit from more exposure to written Czech. The typical example is the omission of word boundaries within phonological words, e.g., between a clitic and its host. In other respects, the pattern of error distribution in texts produced by ethnolect speakers is closer to native rather than foreign learners (Bořkovcová, 2007, 2017).

[2] A more recent release SKRIPT 2015 includes a balanced mix of essays from SKRIPT 2012 and ROMi 2013. For more details and links see http://utkl.ff.cuni.cz/akces/.

The non-native section goes by the name of CzeSL, the acronym of Czech as the Second Language. CzeSL consists of transcripts of short hand-written essays collected from non-native learners with various levels of proficiency and native languages, mostly students attending Czech language courses before or during their studies at a Czech university. There are several releases of CzeSL, which differ mainly to what extent and how the texts are annotated (Rosen et al., 2020).[3]

[3] For a list of CzeSL corpora with their sizes and annotation details see http://utkl.ff.cuni.cz/learncorp/.

More recently, hand-written essays have been transcribed and annotated in TEITOK (Janssen, 2016),[4] a tool combining a number of corpus compilation, annotation and exploitation functionalities.

[4] http://www.teitok.org.

Learner Czech is also represented in MERLIN, a multilingual (German, Italian, and Czech) corpus built in 2012–2014 from texts submitted as a part of tests for language proficiency levels (Boyd et al., 2014).[5]

[5] https://www.merlin-platform.eu.

Finally, AKCES-GEC (Náplava and Straka, 2019) is a GEC corpus for Czech created from the subset of the above mentioned AKCES resources (Šebesta, 2010): the CzeSL-man corpus (non-native Czech learners with manual annotation) and a part of the ROMi corpus (speakers of the Romani ethnolect).

Compared to the AKCES-GEC, the new GECCC corpus contains much more data (47371 sentences vs. 83058 sentences, respectively), by extending data in the existing domains and also adding two new domains: essays written by native learners and website texts, making it the largest non-English GEC corpus and one of the largest GEC corpora overall.

3 Annotation

3.1 Data Selection

We draw the original uncorrected data from the following Czech learner corpora or Czech websites:

• Native Formal – essays written by native students of elementary and secondary schools from the SKRIPT 2012 learner corpus, compiled in the AKCES project
• Native Web Informal – newly annotated informal website discussions from Czech Facebook Dataset (Habernal et al., 2013a,b) and Czech news site novinky.cz
• Romani – essays written by children and teenagers of the Romani ethnic minority from the ROMi corpus of the AKCES project and the ROMi section of the AKCES-GEC corpus
• Second Learners – essays written by non-native learners, from the Foreigners section of the AKCES-GEC corpus, and the MERLIN corpus

Since we draw our data from several Czech corpora originally created in different tools with different annotation schemes and instructions, we re-annotated the errors in a unified manner for the entire development and test set and partially also for the training set.

The data split was carefully designed to maintain representativeness, coverage and backwards compatibility. Specifically, (i) test and development data contain roughly the same amount of annotated data from all domains, (ii) original AKCES-GEC dataset splits remain unchanged, and (iii) additional available detailed annotations such as user proficiency level in MERLIN were leveraged to support the split balance. Overall, the main objective was to achieve a representative cover over development and testing data. Table 2 presents the sizes of data resources in the number of documents. The first column (Documents) shows the number of all available documents collected in an initial scan. The second column (Selected) is a selected subset from the available documents, due to budgetary constraints and to achieve a representative sample over all domains and data portions. The relatively higher number of documents selected for the Native Web Informal domain is due to its substantially shorter texts, yielding fewer sentences; also, we needed to populate this part of the corpus as a completely new domain with no previously annotated data.

Dataset        | Documents | Selected
AKCES-GEC-test | 188       | 188
AKCES-GEC-dev  | 195       | 195
MERLIN         | 441       | 385
Novinky.cz     | -         | 2695
Facebook       | 10000     | 3850
SKRIPT 2012    | 394       | 167
ROMi           | 1529      | 218

Table 2: Data resources for the new Czech GEC corpus. The second column (Selected) shows the size of the selected subset from all available documents (first column, Documents).

To achieve more fine-grained balancing of the splits, we used additional metadata where available: users' proficiency levels and origin language from MERLIN and the age group from AKCES.

3.2 Preprocessing

De/tokenization is an important part of data preprocessing in grammar error correction. Some formats, such as the M2 format (Dahlmeier and Ng, 2012), require tokenized formats to track and evaluate correction edits. On the other hand, detokenized text in its natural form is required for other applications. We therefore release our corpus in two formats: a tokenized M2 format and detokenized format aligned at sentence, paragraph, and document level. As part of our data is drawn from earlier, tokenized GEC corpora AKCES-GEC and MERLIN, this data had to be detokenized. A slightly modified Moses detokenizer[6] is attached to the corpus. To tokenize the data for the M2 format, we use the UDPipe tokenizer (Straka et al., 2016).

[6] https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl.

3.3 Annotation

The test and development sets in all domains were annotated from scratch by five in-house expert annotators,[7] including re-annotations of the development and test data of the earlier GEC corpora to achieve a unified annotation style. All the test sentences were annotated by two annotators; one half of the development sentences received two annotations and the second half one annotation. The annotation process took about 350 hours in total.

[7] Our annotators are senior undergraduate students of humanities, regularly employed for various annotation efforts at our institute.

The annotation instructions were unified across all domains: The corrected text must not contain any grammatical or spelling errors and should sound fluent. Fluency edits are allowed if the original is incoherent. The entire document was given as a context for the annotation. Annotators were instructed to remove documents that were too incomprehensible or those containing private information.

To keep the annotation process simple for the annotators, the sentences were annotated (corrected) in a text editor and postprocessed automatically to retrieve and categorize the GEC edits