Czech Grammar Error Correction with a Large and Diverse Corpus

Jakub Náplava, Milan Straka, Jana Straková
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics, Czech Republic
{naplava,straka,strakova}@ufal.mff.cuni.cz

Alexandr Rosen
Charles University, Faculty of Arts
Institute of Theoretical and Computational Linguistics, Czech Republic
alexandr.rosen@ff.cuni.cz

Abstract

We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline to future research. Finally, we meta-evaluate common GEC metrics against human judgments on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at http://hdl.handle.net/11234/1-4639.

1   Introduction

Representative data both in terms of size and domain coverage are vital for NLP systems development. However, in the field of grammar error correction (GEC), most GEC corpora are limited to corrections of mistakes made by foreign or second language learners even in the case of English (Tajiri et al., 2012; Dahlmeier et al., 2013; Yannakoudakis et al., 2011, 2018; Ng et al., 2014; Napoles et al., 2017). At the same time, as recently pointed out by Flachs et al. (2020), learner corpora are only a part of the full spectrum of GEC applications. To alleviate the skewed perspective, the authors released a corpus of website texts.

Despite recent efforts aimed to mitigate the notorious shortage of national GEC-annotated corpora (Boyd, 2018; Rozovskaya and Roth, 2019; Davidson et al., 2020; Syvokon and Nahorna, 2021; Cotet et al., 2020; Náplava and Straka, 2019), the lack of adequate data is even more acute in languages other than English. We aim to address both the issue of scarcity of non-English data and the ubiquitous need for broad domain coverage by presenting a new, large and diverse Czech corpus, expertly annotated for GEC.

Grammar Error Correction Corpus for Czech (GECCC) includes texts from multiple domains in a total of 83058 sentences, being, to our knowledge, the largest non-English GEC corpus, as well as being one of the largest GEC corpora overall.

In order to represent a diversity of writing styles and origins, besides essays of both native and non-native speakers from Czech learner corpora, we also scraped website texts to complement the learner domain with supposedly lower error density texts, encompassing a representation of the following four domains:

  • Native Formal – essays written by native students of elementary and secondary schools
  • Native Web Informal – informal website discussions
  • Romani – essays written by children and teenagers of the Romani ethnic minority
  • Second Learners – essays written by non-native learners

Using the presented data, we compare several state-of-the-art Czech GEC systems, including some Transformer-based.

Finally, we conduct a meta-evaluation of GEC metrics against human judgments to select the most appropriate metric for evaluating corrections on the new dataset. The analysis is performed across domains, in line with Napoles et al. (2019).

Transactions of the Association for Computational Linguistics, vol. 10, pp. 452–467, 2022. https://doi.org/10.1162/tacl_a_00470
Action Editor: Alice Oh. Submission batch: 6/2021; Revision batch: 11/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Language   Corpus            Sentences  Err. r.  Domain                                        # Refs.
English    Lang-8              1147451    14.1%  SL                                            1
English    NUCLE                 57151     6.6%  SL                                            1
English    FCE                   33236    11.5%  SL                                            1
English    W&I+LOCNESS           43169    11.8%  SL, native students                           5
English    CoNLL-2014 test        1312     8.2%  SL                                            2, 10, 8
English    JFLEG                  1511        -  SL                                            4
English    GMEG                   6000        -  web, formal articles, SL                      4
English    AESW                over 1M        -  scientific writing                            1
English    CWEB                  13574      ~2%  web                                           2
Czech      AKCES-GEC             47371    21.4%  SL essays, Romani ethnolect of Czech          2
German     Falko-MERLIN          24077    16.8%  SL essays                                     1
Russian    RULEC-GEC             12480     6.4%  SL, heritage speakers                         1
Spanish    COWS-L2H              12336        -  SL, heritage speakers                         2
Ukrainian  UA-GEC                20715     7.1%  natives/SL, translations and personal texts   2
Romanian   RONACC                10119        -  native speakers transcriptions                1

Table 1: Comparison of GEC corpora in size, token error rate, domain, and number of reference annotations in the test portion. SL = second language learners. A hyphen marks a value that is not reported.

Our contributions include (i) a large and diverse Czech GEC corpus, covering learner corpora and website texts, with unified and, in some domains, completely new GEC annotations, (ii) a comparison of Czech GEC systems, and (iii) a meta-evaluation of common GEC metrics against human judgment on the released corpus.

2   Related Work

2.1   Grammar Error Correction Corpora

Until recently, attention has been focused mostly on English, while GEC data resources for other languages were in short supply. Here we list a few examples of English GEC corpora, collected mostly within an English-as-a-second-language (ESL) paradigm. For a comparison of their relevant statistics see Table 1.

Lang-8 Corpus of Learner English (Tajiri et al., 2012) is a corpus of English language learner texts from the Lang-8 social networking system.

NUCLE (Dahlmeier et al., 2013) consists of essays written by undergraduate students of the National University of Singapore.

FCE (Yannakoudakis et al., 2011) includes short essays written by non-native learners for the Cambridge ESOL First Certificate in English.

W&I+LOCNESS is a union of two datasets, the W&I (Write & Improve) dataset (Yannakoudakis et al., 2018) of non-native learners' essays, complemented by the LOCNESS corpus (Granger, 1998), a collection of essays written by native English students.

The GEC error annotations for the learner corpora above were distributed with the BEA-2019 Shared Task on Grammatical Error Correction (Bryant et al., 2019).

The CoNLL-2014 shared task test set (Ng et al., 2014) is often used for GEC systems evaluation. This small corpus consists of 50 essays written by 25 South-East Asian undergraduates.

JFLEG (Napoles et al., 2017) is another frequently used GEC corpus with fluency edits in addition to usual grammatical edits.

To broaden the restricted variety of domains, focused primarily on learner essays, a CWEB collection (Flachs et al., 2020) of website texts was recently released, aiming at contributing lower error density data.

AESW (Daudaravicius et al., 2016) is a large corpus of scientific writing (over 1M sentences), edited by professional editors.

Finally, Napoles et al. (2019) recently released GMEG, a corpus for the evaluation of GEC metrics across domains.

Grammatical error correction corpora for languages other than English are less common and, if available, usually limited in size and domain: German Falko-MERLIN (Boyd, 2018), Russian RULEC-GEC (Rozovskaya and Roth, 2019),
Spanish COWS-L2H (Davidson et al., 2020), Ukrainian UA-GEC (Syvokon and Nahorna, 2021), and Romanian RONACC (Cotet et al., 2020).

To better account for multiple correction options, datasets often contain several reference sentences for each original noisy sentence in the test set, proposed by multiple annotators. As we can see in Table 1, the number of annotations typically ranges between 1 and 5 with an exception of the CoNLL14 test set, which, on top of the official 2 reference corrections, later received 10 annotations from Bryant and Ng (2015) and 8 alternative annotations from Sakaguchi et al. (2016).

2.2   Czech Learner Corpora

By the early 2010s, Czech was one of a few languages other than English to boast a series of learner corpora, compiled under the umbrella project AKCES, evoking the concept of acquisition corpora (Šebesta, 2010).

The native section includes transcripts of hand-written essays (SKRIPT 2012) and classroom conversation (SCHOLA 2010) from elementary and secondary schools. Both have their counterparts documenting the Roma ethnolect of Czech:¹ essays (ROMi 2013) and recordings and transcripts of dialogues (ROMi 1.0).²

The non-native section goes by the name of CzeSL, the acronym of Czech as the Second Language. CzeSL consists of transcripts of short hand-written essays collected from non-native learners with various levels of proficiency and native languages, mostly students attending Czech language courses before or during their studies at a Czech university. There are several releases of CzeSL, which differ mainly to what extent and how the texts are annotated (Rosen et al., 2020).³

More recently, hand-written essays have been transcribed and annotated in TEITOK (Janssen, 2016),⁴ a tool combining a number of corpus compilation, annotation and exploitation functionalities.

Learner Czech is also represented in MERLIN, a multilingual (German, Italian, and Czech) corpus built in 2012–2014 from texts submitted as a part of tests for language proficiency levels (Boyd et al., 2014).⁵

Finally, AKCES-GEC (Náplava and Straka, 2019) is a GEC corpus for Czech created from the subset of the above mentioned AKCES resources (Šebesta, 2010): the CzeSL-man corpus (non-native Czech learners with manual annotation) and a part of the ROMi corpus (speakers of the Romani ethnolect).

Compared to the AKCES-GEC, the new GECCC corpus contains much more data (47371 sentences vs. 83058 sentences, respectively), by extending data in the existing domains and also adding two new domains: essays written by native learners and website texts, making it the largest non-English GEC corpus and one of the largest GEC corpora overall.

3   Annotation

3.1   Data Selection

We draw the original uncorrected data from the following Czech learner corpora or Czech websites:

  • Native Formal – essays written by native students of elementary and secondary schools from the SKRIPT 2012 learner corpus, compiled in the AKCES project
  • Native Web Informal – newly annotated informal website discussions from Czech Facebook Dataset (Habernal et al., 2013a,b) and Czech news site novinky.cz
  • Romani – essays written by children and teenagers of the Romani ethnic minority from the ROMi corpus of the AKCES project and the ROMi section of the AKCES-GEC corpus
  • Second Learners – essays written by non-native learners, from the Foreigners section of the AKCES-GEC corpus, and the MERLIN corpus

Since we draw our data from several Czech corpora originally created in different tools with different annotation schemes and instructions, we re-annotated the errors in a unified manner for the entire development and test set and partially also for the training set.

The data split was carefully designed to maintain representativeness, coverage and backwards compatibility. Specifically, (i) test and development data contain roughly the same amount of annotated data from all domains, (ii) original AKCES-GEC dataset splits remain unchanged, and (iii) additional available detailed annotations such as user proficiency level in MERLIN were leveraged to support the split balance. Overall, the main objective was to achieve a representative cover over development and testing data. Table 2 presents the sizes of data resources in the number of documents. The first column (Documents) shows the number of all available documents collected in an initial scan. The second column (Selected) is a selected subset from the available documents, due to budgetary constraints and to achieve a representative sample over all domains and data portions. The relatively higher number of documents selected for the Native Web Informal domain is due to its substantially shorter texts, yielding fewer sentences; also, we needed to populate this part of the corpus as a completely new domain with no previously annotated data.

To achieve more fine-grained balancing of the splits, we used additional metadata where available: users' proficiency levels and origin language from MERLIN and the age group from AKCES.

Dataset           Documents  Selected
AKCES-GEC-test          188       188
AKCES-GEC-dev           195       195
MERLIN                  441       385
Novinky.cz                -      2695
Facebook              10000      3850
SKRIPT 2012             394       167
ROMi                   1529       218

Table 2: Data resources for the new Czech GEC corpus. The second column (Selected) shows the size of the selected subset from all available documents (first column, Documents).

3.2   Preprocessing

De/tokenization is an important part of data preprocessing in grammar error correction. Some formats, such as the M2 format (Dahlmeier and Ng, 2012), require tokenized formats to track and evaluate correction edits. On the other hand, detokenized text in its natural form is required for other applications. We therefore release our corpus in two formats: a tokenized M2 format and detokenized format aligned at sentence, paragraph, and document level. As part of our data is drawn from earlier, tokenized GEC corpora AKCES-GEC and MERLIN, this data had to be detokenized. A slightly modified Moses detokenizer⁶ is attached to the corpus. To tokenize the data for the M2 format, we use the UDPipe tokenizer (Straka et al., 2016).

3.3   Annotation

The test and development sets in all domains were annotated from scratch by five in-house expert annotators,⁷ including re-annotations of the development and test data of the earlier GEC corpora to achieve a unified annotation style. All the test sentences were annotated by two annotators; one half of the development sentences received two annotations and the second half one annotation. The annotation process took about 350 hours in total.

The annotation instructions were unified across all domains: The corrected text must not contain any grammatical or spelling errors and should sound fluent. Fluency edits are allowed if the original is incoherent. The entire document was given as a context for the annotation. Annotators were instructed to remove documents that were too incomprehensible or those containing private information.

To keep the annotation process simple for the annotators, the sentences were annotated (corrected) in a text editor and postprocessed automatically to retrieve and categorize the GEC edits

¹ The Romani ethnolect of Czech is the result of contact with Romani as the linguistic substrate. To a lesser (and weakening) extent the ethnolect shows some influence of Slovak or even Hungarian, because most of its speakers have roots in Slovakia. The ethnolect can exhibit various specifics across all linguistic levels. However, nearly all of them are complementary with their colloquial or standard Czech counterparts. A short written text, devoid of phonological properties, may be hard to distinguish from texts written by learners without the Romani background. The only striking exception are misspellings in contexts where the latter benefit from more exposure to written Czech. The typical example is the omission of word boundaries within phonological words, e.g., between a clitic and its host. In other respects, the pattern of error distribution in texts produced by ethnolect speakers is closer to native rather than foreign learners (Bořkovcová, 2007, 2017).
² A more recent release SKRIPT 2015 includes a balanced mix of essays from SKRIPT 2012 and ROMi 2013. For more details and links see http://utkl.ff.cuni.cz/akces/.
³ For a list of CzeSL corpora with their sizes and annotation details see http://utkl.ff.cuni.cz/learncorp/.
⁴ http://www.teitok.org.
⁵ https://www.merlin-platform.eu.
⁶ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl.
⁷ Our annotators are senior undergraduate students of humanities, regularly employed for various annotation efforts at our institute.
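
To illustrate why the tokenized M2 format of Section 3.2 depends on stable token offsets, the following is a minimal Python sketch of reading M2 text and applying one annotator's edits. It assumes the commonly used layout of "S" lines for tokenized source sentences and "A" lines of the form "start end|||type|||correction|||REQUIRED|||-NONE-|||annotator". The names Edit, Sentence, parse_m2 and apply_edits, the sample sentences, and the error type label are hypothetical and are not taken from the released corpus or its tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Edit:
    start: int        # index of the first source token covered by the edit
    end: int          # index one past the last covered source token
    error_type: str
    correction: str   # empty string means the span is deleted
    annotator: int


@dataclass
class Sentence:
    tokens: List[str]
    edits: List[Edit]


def parse_m2(text: str) -> List[Sentence]:
    """Parse M2-formatted text; blocks are separated by blank lines."""
    sentences = []
    for block in text.strip().split("\n\n"):
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                fields = line[2:].split("|||")
                start, end = map(int, fields[0].split())
                if start < 0:
                    # "A -1 -1|||noop|||..." marks a sentence left unchanged.
                    continue
                correction = "" if fields[2] == "-NONE-" else fields[2]
                edits.append(Edit(start, end, fields[1], correction, int(fields[-1])))
        if tokens:
            sentences.append(Sentence(tokens, edits))
    return sentences


def apply_edits(sentence: Sentence, annotator: int = 0) -> List[str]:
    """Return the corrected tokens according to one annotator."""
    tokens = list(sentence.tokens)
    chosen = [e for e in sentence.edits if e.annotator == annotator]
    # Apply from right to left so that earlier offsets stay valid.
    for e in sorted(chosen, key=lambda e: (e.start, e.end), reverse=True):
        tokens[e.start:e.end] = e.correction.split()
    return tokens


# Illustrative input only; the error type label is made up for the example.
SAMPLE = """S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0

S All good here .
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
"""

if __name__ == "__main__":
    for sent in parse_m2(SAMPLE):
        print(" ".join(apply_edits(sent)))
```

Because every edit is addressed by token indices, the same tokenizer has to be used when producing and when scoring M2 files, which is consistent with the corpus being released in both a tokenized and a detokenized form.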
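
Section 3.3 describes annotators correcting sentences in a plain text editor, with the edits recovered automatically afterwards. This section does not name the tool used for that step, so the sketch below is only a stand-in: it aligns the original and corrected token sequences with Python's standard difflib.SequenceMatcher and labels each difference with a coarse operation type. The names extract_edits and classify, the three-way classification, and the example sentences are assumptions made for illustration; a real GEC pipeline would use a linguistically informed aligner and a much richer error taxonomy.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (start, end, replacement tokens); start and end index the original tokens.
TokenEdit = Tuple[int, int, List[str]]


def extract_edits(original: List[str], corrected: List[str]) -> List[TokenEdit]:
    """Recover token-level edits from an original/corrected sentence pair."""
    matcher = SequenceMatcher(a=original, b=corrected, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((i1, i2, corrected[j1:j2]))
    return edits


def classify(edit: TokenEdit) -> str:
    """Coarse operation type; a placeholder for real error categorization."""
    start, end, replacement = edit
    if start == end:
        return "insertion"
    if not replacement:
        return "deletion"
    return "replacement"


if __name__ == "__main__":
    original = "Včera jsem šel do škola".split()
    corrected = "Včera jsem šel do školy".split()
    for edit in extract_edits(original, corrected):
        print(edit, classify(edit))
```

The recovered (start, end, replacement) triples map directly onto the span-based edit lines of the M2 format, which is why annotation in a text editor and an edit-based evaluation format can coexist.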