University of Groningen

Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages
Dhar, Prajit; Bisazza, Arianna; van Noord, Gertjan

Published in:
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version
Publisher's PDF, also known as Version of record

Publication date:
2021

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):
Dhar, P., Bisazza, A., & van Noord, G. (2021). Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages. In T. Nakazawa, H. Nakayama, I. Goto, H. Mino, C. Ding, R. Dabre, A. Kunchukuttan, S. Higashiyama, H. Manabe, W. Pa Pa, S. Parida, O. Bojar, C. Chu, A. Eriguchi, K. Abe, Y. Oda, K. Sudoh, S. Kurohashi, & P. Bhattacharyya (Eds.), Proceedings of the 8th Workshop on Asian Translation (WAT2021) (pp. 181-190). Association for Computational Linguistics (ACL).

Copyright
Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons). The publication may also be distributed here under the terms of Article 25fa of the Dutch Copyright Act, indicated by the "Taverne" license. More information can be found on the University of Groningen website: https://www.rug.nl/library/open-access/self-archiving-pure/taverne-amendment.

Take-down policy
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

Download date: 23-09-2022
Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages

Prajit Dhar        Arianna Bisazza        Gertjan van Noord
University of Groningen
{p.dhar, a.bisazza, g.j.m.van.noord}@rug.nl
Abstract

Dravidian languages, such as Kannada and Tamil, are notoriously difficult to translate by state-of-the-art neural models. This stems from the fact that these languages are morphologically very rich as well as being low-resourced. In this paper, we focus on subword segmentation and evaluate Linguistically Motivated Vocabulary Reduction (LMVR) against the more commonly used SentencePiece (SP) for the task of translating from English into four different Dravidian languages. Additionally, we investigate the optimal subword vocabulary size for each language. We find that SP is the overall best choice for segmentation, and that larger subword vocabulary sizes lead to higher translation quality.

1 Introduction

Dravidian languages are an important family of languages spoken by about 250 million people primarily located in Southern India and Sri Lanka (Steever, 2019). Kannada (KN), Malayalam (ML), Tamil (TA) and Telugu (TE) are the four most spoken Dravidian languages, with approximately 47, 34, 71 and 79 million native speakers, respectively. Together, they account for 93% of all Dravidian language speakers. While Kannada, Malayalam and Tamil are classified as South Dravidian languages, Telugu is part of the South-Central Dravidian languages. All four languages are SOV (Subject-Object-Verb) languages with free word order. They are highly agglutinative and inflectionally rich languages. Additionally, each language has a different writing system. Table 1 presents an English sentence example and its Dravidian-language translations.

The highly complex morphology of the Dravidian languages under study is illustrated if we compare translated sentence pairs. The analysis of our parallel datasets (Section 4.1, Table 3) shows, for instance, that an average English sentence contains almost ten times as many words as its Kannada equivalent. For the other three languages, the ratio is a bit smaller but the difference with English remains considerable. This indicates why it is important to consider word segmentation algorithms as part of the translation system.

In this paper we describe our work on Neural Machine Translation (NMT) from English into the Dravidian languages Kannada, Malayalam, Tamil and Telugu. We investigated the optimal translation settings for the pairs and in particular looked at the effect of word segmentation. The aim of the paper is to answer the following research questions:

   • Does LMVR, a linguistically motivated word segmentation algorithm, outperform the purely data-driven SentencePiece?

   • What is the optimal subword dictionary size for translating from English into these Dravidian languages?

In what follows, we review the relevant previous work (Sect. 2), introduce the two segmenters (Sect. 3), describe the experimental setup (Sect. 4), and present our answers to the above research questions (Sect. 5).

2 Previous Work

2.1 Translation Systems

Statistical Machine Translation   One of the earliest automatic translation systems for English into a Dravidian language was the English→Tamil system by Germann (2001). They trained a hybrid rule-based/statistical machine translation system on only 5k English-Tamil parallel
Proceedings of the 8th Workshop on Asian Translation, pages 181-190
Bangkok, Thailand (online), August 5-6, 2021. ©2021 Association for Computational Linguistics
EN   He was born in Thirukkuvalai village in Nagapattinam District on 3rd June, 1924.
KN   avaru nāgapattanam jilleya tirukkuvalay grāmadalli 1924ra jūn 3randu janisiddaru.
ML   1924l nāgapattanam jillayile tirukkuvalai grāmattilān addēham janiccat.
TA   nāgappattinam māvattam tirukkuvalaik kirāmattil avar 1924-ām āntu jūn mātam 3-ām tēti pirantār.
TE   āyana nāgapattanam jillā tirukkuvālai grāmanlō 1924 jūn 3na janmincāru.

[Native-script renderings not recoverable from the PDF text layer; only the romanized transliterations are reproduced.]

Table 1: Example sentence in English along with its translation and transliteration in the four Dravidian languages.
sentences (henceforth referred to as UFAL). They also reported that applying pre-processing steps involving morphological rules based on Tamil suffixes improved the BLEU score of the baseline model to a small extent (from 9.42 to 9.77). On the Indic languages multilingual tasks of WAT-2018, the phrase-based SMT system of Ojha et al. (2018) achieved a BLEU score of 30.53.

Subsequent papers also focused on SMT systems for Malayalam and Telugu, with some notable work including Anto and Nisha (2016) and Sreelekha and Bhattacharyya (2017, 2018) for Malayalam, and Lingam et al. (2014) and Yadav and Lingam (2017) for Telugu.

Neural Machine Translation   On the neural machine translation (NMT) side, there have been a handful of NMT systems trained on English→Tamil. On the aforementioned Indic languages multilingual tasks of WAT-2018, Sen et al. (2018) and Dabre et al. (2018) reported only 11.88 and 18.60 BLEU scores, respectively, for English→Tamil. The poor performance of these systems compared to the 30.53 BLEU score of the SMT system (Ojha et al., 2018) showed that those NMT systems were not yet suitable for translating into the morphologically rich Tamil.

However, the following year, Philip et al. (2019) outperformed Ramasamy et al. (2012) on the UFAL dataset with a BLEU score of 13.05 (the previous best score on this test set was 9.77). They report that techniques such as domain adaptation and back-translation can make training NMT systems on low-resource languages possible. Similar findings were also reported by Ramesh et al. (2020) for Tamil and by Dandapat and Federmann (2018) for Telugu.

To the best of our knowledge and as of 2021, there has not been any scientific publication involving translation to and from Kannada, except for Chakravarthi et al. (2019). One possible reason for this could be the fact that sizeable corpora involving Kannada (i.e. on the order of at least a thousand sentences) have been readily available only since 2019, with the release of the JW300 corpus (Agić and Vulić, 2019).

Multilingual NMT   Since 2018 several studies have presented multilingual NMT systems that can handle English → Malayalam, Tamil and Telugu translation (Dabre et al., 2018; Choudhary et al., 2020; Ojha et al., 2018; Sen et al., 2018; Yu et al., 2020; Dabre and Chakrabarty, 2020). In particular, Sen et al. (2018) presented results where the BLEU score improved when comparing monolingual and multilingual models. Conversely, Yu et al. (2020) found that NMT systems that were multi-way (Indic ↔ Indic) performed worse than English ↔ Indic systems.

To our knowledge, no work so far has explored the effect of the segmentation algorithm and dictionary size on the four languages: Kannada, Malayalam, Tamil and Telugu.

3 Subword Segmentation Techniques

Prior to the emergence of subword segmenters, translation systems were plagued with the issue of
                                       Available in:
Name            Domain      Kannada   Malayalam   Tamil   Telugu
Bible           Religion    18        1                   14
ELRC            COVID-19              <1          <1      <1
GNOME           Technical   <1        <1          <1      <1
JW300           Religion    70        45          52      45
KDE             Technical   1         <1          <1      <1
NLPC            General                           <1
OpenSubtitles   Cinema                26          3       3
CVIT-PIB        Press                 5           10      10
PMIndia         Politics    10        4           3       8
Tanzil          Religion              18          9
Tatoeba         General     <1        <1          <1      <1
Ted2020         General     <1        <1          <1      1
TICO-19         COVID-19                          <1
Ubuntu          Technical   <1        <1          <1      <1
UFAL            Mixed                             11
Wikimatrix      General               <1          10      18
Wikititles      General                           1

Table 2: Composition of training corpora. The numbers indicate the relative size (in percentages) of the corresponding part for that language.
out-of-vocabulary (OOV) tokens. This was particularly an issue for translations involving agglutinative languages such as Turkish (Ataman and Federico, 2018) or Malayalam (Manohar et al., 2020). Various segmentation algorithms were brought forward to circumvent this issue and, in turn, improve translation quality.

Perhaps the most widely used algorithm in NMT to date is the language-agnostic Byte Pair Encoding (BPE) by Sennrich et al. (2016). Initially proposed by Gage (1994), BPE was repurposed by Sennrich et al. (2016) for the task of subword segmentation, and is based on a simple principle whereby pairs of character sequences that are frequently observed in a corpus get merged iteratively until a predetermined dictionary size is attained. In this paper we use a popular implementation of BPE, called SentencePiece (SP) (Kudo and Richardson, 2018).

While purely statistical algorithms are able to segment any token into smaller segments, there is no guarantee that the generated tokens will be linguistically sensible. Unsupervised morphological induction is a rich area of research that also aims at learning a segmentation from data, but in a linguistically motivated way. The most well-known example is Morfessor with its different variants (Creutz and Lagus, 2002; Kohonen et al., 2010; Grönroos et al., 2014). An important obstacle to applying Morfessor to the task of NMT is the lack of a mechanism to determine the dictionary size.

To address this, Ataman et al. (2017) proposed a modification of Morfessor FlatCat (Grönroos et al., 2014), called Linguistically Motivated Vocabulary Reduction (LMVR). Specifically, LMVR imposes an extra condition on the cost function of Morfessor FlatCat so as to favour vocabularies of the desired size. In a comparison of LMVR to BPE, Ataman et al. (2017) reported a +2.3 BLEU improvement on the English-Turkish translation task of WMT18.

Given the encouraging results reported on the agglutinative Turkish language, we hypothesise that translation into Dravidian languages may also benefit from a linguistically motivated segmenter, and evaluate LMVR against SP across varying vocabulary sizes.

4 Experimental Setup

4.1 Training Corpora

The parallel training data is mostly taken from the datasets available for the MultiIndicMT task of WAT 2021. If a certain dataset was not available from the MultiIndicMT training repository, we resorted to extracting it from OPUS (Tiedemann, 2012) or WMT20. Table 2 reports the datasets that we used along with their domain and their source.

After extracting and cleaning the data (see below), approximately 8 million English tokens and their corresponding target language tokens are selected as our training corpora. We fixed the number of source tokens across language pairs in order
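The BPE principle described in Section 3 (iteratively merging the most frequent adjacent symbol pair until a target dictionary size is reached) can be sketched in a few lines of Python. This is a minimal illustration of the algorithm of Sennrich et al. (2016), not the SentencePiece implementation used in the paper; the toy corpus and the number of merges are invented for the example.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with a single merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily learn `num_merges` BPE merge operations from a frequency dict."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return vocab, merges

# Toy corpus: word -> frequency, with words pre-split into characters and an
# end-of-word marker </w>, following Sennrich et al. (2016).
corpus = {'l o w </w>': 5, 'l o w e r </w>': 2,
          'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
vocab, merges = learn_bpe(corpus, 4)
```

On this toy corpus the first merges are ('e', 's'), ('es', 't') and ('est', '</w>'), so the frequent suffix "est" ends up as a single subword, illustrating how frequent character sequences become vocabulary entries.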
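In SentencePiece, the dictionary size studied in this paper is exposed directly as a training parameter. The paper does not list its exact commands; the invocation below is a hypothetical sketch (the file names, the 16k vocabulary size and the character coverage are illustrative, not the paper's reported setup):

```shell
# Train a BPE SentencePiece model on the target side of the parallel data.
spm_train --input=train.kn \
          --model_prefix=sp_kn_16k \
          --vocab_size=16000 \
          --model_type=bpe \
          --character_coverage=1.0

# Segment the training text into subword pieces with the learned model.
spm_encode --model=sp_kn_16k.model --output_format=piece < train.kn > train.sp.kn
```

Sweeping --vocab_size over several values and retraining the NMT system on each segmented corpus is one way to search for the optimal subword dictionary size per language.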