University of Groningen

Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages
Dhar, Prajit; Bisazza, Arianna; van Noord, Gertjan

Published in: Proceedings of the 8th Workshop on Asian Translation (WAT2021)
Document Version: Publisher's PDF, also known as Version of record
Publication date: 2021

Citation for published version (APA):
Dhar, P., Bisazza, A., & van Noord, G. (2021). Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages. In T. Nakazawa, H. Nakayama, I. Goto, H. Mino, C. Ding, R. Dabre, A. Kunchukuttan, S. Higashiyama, H. Manabe, W. Pa Pa, S. Parida, O. Bojar, C. Chu, A. Eriguchi, K. Abe, Y. Oda, K. Sudoh, S. Kurohashi, & P. Bhattacharyya (Eds.), Proceedings of the 8th Workshop on Asian Translation (WAT2021) (pp. 181-190). Association for Computational Linguistics (ACL).

Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages

Prajit Dhar    Arianna Bisazza    Gertjan van Noord
University of Groningen
{p.dhar, a.bisazza, g.j.m.van.noord}@rug.nl

Abstract

Dravidian languages, such as Kannada and Tamil, are notoriously difficult to translate by state-of-the-art neural models. This stems from the fact that these languages are morphologically very rich as well as being low-resourced. In this paper, we focus on subword segmentation and evaluate Linguistically Motivated Vocabulary Reduction (LMVR) against the more commonly used SentencePiece (SP) for the task of translating from English into four different Dravidian languages. Additionally, we investigate the optimal subword vocabulary size for each language. We find that SP is the overall best choice for segmentation, and that larger subword vocabulary sizes lead to higher translation quality.

1 Introduction

Dravidian languages are an important family of languages spoken by about 250 million people, primarily located in Southern India and Sri Lanka (Steever, 2019). Kannada (KN), Malayalam (ML), Tamil (TA) and Telugu (TE) are the four most spoken Dravidian languages, with approximately 47, 34, 71 and 79 million native speakers, respectively. Together, they account for 93% of all Dravidian language speakers. While Kannada, Malayalam and Tamil are classified as South Dravidian languages, Telugu is part of the South-Central Dravidian branch. All four languages are SOV (Subject-Object-Verb) languages with free word order. They are highly agglutinative and inflectionally rich. Additionally, each language has a different writing system. Table 1 presents an English sentence example and its Dravidian-language translations.

EN  He was born in Thirukkuvalai village in Nagapattinam District on 3rd June, 1924.
KN  avaru nāgapattanam jilleya tirukkuvalay grāmadalli 1924ra jūn 3randu janisiddaru.
ML  1924l nāgapattanam jillayile tirukkuvalai grāmattilān addēham janiccat.
TA  nāgappattinam māvattam tirukkuvalaik kirāmattil avar 1924-ām āntu jūn mātam 3-ām tēti pirantār.
TE  āyana nāgapattanam jillā tirukkuvālai grāmanlō 1924 jūn 3na janmincāru.

Table 1: Example sentence in English along with its translation and transliteration in the four Dravidian languages. [Only the romanized transliterations survived text extraction; the original-script lines are omitted.]

The highly complex morphology of the Dravidian languages under study is illustrated when we compare translated sentence pairs. The analysis of our parallel datasets (Section 4.1, Table 3) shows, for instance, that an average English sentence contains almost ten times as many words as its Kannada equivalent. For the other three languages, the ratio is somewhat smaller, but the difference with English remains considerable. This indicates why it is important to consider word segmentation algorithms as part of the translation system.
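Such a ratio can be estimated directly from a sentence-aligned corpus. The sketch below is a minimal illustration, assuming whitespace-tokenized, line-aligned plain-text files; the filenames are hypothetical and the paper's exact counting procedure may differ.

```python
# Minimal sketch: average words-per-sentence ratio between two
# sentence-aligned files. Assumes whitespace tokenization and one
# sentence per line; filenames are hypothetical.

def avg_words_per_sentence(path):
    total_words, total_sents = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:  # skip empty lines
                total_words += len(tokens)
                total_sents += 1
    return total_words / total_sents

en = avg_words_per_sentence("train.en")
kn = avg_words_per_sentence("train.kn")
print(f"EN/KN word ratio: {en / kn:.2f}")  # close to 10 on the data described here
```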
In this paper we describe our work on Neural Machine Translation (NMT) from English into the Dravidian languages Kannada, Malayalam, Tamil and Telugu. We investigated the optimal translation settings for the pairs and in particular looked at the effect of word segmentation. The aim of the paper is to answer the following research questions:

• Does LMVR, a linguistically motivated word segmentation algorithm, outperform the purely data-driven SentencePiece?

• What is the optimal subword dictionary size for translating from English into these Dravidian languages?

In what follows, we review the relevant previous work (Sect. 2), introduce the two segmenters (Sect. 3), describe the experimental setup (Sect. 4), and present our answers to the above research questions (Sect. 5).

2 Previous Work

2.1 Translation Systems

Statistical Machine Translation  One of the earliest automatic translation systems for English into a Dravidian language was the English→Tamil system by Germann (2001). They trained a hybrid rule-based/statistical machine translation system on only 5k English-Tamil parallel sentences. Ramasamy et al. (2012) created SMT systems (phrase-based and hierarchical) which were trained on a dataset of 190k parallel sentences (henceforth referred to as UFAL). They also reported that applying pre-processing steps involving morphological rules based on Tamil suffixes improved the BLEU score of the baseline model to a small extent (from 9.42 to 9.77). For the Indic languages multilingual tasks of WAT-2018, the phrase-based SMT system of Ojha et al. (2018) achieved a BLEU score of 30.53. Subsequent papers also focused on SMT systems for Malayalam and Telugu, with some notable work including Anto and Nisha (2016) and Sreelekha and Bhattacharyya (2017, 2018) for Malayalam, and Lingam et al. (2014) and Yadav and Lingam (2017) for Telugu.

Neural Machine Translation  On the neural machine translation (NMT) side, there have been a handful of NMT systems trained on English→Tamil. On the aforementioned Indic languages multilingual tasks of WAT-2018, Sen et al. (2018) and Dabre et al. (2018) reported only 11.88 and 18.60 BLEU scores, respectively, for English→Tamil. The poor performance of these systems compared to the 30.53 BLEU score of the SMT system (Ojha et al., 2018) showed that those NMT systems were not yet suitable for translating into the morphologically rich Tamil. However, the following year, Philip et al. (2019) outperformed Ramasamy et al. (2012) on the UFAL dataset with a BLEU score of 13.05 (the previous best score on this test set was 9.77). They report that techniques such as domain adaptation and back-translation can make training NMT systems on low-resource languages possible. Similar findings were also reported by Ramesh et al. (2020) for Tamil and by Dandapat and Federmann (2018) for Telugu.

To the best of our knowledge and as of 2021, there has not been any scientific publication involving translation to and from Kannada, except for Chakravarthi et al. (2019). One possible reason for this could be the fact that sizeable corpora involving Kannada (i.e. in the order of magnitude of at least a thousand sentences) have been readily available only since 2019, with the release of the JW300 Corpus (Agić and Vulić, 2019).

Multilingual NMT  Since 2018 several studies have presented multilingual NMT systems that can handle English → Malayalam, Tamil and Telugu translation (Dabre et al., 2018; Choudhary et al., 2020; Ojha et al., 2018; Sen et al., 2018; Yu et al., 2020; Dabre and Chakrabarty, 2020). In particular, Sen et al. (2018) presented results where the BLEU score improved when comparing monolingual and multilingual models. Conversely, Yu et al. (2020) found that NMT systems that were multi-way (Indic ↔ Indic) performed worse than English ↔ Indic systems.

To our knowledge, no work so far has explored the effect of the segmentation algorithm and dictionary size on the four languages: Kannada, Malayalam, Tamil and Telugu.

3 Subword Segmentation Techniques

Prior to the emergence of subword segmenters, translation systems were plagued by the issue of out-of-vocabulary (OOV) tokens. This was particularly an issue for translations involving agglutinative languages such as Turkish (Ataman and Federico, 2018) or Malayalam (Manohar et al., 2020). Various segmentation algorithms were brought forward to circumvent this issue and, in turn, improve translation quality.

Perhaps the most widely used algorithm in NMT to date is the language-agnostic Byte Pair Encoding (BPE) by Sennrich et al. (2016). Initially proposed by Gage (1994), BPE was repurposed by Sennrich et al. (2016) for the task of subword segmentation, and is based on a simple principle whereby pairs of character sequences that are frequently observed in a corpus get merged iteratively until a predetermined dictionary size is attained. In this paper we use a popular implementation of BPE, called SentencePiece (SP) (Kudo and Richardson, 2018).
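The merge loop at the heart of BPE is short enough to sketch in full. The following is a simplified toy version, close in spirit to the reference code of Sennrich et al. (2016) but not SentencePiece's actual implementation; the toy vocabulary is invented for illustration.

```python
import re
from collections import Counter

def get_stats(vocab):
    # Count adjacent symbol pairs across the vocabulary, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair "a b" with the merged symbol "ab".
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated character sequence with an
# end-of-word marker, mapped to its corpus frequency.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
for step in range(10):  # the number of merges determines the final dictionary size
    best = get_stats(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print(step, best)
```

In practice SentencePiece wraps this principle in a trainer that consumes the raw (untokenized) corpus and a target vocabulary size, so no language-specific pre-tokenization is needed.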
While purely statistical algorithms are able to segment any token into smaller segments, there is no guarantee that the generated tokens will be linguistically sensible. Unsupervised morphological induction is a rich area of research that also aims at learning a segmentation from data, but in a linguistically motivated way. The most well-known example is Morfessor with its different variants (Creutz and Lagus, 2002; Kohonen et al., 2010; Grönroos et al., 2014). An important obstacle to applying Morfessor to the task of NMT is the lack of a mechanism to determine the dictionary size. To address this, Ataman et al. (2017) proposed a modification of Morfessor FlatCat (Grönroos et al., 2014), called Linguistically Motivated Vocabulary Reduction (LMVR). Specifically, LMVR imposes an extra condition on the cost function of Morfessor FlatCat so as to favour vocabularies of the desired size. In a comparison of LMVR to BPE, Ataman et al. (2017) reported a +2.3 BLEU improvement on the English-Turkish translation task of WMT18.
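As a rough illustration of this size-control idea, consider the schematic below. It is only a sketch of the principle: the real LMVR objective extends Morfessor FlatCat's HMM-based coding cost, and the functions and weights here are invented for illustration, not LMVR's actual formulation.

```python
import math

def corpus_coding_cost(subword_freqs):
    """Stand-in for Morfessor's corpus coding cost: negative log-likelihood
    of the segmented corpus under unigram subword probabilities."""
    total = sum(subword_freqs.values())
    return -sum(f * math.log(f / total) for f in subword_freqs.values())

def lmvr_style_cost(subword_freqs, target_vocab_size, penalty_weight=10.0):
    """Schematic LMVR-style objective: the usual coding cost plus a penalty
    that grows as the lexicon size deviates from the desired size, steering
    the segmentation search toward roughly target_vocab_size entries."""
    size_gap = abs(len(subword_freqs) - target_vocab_size)
    return corpus_coding_cost(subword_freqs) + penalty_weight * size_gap
```

A segmentation search that splits or merges candidate morphs would compare lexicons under such a cost, so that oversized vocabularies are dispreferred even when they fit the corpus slightly better.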
Given the encouraging results reported on the agglutinative Turkish language, we hypothesise that translation into Dravidian languages may also benefit from a linguistically motivated segmenter, and evaluate LMVR against SP across varying vocabulary sizes.

4 Experimental Setup

4.1 Training Corpora

The parallel training data is mostly taken from the datasets available for the MultiIndicMT task from WAT 2021. If a certain dataset was not available from the MultiIndicMT training repository, we resorted to extracting that dataset from OPUS (Tiedemann, 2012) or WMT20. Table 2 reports the datasets that we used along with their domain and their source.

Name           Domain     Kannada  Malayalam  Tamil  Telugu
Bible          Religion   18       1          -      14
ELRC           COVID-19   -        <1         <1     <1
GNOME          Technical  <1       <1         <1     <1
JW300          Religion   70       45         52     45
KDE            Technical  1        <1         <1     <1
NLPC           General    -        -          <1     -
OpenSubtitles  Cinema     -        26         3      3
CVIT-PIB       Press      -        5          10     10
PMIndia        Politics   10       4          3      8
Tanzil         Religion   -        18         9      -
Tatoeba        General    <1       <1         <1     <1
Ted2020        General    <1       <1         <1     1
TICO-19        COVID-19   -        -          <1     -
Ubuntu         Technical  <1       <1         <1     <1
UFAL           Mixed      -        -          11     -
Wikimatrix     General    -        <1         10     18
Wikititles     General    -        -          1      -

Table 2: Composition of training corpora. The numbers indicate the relative size (in percentages) of the corresponding part for that language; "-" marks corpora not available for that language.

After extracting and cleaning the data (see below), approximately 8 million English tokens and their corresponding target language tokens are selected as our training corpora. We fixed the number of source tokens across language pairs in order …
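One simple way to realise such a fixed source-token budget is sketched below. The paper does not specify its exact selection procedure, so this greedy prefix selection, the budget constant, and the filenames are all assumptions for illustration.

```python
# Minimal sketch: keep sentence pairs until a fixed budget of English
# (source) tokens is reached. The paper's actual selection procedure is
# not specified here; filenames and the greedy strategy are hypothetical.
BUDGET = 8_000_000  # approximate English-token budget from the text

def select_pairs(src_path, tgt_path, out_src, out_tgt, budget=BUDGET):
    kept, tokens = 0, 0
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft, \
         open(out_src, "w", encoding="utf-8") as out_s, \
         open(out_tgt, "w", encoding="utf-8") as out_t:
        for src, tgt in zip(fs, ft):  # files are line-aligned
            n = len(src.split())
            if tokens + n > budget:
                break
            out_s.write(src)
            out_t.write(tgt)
            tokens += n
            kept += 1
    return kept, tokens

kept, tokens = select_pairs("train.en", "train.kn", "sel.en", "sel.kn")
print(f"kept {kept} sentence pairs, {tokens} English tokens")
```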