Sinhala-Tamil Machine Translation: Towards better Translation Quality

Randil Pushpananda, Ruvan Weerasinghe
Language Technology Research Laboratory
University of Colombo School of Computing, Sri Lanka
{rpn|arw}@ucsc.lk

Mahesan Niranjan
School of Electronics and Computer Science
University of Southampton, UK
mn@ecs.soton.ac.uk
Abstract

Statistical Machine Translation (SMT) is a well-known and well-established data-driven approach to language translation. The focus of this work is to develop a statistical machine translation system for the Sri Lankan language pair Sinhala and Tamil. This paper presents a systematic investigation of how Sinhala-Tamil SMT performance varies with the amount of parallel training data used, in order to find the minimum amount needed to develop a machine translation system with acceptable performance.

1   Introduction

Sri Lanka is a multi-ethnic, multi-lingual country. Sinhala and Tamil are the national languages of Sri Lanka. The majority of Sri Lankans do not have a good knowledge of languages other than their mother tongue. Therefore a language barrier exists between the Sinhala and Tamil communities. This language barrier, and the problems that arose during the last 30 years in the country, encouraged us to develop a translation application using the SMT approach. This would reduce the language gap between these two communities and thereby help solve a burning issue in the country.

The choice of the Sinhala-Tamil language pair provides some opportunities as well as some challenges. The opportunity is that the two languages share some affinity, having evolved alongside each other in Sri Lanka. The challenges include the sparseness of available data and the limited research undertaken on these languages. Hence, developing a successful system with limited resources is our ultimate goal.

2   Background and Related Work

There is very limited research reported in the literature on Sinhala-Tamil machine translation. According to (Weerasinghe, 2003), the Sinhala-Tamil language pair gives better performance than the Sinhala-English pair in SMT, since the two languages are more closely related to each other owing to their evolution within Sri Lanka. Some important factors to consider when building SMT for the Sinhala-Tamil language pair have been identified in (Sakthithasan et al., 2010). The limited amount of data, and the restricted domain it represented, make that work hard to generalize. Another study (Jeyakaran and Weerasinghe, 2011) explored the applicability of the Kernel Ridge Regression technique to Sinhala-Tamil translation. This research resulted in a hybrid of classical phrase-based SMT and Kernel Ridge Regression with two novel solutions for the pre-image problem.

Owing to the limited amount of parallel data available, it has not been possible to analyze how the results vary with increasing numbers of parallel sentences in Sinhala and Tamil for general-purpose MT.

2.1   Sinhala and Tamil Languages

Sinhala belongs to the Indo-Aryan language family and Tamil to the Dravidian family. Both Sinhala and Tamil are morphologically rich languages: Sinhala has up to 110 noun word forms and up to 282 verb word forms (Welgama et al., 2011), and Tamil has around 40 noun word forms and up to 240 verb word forms (Lushanthan, 2010). The two languages are also syntactically similar. The typical word order of both languages is Subject-Object-Verb. However, both are flexible in word order, and variant word orders are possible with discourse-pragmatic effects (Liyanage et al., 2012; Wikipedia, 2014).

In addition, Tamil has influenced some aspects of the structure of the Sinhalese language. The most significant impact of Tamil on Sinhalese has been at the lexical level (Karunatilaka, 2011). අම්මා (/amma/: mother), අක්කා
                   Randil Pushpananda, Ruvan Weerasinghe and Mahesan Niranjan. 2014. Sinhala-Tamil Machine Translation:
                   Towards better Translation Quality. In Proceedings of Australasian Language Technology Association
                   Workshop, pages 129−133.
(/akka/: elder sister), අයියා (/ayya/: elder brother) are some loan words out of more than a thousand words borrowed from Tamil into Sinhala (Coperahewa and Arunachalam, 2011).

3   Experiments and Results

3.1   Tools used

The open source statistical machine translation system MOSES (Koehn et al., 2007) was used with GIZA++ (Och and Ney, 2004), using the standard alignment heuristic grow-diag-final for word alignments. Tri-gram language models were trained on the target side of the parallel data and on the target language monolingual corpus, using the Stanford Research Institute Language Modeling toolkit (SRILM) (Stolcke and others, 2002) with Kneser-Ney smoothing. The systems were tuned on a small extracted parallel dataset with Minimum Error Rate Training (MERT) (Och, 2003) and then tested with different test sets. Finally, the Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) evaluation metric was used to evaluate the output produced by the translation system.

3.2   Data Collection and Data Preprocessing

To build a good baseline system, we need a sentence-aligned parallel corpus to train the translation model and a (possibly larger) monolingual corpus of the target language to train the language model.

  Language   Total Words   Unique Words   Sentences
  Sinhala    10,142,501    448,651        850,000
  Tamil      4,288,349     400,293        407,578
Table 1: Characteristics of Sinhala and Tamil Monolingual Corpora

We used the UCSC¹ 10M words Sinhala Corpus (Weerasinghe et al., 2007) and the 4M words Tamil Corpus (Weerasinghe et al., 2013) to build the Sinhala and Tamil language models respectively. Both are open domain corpora consisting mainly of newspaper articles and technical writing. The characteristics of the Sinhala and Tamil corpora are shown in Table 1.

¹ University of Colombo School of Computing

Finding a good, large Sinhala-Tamil parallel corpus was the main difficulty. For this purpose we collected a Sinhala-Tamil Parallel Corpus (Weerasinghe and Pushpananda, 2013), which consists of 25500 parallel sentences. This is also an open domain corpus, which includes mainly newspaper texts and technical writing. The length of the sentences in this corpus was restricted to 8 - 12 words. Both Sinhala to Tamil and Tamil to Sinhala translation models were built using this corpus. The characteristics of the Sinhala-Tamil parallel dataset are shown in Table 2.

  Language   Total Words (TW)   Unique Words (UW)   UW/TW
  Sinhala    252,101            37,128              15%
  Tamil      219,017            53,024              24%
Table 2: Characteristics of parallel dataset

3.2.1   Baseline Systems

Using the above parallel corpus, we trained two baseline systems: Sinhala to Tamil and Tamil to Sinhala. First, 500 parallel sentences were extracted randomly as the tuning dataset. Then, of the remaining 25000 parallel sentences, 5000 sentences were extracted randomly as the initial dataset. Applying 10-fold cross-validation (Kohavi and others, 1995) to get an unbiased result, we divided the extracted 5000 sentences equally into 10 mutually exclusive partitions; one of the partitions was used as the testing data and the other nine as training data. We then trained and evaluated the system iteratively for all combinations of the partitions and finally calculated the average performance in order to obtain unbiased estimates of accuracy. We repeated the same procedure, adding 5000 more sentences to the initial dataset each time, until the remaining dataset was empty.

Results   Figure 1 shows how the average BLEU score varies with the number of parallel sentences for both Sinhala to Tamil and Tamil to Sinhala translation. However, it clearly indicates that much more data would be required to build an acceptable translation model for the Sinhala-Tamil language pair. The results of the Tamil to Sinhala translation system in Figure 1 show that the BLEU score approaches 12.9 when the dataset size reaches 25000. The figure also shows that the BLEU score of the Sinhala to Tamil translation only approaches 10.1 for the full dataset of 25000 parallel sentences. Figure 1 shows that when the dataset size is increased from 5000 to 10,000 and from 10,000 to 20,000, the increase in performance varies by
[Figure 1: line plot of BLEU Score (%), 0 to 14, against Number of Sentences, 0 to 30000, for the SI-TA and TA-SI systems]
Figure 1: Average BLEU Score VS Number of Parallel Sentences
around 2 BLEU points for Sinhala to Tamil translation and around 2 to 3 BLEU points for Tamil to Sinhala translation. This is consistent with the results reported by Turchi et al. (2012).

  Language   Sample Size (S)   Average Perplexity   Out of Vocabulary (OOV)   OOV/S
  Sinhala    5000              1590.10              962                       19%
             25000             997.33               2225                      9%
  Tamil      5000              6067.65              1295                      26%
             25000             3819.94              3593                      14%
Table 3: Average perplexity values and out-of-vocabulary values of the Sinhala-Tamil Parallel Corpus

Also, as shown in Table 3, the average perplexity for both Sinhala and Tamil decreases as the number of sentences is increased. The Sinhala and Tamil sides of the parallel corpus were considered separately when calculating the perplexity values. These values are very high compared to those of the dominant European languages.

We carried out an error analysis to identify the problems of the methods we used and to find new methodologies to improve the results.

4   Error Analysis

The BLEU scores for the test sets of the 5000 and 25000 data samples were taken for the error analysis. The process for the error analysis was as follows.

• Calculate the number of total words (TotW) and unique words (UniW) in each of the training (Tr) and test (Te) datasets.
• Calculate the number of out-of-vocabulary (OOV) words in the test dataset (as a percentage of the test dataset).
• Calculate the number of untranslated words (UntransW) (as a percentage of the test dataset).
• Calculate the number of translated words which are not in the reference dataset (TargetOOV) (as a percentage of the test dataset).
• Calculate the number of translated words which are not in the target language model (Target LM OOV) (as a percentage of the reference dataset).

                             5000                 25000
  Description           TotW      UniW       TotW      UniW
  Training Dataset      44,806    13,723     224,959   34,858
  Testing Dataset       4,985     2,884      24,678    8,890
  OOV(%)                19.70     33.29      9.41      25.11
  UntransW(%)           33.78     52.98      17.82     44.26
  Reference Dataset     3,168     1,307      17,584    4,298
  TargetOOV(%)          17.65     19.15      9.58      17.43
  Target LM OOV(%)      0.29      0.33       1         1.55
Table 4: Results obtained from the error analysis of Sinhala to Tamil translation

The results obtained for the Sinhala to Tamil and Tamil to Sinhala translations are shown in
Tables 4 and 5 respectively.

                             5000                 25000
  Description           TotW      UniW       TotW      UniW
  Training Dataset      39,044    16,328     194,784   49,402
  Testing Dataset       4,336     2,968      21,462    10,381
  OOV(%)                30.32     43.67      16.84     33.85
  UntransW(%)           40.68     57.14      25.08     48.58
  Reference Dataset     3,168     1,307      17,584    4,298
  TargetOOV(%)          10.88     14.94      5.01      11.45
  Target LM OOV(%)      0.04      0.07       0.15      0.44
Table 5: Results obtained from the error analysis of Tamil to Sinhala translation

When considering the 5000 and 25000 datasets in Tables 4 and 5, we can see that the total number of words in the Tamil to Sinhala translation is lower than in the Sinhala to Tamil translation in both the training and testing datasets. However, the number of unique words in the Tamil to Sinhala translation is much higher than in the Sinhala to Tamil translation. This clearly shows the complexity of the Tamil language. However, as we expected, the OOV (unique word) rate is reduced by 8% - 10% when the dataset size is increased. That is one of the reasons for the increase in BLEU score.

We have identified two main problems. According to Table 4, 20% of the unique words in the test set are not translated even though they were in the training set, and 17% to 19% of the words in the translated output are not in the target reference set. These occur due to phrase alignment problems and also decoding problems. For example, if we need to translate ගෙදර (Home) to Tamil and the phrase table contains only ගෙදර එන්න (Come home) and ගෙදර යන්න (Go home), then that word will not be translated even though it is in the training set. Since Sinhala and Tamil are low-resourced languages, we need to consider these issues in order to build a good translation system. We can clearly see that the out-of-vocabulary rate and the untranslated word rate are much higher in Tamil to Sinhala translation. Also, when we consider the out-of-vocabulary words, we have found that they consist of proper names, misspelled words, inflections, derivatives and honorifics. These are the main problems that we could identify from the error analysis. Since human evaluation is very costly, we used only the above technique for the evaluation.

According to Figure 1, even though the OOV rates are higher, the BLEU scores of Tamil to Sinhala translation are higher. The main reason for this could be the size of the language model, since the Sinhala monolingual corpus contains more than twice as many words as the Tamil monolingual corpus. In addition, the Target OOV and Target LM OOV rates in Tamil to Sinhala translation are lower than in Sinhala to Tamil translation. That could be another reason for the higher BLEU score of Tamil to Sinhala translation.

5   Conclusion and Future Work

The purpose of this research was to find out how SMT systems perform for Sinhala to Tamil and Tamil to Sinhala translation. We can conclude that while Tamil to Sinhala and Sinhala to Tamil translation is unable to produce intelligible output with a parallel corpus of just 25000 sentence pairs of relatively short length, we can expect performance to approach usable levels by collecting a larger parallel corpus. Using this experience, we are currently collecting a more balanced parallel corpus.

However, the error analysis shows that the sentence length limitations of the Sinhala-Tamil parallel corpus may not be the only reason for the comparatively lower BLEU scores. Morphological richness may be the reason for the lower results, since misspelled words and proper names are common to other languages too. Furthermore, a preliminary study shows that we can get better perplexity values for the same dataset used in this research by stemming the suffixes of the Sinhala and Tamil parallel sentences. In future, we plan to investigate and find solutions to these problems and to implement a system capable of producing acceptable translations between Sinhala and Tamil for use by the wider community.

Acknowledgment

The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Research Council, ICT Agency and LK Domain Registry of Sri Lanka. The authors are grateful to past and current members of the Language Technology Research Laboratory of the UCSC, Sri Lanka for their significant contribution in developing the basic linguistic resources needed to carry out the research described above.
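
The incremental 10-fold cross-validation procedure of Section 3.2.1 can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the fixed random seed, and the `evaluate` callable (a hypothetical stand-in for training a translation system on the training folds and returning its BLEU score on the test fold) are assumptions introduced only for illustration.

```python
import random


def k_fold_splits(sentence_pairs, k=10, seed=0):
    """Yield (train, test) lists over k mutually exclusive partitions."""
    pairs = list(sentence_pairs)
    random.Random(seed).shuffle(pairs)        # random order; the seed is an assumption
    folds = [pairs[i::k] for i in range(k)]   # k disjoint, near-equal partitions
    for i, test in enumerate(folds):
        # The other k-1 partitions together form the training data.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


def average_score(sentence_pairs, evaluate, k=10, seed=0):
    """Average a score over all k train/test combinations.

    `evaluate(train, test)` is a hypothetical callable supplied by the caller.
    """
    scores = [evaluate(train, test)
              for train, test in k_fold_splits(sentence_pairs, k, seed)]
    return sum(scores) / len(scores)


def dataset_sizes(total=25000, step=5000):
    """Dataset sizes visited when `step` sentences are added each round."""
    return list(range(step, total + 1, step))
```

Under these assumptions, one would call `average_score` once for each size returned by `dataset_sizes()`, passing the first 5000, 10000, and so on randomly drawn sentence pairs.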
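
The out-of-vocabulary step of the error analysis in Section 4 can be sketched in the same spirit. The sketch assumes whitespace tokenization, which the paper does not specify, and it reads the TotW and UniW columns of Tables 4 and 5 as token-level and unique-word-level rates respectively; both are assumptions, and the function names are hypothetical.

```python
def tokens(sentences):
    """Split sentences on whitespace (an assumed tokenization)."""
    return [word for sentence in sentences for word in sentence.split()]


def oov_rates(train_sentences, test_sentences):
    """Return (token-level OOV %, unique-word OOV %) for the test set.

    A test word counts as out-of-vocabulary when it never occurs
    in the training sentences.
    """
    train_vocab = set(tokens(train_sentences))
    test_tokens = tokens(test_sentences)
    test_vocab = set(test_tokens)
    if not test_tokens:
        return 0.0, 0.0  # avoid division by zero on an empty test set
    oov_token_count = sum(1 for word in test_tokens if word not in train_vocab)
    oov_type_count = len(test_vocab - train_vocab)
    return (100.0 * oov_token_count / len(test_tokens),
            100.0 * oov_type_count / len(test_vocab))
```

The untranslated-word and TargetOOV rates could be computed in the same way by comparing the decoder output against the source vocabulary and the reference vocabulary.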