Sinhala-Tamil Machine Translation: Towards better Translation Quality

Randil Pushpananda, Ruvan Weerasinghe
Language Technology Research Laboratory
University of Colombo School of Computing, Sri Lanka
{rpn|arw}@ucsc.lk

Mahesan Niranjan
School of Electronics and Computer Science
University of Southampton, UK
mn@ecs.soton.ac.uk

Randil Pushpananda, Ruvan Weerasinghe and Mahesan Niranjan. 2014. Sinhala-Tamil Machine Translation: Towards better Translation Quality. In Proceedings of Australasian Language Technology Association Workshop, pages 129-133.

Abstract

Statistical Machine Translation (SMT) is a well-known and well established data-driven approach used for language translation. The focus of this work is to develop a statistical machine translation system for the Sri Lankan language pair, Sinhala and Tamil. This paper presents a systematic investigation of how Sinhala-Tamil SMT performance varies with the amount of parallel training data used, in order to find out the minimum needed to develop a machine translation system with acceptable performance.

1 Introduction

Sri Lanka is a multi-ethnic, multi-lingual country. Sinhala and Tamil are the national languages of Sri Lanka. The majority of Sri Lankans do not have a good knowledge of languages other than their mother tongue. Therefore a language barrier exists between the Sinhala and Tamil communities. This language barrier, and the problems that arose during the last 30 years in the country, encouraged us to develop a translation application using the SMT approach. This would reduce the language gap between these two communities and thereby help solve a burning issue in the country.

The choice of the Sinhala-Tamil language pair provides some opportunities as well as some challenges. The opportunity is that the two languages share some affinity, having evolved alongside each other in Sri Lanka. The challenges include the sparseness of available data and the limited research undertaken on them. Hence, developing a successful system with limited resources is our ultimate goal.

2 Background and Related Work

There is very limited research reported in the literature for Sinhala-Tamil machine translation. According to (Weerasinghe, 2003), the Sinhala-Tamil language pair gives better performance than the Sinhala-English pair in SMT, since the two languages are more closely related to each other owing to their evolution within Sri Lanka. Some important factors to consider when building SMT for the Sinhala-Tamil language pair have been identified in (Sakthithasan et al., 2010). The limited amount of data, and the restricted domain it represented, make that work hard to generalize. Another study (Jeyakaran and Weerasinghe, 2011) explored the applicability of the Kernel Ridge Regression technique to Sinhala-Tamil translation. This research resulted in a hybrid of classical phrase based SMT and Kernel Ridge Regression with two novel solutions for the pre-image problem.

Owing to the limited amount of parallel data available, it has not been possible to analyze how the results vary with increasing numbers of parallel sentences in Sinhala and Tamil for general purpose MT.

2.1 Sinhala and Tamil Languages

Sinhala belongs to the Indo-Aryan language family and Tamil to the Dravidian family. Both Sinhala and Tamil are morphologically rich languages: Sinhala has up to 110 noun word forms and up to 282 verb word forms (Welgama et al., 2011), and Tamil has around 40 noun word forms and up to 240 verb word forms (Lushanthan, 2010). The two languages are also syntactically similar. The typical word order of both is Subject-Object-Verb. However, both are flexible with word order, and variant word orders are possible with discourse-pragmatic effects (Liyanage et al., 2012; Wikipedia, 2014).

In addition, there are some aspects of Tamil influence on the structure of the Sinhalese language. The most significant impact of Tamil on Sinhalese has been at the lexical level (Karunatilaka, 2011). අම්මා (/amma/: mother), අක්කා (/akka/: elder sister) and අයියා (/ayya/: elder brother) are some loan words out of more than a thousand words borrowed from Tamil into Sinhala (Coperahewa and Arunachalam, 2011).

3 Experiments and Results

3.1 Tools used

The open source statistical machine translation system MOSES (Koehn et al., 2007) was used with GIZA++ (Och and Ney, 2004), using the standard alignment heuristic grow-diag-final for word alignments. Tri-gram language models were trained on the target side of the parallel data and the target language monolingual corpus by using the Stanford Research Institute Language Modeling toolkit (SRILM) (Stolcke and others, 2002) with Kneser-Ney smoothing. The systems were tuned using a small extracted parallel dataset with Minimum Error Rate Training (MERT) (Och, 2003) and then tested with different test sets. Finally, the Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) evaluation metric was used to evaluate the output produced by the translation system.

3.2 Data Collection and Data Preprocessing

To build a good baseline system, we need a sentence-aligned parallel corpus to train the translation model and a (possibly larger) monolingual corpus of the target language to train the language model.

Language    Total Words    Unique Words    Sentences
Sinhala     10,142,501     448,651         850,000
Tamil        4,288,349     400,293         407,578

Table 1: Characteristics of Sinhala and Tamil Monolingual Corpora

We used the UCSC (University of Colombo School of Computing) 10M words Sinhala Corpus (Weerasinghe et al., 2007) and the 4M words Tamil Corpus (Weerasinghe et al., 2013) to build the Sinhala and Tamil language models respectively. Both are open domain corpora consisting mainly of newspaper articles and technical writing. The characteristics of the Sinhala and Tamil corpora are shown in Table 1.

Finding a good large Sinhala-Tamil parallel corpus was the main difficulty. For this purpose we collected a Sinhala-Tamil Parallel Corpus (Weerasinghe and Pushpananda, 2013) which consists of 25500 parallel sentences. This is also an open domain corpus which includes mainly newspaper texts and technical writing. The length of the sentences in this corpus was restricted to 8 - 12 words. Both Sinhala to Tamil and Tamil to Sinhala translation models were built using this corpus. The characteristics of the Sinhala-Tamil parallel dataset are shown in Table 2.

Language    Total Words (TW)    Unique Words (UW)    UW/TW
Sinhala     252,101             37,128               15%
Tamil       219,017             53,024               24%

Table 2: Characteristics of parallel dataset

3.2.1 Baseline Systems

Using the above parallel corpus, we trained two baseline systems: Sinhala to Tamil and Tamil to Sinhala. First, 500 parallel sentences were extracted randomly as the tuning dataset. Then, of the remaining 25000 parallel sentences, 5000 sentences were extracted randomly as the initial dataset. Applying 10-fold cross-validation (Kohavi and others, 1995) to get an unbiased result, we divided the extracted 5000 sentences equally into 10 mutually exclusive partitions; one of the partitions was used as the testing data and the other nine as training data. We then trained and evaluated the system iteratively for all combinations of the partitions and finally calculated the average performance of the results in order to obtain unbiased estimates of accuracy. We repeated the same procedure, adding 5000 more sentences to the initial dataset each time, until the remaining dataset was empty.

Results

Figure 1 shows how the average BLEU score varies with the number of parallel sentences for both Sinhala to Tamil and Tamil to Sinhala translation. It clearly indicates that much more data would be required to build an acceptable translation model for the Sinhala-Tamil language pair. The results of the Tamil to Sinhala translation system in Figure 1 show that the BLEU score approaches 12.9 when the dataset size reaches 25000. The results of the Sinhala to Tamil translation approach only 10.1 for the full dataset of 25000 parallel sentences. Figure 1 also shows that when the dataset size is increased from 5000 to 10,000 and from 10,000 to 20,000, the increase in performance is around 2 BLEU points for Sinhala to Tamil translation and around 2 to 3 BLEU points for Tamil to Sinhala translation. This is consistent with the results reported by Turchi et al. (2012).

[Figure 1: Average BLEU Score vs. Number of Parallel Sentences. x-axis: Number of Sentences; y-axis: BLEU Score (%); series: SI-TA, TA-SI.]

Language    Sample Size (S)    Average Perplexity    Out of Vocabulary (OOV)    OOV/S
Sinhala      5000              1590.10                962                       19%
Sinhala     25000               997.33               2225                        9%
Tamil        5000              6067.65               1295                       26%
Tamil       25000              3819.94               3593                       14%

Table 3: Average perplexity values and out-of-vocabulary values of the Sinhala-Tamil Parallel Corpus

Also, as shown in Table 3, the average perplexity for both Sinhala and Tamil decreases as the number of sentences is increased. The Sinhala and Tamil sides of the parallel corpus were considered separately to calculate the perplexity values. These values are very high compared to those of the dominant European languages.

We then carried out an error analysis to identify the problems of the methods we used and to find new methodologies to improve the results.

4 Error Analysis

The BLEU scores for test sets of the 5000 and 25000 data samples were taken for the error analysis. The process for the error analysis was as follows.

• Calculate the number of total words (TotW) and unique words (UniW) in each training (Tr) and test (Te) dataset.
• Calculate the number of out-of-vocabulary (OOV) words in the test dataset (as a percentage of the test dataset).
• Calculate the number of untranslated words (UntransW) (as a percentage of the test dataset).
• Calculate the number of translated words which are not in the reference dataset (TargetOOV) (as a percentage of the test dataset).
• Calculate the number of translated words which are not in the target language model (Target LM OOV) (as a percentage of the reference dataset).

                          5000                  25000
Description           TotW      UniW        TotW       UniW
Training Dataset      44,806    13,723      224,959    34,858
Testing Dataset        4,985     2,884       24,678     8,890
OOV(%)                 19.70     33.29         9.41     25.11
UntransW(%)            33.78     52.98        17.82     44.26
Reference Dataset      3,168     1,307       17,584     4,298
TargetOOV(%)           17.65     19.15         9.58     17.43
Target LM OOV(%)        0.29      0.33            1      1.55

Table 4: Results obtained from the error analysis of Sinhala to Tamil translation

                          5000                  25000
Description           TotW      UniW        TotW       UniW
Training Dataset      39,044    16,328      194,784    49,402
Testing Dataset        4,336     2,968       21,462    10,381
OOV(%)                 30.32     43.67        16.84     33.85
UntransW(%)            40.68     57.14        25.08     48.58
Reference Dataset      3,168     1,307       17,584     4,298
TargetOOV(%)           10.88     14.94         5.01     11.45
Target LM OOV(%)        0.04      0.07         0.15      0.44

Table 5: Results obtained from the error analysis of Tamil to Sinhala translation

The results obtained for the Sinhala to Tamil and Tamil to Sinhala translations are shown in Tables 4 and 5 respectively. Considering the 5000 and 25000 datasets in Tables 4 and 5, we can see that the total number of words in the Tamil to Sinhala translation is lower than in the Sinhala to Tamil translation in both training and testing datasets. However, the number of unique words in the Tamil to Sinhala translation is much higher than in the Sinhala to Tamil translation. This clearly shows the complexity of the Tamil language. As we expected, the OOV (unique word) rate is reduced by 8% - 10% when the dataset size is increased. That is one of the reasons for the increase in BLEU score.

We have identified two main problems. According to Table 4, about 20% of the unique words in the test set are not translated even though they were in the training set (the difference between the UntransW and OOV rates for unique words), and 17% to 19% of the words in the translated output are not in the target reference set. These occur due to phrase alignment problems and decoding problems. For example, if we need to translate ගෙදර (Home) to Tamil and the phrase table contains only ගෙදර එන්න (Come home) and ගෙදර යන්න (Go home), then that word will not be translated even though it is in the training set. Since Sinhala and Tamil are low-resourced languages, we need to consider these issues to build a good translation system.

We can clearly see that the out-of-vocabulary rate and the untranslated word rate are much higher in Tamil to Sinhala translation. When we consider the out-of-vocabulary words, we find that they consist of proper names, misspelled words, inflections, derivatives and honorifics. These are the main problems that we could identify from the error analysis. Since human evaluation is very costly, we used only the above technique to do the evaluation.

According to Figure 1, even though the OOV rates are higher, the BLEU scores of Tamil to Sinhala translation are higher. The main reason for this could be the size of the language model, since the Sinhala monolingual corpus contains more than twice as many words as the Tamil monolingual corpus. The Target OOV and Target LM OOV rates are also lower in Tamil to Sinhala translation than in Sinhala to Tamil translation. That could be another reason for the higher BLEU scores of Tamil to Sinhala translation.

5 Conclusion and Future Work

The purpose of this research was to find out how SMT systems perform for Sinhala to Tamil and Tamil to Sinhala translation. We can conclude that while Tamil to Sinhala and Sinhala to Tamil translation is unable to produce intelligible output with a parallel corpus of just 25000 sentence pairs of relatively short length, we can expect performance to approach usable levels by collecting a large parallel corpus. Using this experience, we are currently collecting a more balanced parallel corpus.

However, the error analysis shows that the sentence length limitation of the Sinhala-Tamil parallel corpus cannot be the only reason for the comparatively lower BLEU scores. Morphological richness may be the reason for the lower results, since misspelled words and proper names are common to other languages too. Furthermore, a preliminary study shows that we can get better perplexity values for the same dataset used in this research by stemming suffixes of the Sinhala and Tamil parallel sentences. In future, we plan to investigate and find solutions to these problems and to implement a system capable of producing acceptable translations between Sinhala and Tamil for use by the wider community.

Acknowledgment

The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Research Council, ICT Agency and LK Domain Registry of Sri Lanka. The authors are grateful to past and current members of the Language Technology Research Laboratory of the UCSC, Sri Lanka for their significant contribution in developing the basic linguistic resources needed to carry out the research described above.
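As an illustration of the data-splitting protocol of Section 3.2.1 (a 500-sentence tuning set, then 10-fold cross-validation on subsets that grow in steps of 5000), the following is a minimal sketch and not the scripts used in the experiments. The function names and the `score_fn` callback are hypothetical: `score_fn` stands in for training a system on the training folds, tuning it on the tuning set, and returning a BLEU score for the held-out fold.

```python
import random
from typing import Callable, Dict, Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 10) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) lists; each of the k equal partitions is the test set once."""
    # Assumes len(items) is a multiple of k (true for 5000, 10000, ..., 25000).
    # Any leftover items beyond k * fold would stay in every training set.
    fold = len(items) // k
    for i in range(k):
        test = list(items[i * fold:(i + 1) * fold])
        train = list(items[:i * fold]) + list(items[(i + 1) * fold:])
        yield train, test


def incremental_cv(
    pairs: Sequence[T],
    score_fn: Callable[[List[T], List[T], List[T]], float],  # hypothetical train-and-score step
    tune_size: int = 500,
    step: int = 5000,
    k: int = 10,
    seed: int = 0,
) -> Dict[int, float]:
    """Return the average k-fold score for each dataset size step, 2*step, ..."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)  # a random order makes every prefix a random sample
    tuning, pool = shuffled[:tune_size], shuffled[tune_size:]
    averages: Dict[int, float] = {}
    for size in range(step, len(pool) + 1, step):
        subset = pool[:size]  # the previous subset plus `step` new sentence pairs
        scores = [score_fn(train, test, tuning) for train, test in k_fold_splits(subset, k)]
        averages[size] = sum(scores) / len(scores)
    return averages
```

With 25500 sentence pairs and the default arguments, this yields one averaged score for each of the sizes 5000, 10000, 15000, 20000 and 25000, which corresponds to the points plotted in Figure 1.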
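The first two counting steps of the error analysis in Section 4 (TotW, UniW and the OOV percentages) can be sketched as below. This is a hedged illustration under stated assumptions: tokens are obtained by whitespace splitting, the function names are hypothetical, and the OOV rate is taken to be the share of test words absent from the training vocabulary, computed once over all tokens (the TotW column) and once over unique words (the UniW column). The exact tokenization and counting rules used in the experiments are not given in the text.

```python
from typing import List, Sequence, Tuple


def tokens(sentences: Sequence[str]) -> List[str]:
    """Split every sentence on whitespace (an assumed, simplistic tokenization)."""
    return [tok for sent in sentences for tok in sent.split()]


def word_counts(sentences: Sequence[str]) -> Tuple[int, int]:
    """Return (TotW, UniW): total and unique word counts of a dataset."""
    toks = tokens(sentences)
    return len(toks), len(set(toks))


def oov_rates(train: Sequence[str], test: Sequence[str]) -> Tuple[float, float]:
    """Return OOV percentages of a non-empty test set as (by tokens, by unique words)."""
    train_vocab = set(tokens(train))
    test_toks = tokens(test)
    test_types = set(test_toks)
    oov_tokens = sum(1 for tok in test_toks if tok not in train_vocab)
    oov_types = len(test_types - train_vocab)
    return 100.0 * oov_tokens / len(test_toks), 100.0 * oov_types / len(test_types)


# Toy data: "e" and "f" never occur in training; "e" occurs twice in the test set.
train_set = ["a b c", "a d"]
test_set = ["a e", "f b e"]
print(word_counts(test_set))           # (5, 4)
print(oov_rates(train_set, test_set))  # (60.0, 50.0)
```

Counting by tokens and by unique words separately matters because frequent words tend to be covered by the training data, so the two rates can differ widely, as they do in Tables 4 and 5.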