Dialectal Arabic to English Machine Translation: Pivoting through Modern Standard Arabic

Wael Salloum and Nizar Habash
Center for Computational Learning Systems
Columbia University
{wael,habash}@ccls.columbia.edu

Proceedings of NAACL-HLT 2013, pages 348–358, Atlanta, Georgia, 9–14 June 2013. © 2013 Association for Computational Linguistics

Abstract

Modern Standard Arabic (MSA) has a wealth of natural language processing (NLP) tools and resources. In comparison, resources for dialectal Arabic (DA), the unstandardized spoken varieties of Arabic, are still lacking. We present ELISSA, a machine translation (MT) system for DA to MSA. ELISSA employs a rule-based approach that relies on morphological analysis, transfer rules and dictionaries in addition to language models to produce MSA paraphrases of DA sentences. ELISSA can be employed as a general preprocessor for DA when using MSA NLP tools. A manual error analysis of ELISSA’s output shows that it produces correct MSA translations over 93% of the time. Using ELISSA to produce MSA versions of DA sentences as part of an MSA-pivoting DA-to-English MT solution improves BLEU scores on multiple blind test sets between 0.6% and 1.4%.

1 Introduction

Much work has been done on Modern Standard Arabic (MSA) natural language processing (NLP) and machine translation (MT), especially Statistical MT (SMT). MSA has a wealth of resources in terms of morphological analyzers, disambiguation systems, and parallel corpora. In comparison, research on dialectal Arabic (DA), the unstandardized spoken varieties of Arabic, is still lacking in NLP in general and MT in particular. In this paper we present ELISSA, our DA-to-MSA MT system, and show how it can help improve the translation of highly dialectal Arabic text into English by pivoting on MSA.

The ELISSA approach can be summarized as follows. First, ELISSA uses different techniques to identify dialectal words and multi-word constructions (phrases) in a source sentence. Then, ELISSA produces MSA paraphrases for the selected words and phrases using a rule-based component that depends on the existence of a dialectal morphological analyzer, a list of morphosyntactic transfer rules, and DA-MSA dictionaries. The resulting MSA is in a lattice form that we pass to a language model for n-best decoding. The output of ELISSA, whether a top-1 choice sentence or n-best sentences, is passed to an MSA-English SMT system to produce the English translation sentence. ELISSA-based MSA-pivoting for DA-to-English SMT improves BLEU scores (Papineni et al., 2002) on three blind test sets between 0.6% and 1.4%. A manual error analysis of translated words shows that ELISSA produces correct MSA translations over 93% of the time.

The rest of this paper is structured as follows: Section 2 motivates the use of ELISSA to improve DA-English SMT with an example. Section 3 discusses some of the challenges associated with processing Arabic and its dialects. Section 4 presents related work. Section 5 details ELISSA and its approach and Section 6 presents results evaluating ELISSA under a variety of conditions.

2 Motivating Example

Table 1 shows a motivating example of how pivoting on MSA can dramatically improve the translation quality of a statistical MT system that is trained on mostly MSA-to-English parallel corpora. In this example, we use Google Translate’s online Arabic-English SMT system.¹ The table is divided into two parts. The top part shows a dialectal (Levantine) sentence, its reference translation to English, and its Google Translate translation. The Google Translate translation clearly struggles with most of the DA words, which were probably unseen in the training data (i.e., out-of-vocabulary – OOV) and were considered proper nouns (transliterated and capitalized). The lack of DA-English parallel corpora suggests pivoting on MSA can improve the translation quality. In the bottom part of the table, we show a human MSA translation of the DA sentence above and its Google translation. We see that the results are quite promising. The goal of ELISSA is to model this DA-MSA translation automatically. In Section 5.4, we revisit this example to discuss ELISSA’s performance on it. We show its output and its corresponding Google translation in Table 3.

¹ The system was used on February 21, 2013.

DA source:
    bhAlHAlħ hAy mA Hyktbwlw ςHyT AlSfHh AlšxSyħ tbςw wlA bdn yAh ybςtln kwmyntAt lÂnw mAxbrhwn AymtA rH yrwH ςAlbld.
Human Reference:
    In this case, they will not write on his profile wall and they do not want him to send them comments because he did not tell them when he will go to the country.
Google Translate:
    Bhalhalh Hi Hictpoulo Ahat Profile Tbau not hull Weah Abatln Comintat Anu Mabarhun Oamta welcomed calls them Aalbuld.

Human DA-to-MSA:
    fy hðh AlHAlħ ln yktbwA lh ςlý HAyT SfHth AlšxSyħ wlA yrydwnh Ân yrsl lhm tςlyqAt lÂnh lm yxbrhm mtý syðhb Ǎlý Albld.
Google Translate:
    In this case it would not write to him on the wall of his own and do not want to send their comments because he did not tell them when going to the country.

Table 1: A motivating example for DA-to-English MT by pivoting (bridging) on MSA. The top half of the table displays a DA sentence, its human reference translation and the output of Google Translate. The bottom half of the table shows the result of human translation into MSA of the DA sentence before sending it to Google Translate.

3 Challenges for Processing Arabic and its Dialects

Contemporary Arabic is in fact a collection of varieties: MSA, the official language of the Arab World, which has a standard orthography and is used in formal settings; and DAs, the commonly used informal native varieties, which have no standard orthographies but have an increasing presence on the web. Arabic, in general, is a morphologically complex language which has rich inflectional morphology, expressed both templatically and affixationally, and several classes of attachable clitics. For example, the Arabic word² w+s+y-ktb-wn+hA ‘and they will write it’ has two proclitics (w+ ‘and’ and s+ ‘will’), one prefix y- ‘3rd person’, one suffix -wn ‘masculine plural’ and one pronominal enclitic +hA ‘it/her’. DAs differ from MSA phonologically, morphologically and to a lesser degree syntactically. The morphological differences are most noticeably expressed in the use of clitics and affixes that do not exist in MSA. For instance, the Levantine Arabic equivalent of the MSA example above is w+H+y-ktb-w+hA ‘and they will write it’. The optionality of vocalic diacritics helps hide some of the differences resulting from vowel changes; compare the diacritized forms: Levantine wHayikitbuwhA and MSA wasayaktubuwnahA.

All of the NLP challenges of MSA (e.g., optional diacritics and spelling inconsistency) are shared by DA. However, the lack of standard orthographies for the dialects and their numerous varieties pose new challenges. Additionally, DAs are rather impoverished in terms of available tools and resources compared to MSA, e.g., there are very few parallel DA-English corpora and almost no MSA-DA parallel corpora. The number and sophistication of morphological analysis and disambiguation tools in DA are very limited in comparison to MSA (Duh and Kirchhoff, 2005; Habash and Rambow, 2006; Abo Bakr et al., 2008; Habash, 2010; Salloum and Habash, 2011; Habash et al., 2012; Habash et al., 2013). MSA tools cannot be effectively used to handle DA, e.g., Habash and Rambow (2006) report that over one-third of Levantine verbs cannot be analyzed using an MSA morphological analyzer.

² Arabic transliteration throughout the paper is presented in the Habash-Soudi-Buckwalter scheme (Habash et al., 2007): (in alphabetical order) AbtθjHxdðrzsšSDTĎςγfqklmnhwy and the additional symbols: ’ ء, Â أ, Ǎ إ, Ā آ, ŵ ؤ, ŷ ئ, ħ ة, ý ى.

4 Related Work

Dialectal Arabic NLP. Several researchers have explored the idea of exploiting existing MSA rich resources to build tools for DA NLP (Chiang et al., 2006). Such approaches typically expect the presence of tools/resources to relate DA words to their MSA variants or translations. Given that DA and MSA do not have much in terms of parallel corpora, rule-based methods to translate DA-to-MSA or other methods to collect word-pair lists have been explored. For example, Abo Bakr et al. (2008) introduced a hybrid approach to transfer a sentence from Egyptian Arabic into MSA. This hybrid system consisted of a statistical system for tokenizing and tagging, and a rule-based system for constructing diacritized MSA sentences. Moreover, Al-Sabbagh and Girju (2010) described an approach of mining the web to build a DA-to-MSA lexicon. In the context of DA-to-English SMT, Riesa and Yarowsky (2006) presented a supervised algorithm for online morpheme segmentation on DA that cut the OOV words by half.

Machine Translation for Closely Related Languages. Using closely related languages has been shown to improve MT quality when resources are limited. Hajič et al. (2000) argued that for very close languages, e.g., Czech and Slovak, it is possible to obtain a better translation quality by using simple methods such as morphological disambiguation, transfer-based MT and word-for-word MT. Zhang (1998) introduced a Cantonese-Mandarin MT that uses transformational grammar rules. In the context of Arabic dialect translation, Sawaf (2010) built a hybrid MT system that uses both statistical and rule-based approaches for DA-to-English MT. In his approach, DA is normalized into MSA using a dialectal morphological analyzer. In previous work, we presented a rule-based DA-MSA system to improve DA-to-English MT (Salloum and Habash, 2011; Salloum and Habash, 2012). Our approach used a DA morphological analyzer (ADAM) and a list of hand-written morphosyntactic transfer rules. This use of “resource-rich” related languages is a specific variant of the more general approach of using pivot/bridge languages (Utiyama and Isahara, 2007; Kumar et al., 2007). In the case of MSA and DA variants, it is plausible to consider the MSA variants of a DA phrase as monolingual paraphrases (Callison-Burch et al., 2006; Du et al., 2010). Also related is the work by Nakov and Ng (2011), who use morphological knowledge to generate paraphrases for a morphologically rich language, Malay, to extend the phrase table in a Malay-to-English SMT system.

Pivoting on MSA or acquiring more DA-English data? Zbib et al. (2012) demonstrated an approach to cheaply obtaining DA-English data. They used Amazon’s Mechanical Turk (MTurk) to create a DA-English parallel corpus of 1.5M words and added it to a 150M MSA-English parallel corpus to create the training corpus of their SMT system. They also used MTurk to translate their dialectal test set to MSA in order to compare the MSA-pivoting approach to the direct translation from DA to English approach. They showed that even though pivoting on MSA (produced by human translators in an oracle experiment) can reduce OOV rate to 0.98% from 2.27% for direct translation (without pivoting), it improves by 4.91% BLEU while direct translation improves by 6.81% BLEU over their 12.29% BLEU baseline (direct translation using the 150M MSA system). They concluded that simple vocabulary coverage is not sufficient and the domain mismatch is a more important problem. The approach we take in this paper is orthogonal to such efforts to build parallel data. We plan to study interactions between the two types of solutions in the future.

Our work is most similar to Sawaf (2010)’s MSA-pivoting approach. In his approach, DA is normalized into MSA using character-based DA normalization rules, a DA morphological analyzer, a DA normalization decoder that relies on language models, and a lexicon. Similarly, we use some character normalization rules, a DA morphological analyzer, and DA-MSA dictionaries. In contrast, we use hand-written morphosyntactic transfer rules that focus on translating DA morphemes and lemmas to their MSA equivalents.

In our previous work (Salloum and Habash, 2011; Salloum and Habash, 2012), we applied our approach to tokenized Arabic and our DA-MSA transfer component used feature transfer rules only. We did not use a language model to pick the best path; instead we kept the ambiguity in the lattice and passed it to our SMT system. In contrast, in this paper, we run ELISSA on untokenized Arabic, we use feature, lemma, and surface form transfer rules, and we pick the best path of the generated MSA lattice through a language model.

Certain aspects of our approach are similar to Riesa and Yarowsky (2006)’s, in that we use morphological analysis for DA to help DA-English MT; but unlike them, we use a rule-based approach to model DA morphology.

5 ELISSA

ELISSA is a DA-to-MSA MT System. ELISSA uses a rule-based approach (with some statistical components) that relies on the existence of a DA morphological analyzer, a list of hand-written transfer rules, and DA-MSA dictionaries to create a mapping of DA to MSA words and construct a lattice of possible sentences. ELISSA uses a language model to rank and select the generated sentences.

ELISSA supports untokenized (raw) input only. ELISSA supports three types of output: top-1 choice, an n-best list or a map file that maps source words/phrases to target phrases. The top-1 and n-best lists are determined using an untokenized MSA language model to rank the paths in the MSA translation output lattice. This variety of output types makes it easy to plug ELISSA with other systems and to use it as a DA preprocessing tool for other MSA systems, e.g., MADA (Habash and Rambow, 2005) or AMIRA (Diab et al., 2007).

ELISSA’s approach consists of three major steps preceded by a preprocessing and normalization step that prepares the input text to be handled (e.g., UTF-8 cleaning, Alif/Ya normalization, word-lengthening normalization), and followed by a post-processing step that produces the output in the desired form (e.g., encoding choice). The three major steps are Selection, Translation, and Language Modeling.

5.1 Selection

In the first step, ELISSA identifies which words or phrases to paraphrase and which words or phrases to leave as is. ELISSA provides different methods (techniques) for selection, and can be configured to use different subsets of them. In Section 6 we use the term "selection mode" to denote a subset of selection methods. Selection methods are classified into Word-based selection and Phrase-based selection.

Word-based selection. Methods of this type fall in the following categories:

a. User token-based selection: The user can mark specific words for selection using the tag ‘/DIA’ (stands for ‘dialect’) after each word to select.
b. User type-based selection: The user can specify a list of words to select from, e.g., OOVs. Also the user can provide a list of words and their frequencies and specify a cut-off threshold to prevent selecting a frequent word.
c. Morphology-based word selection: ELISSA uses ADAM (Salloum and Habash, 2011) to select words that have DA analyses only (DIAONLY) or DA/MSA analyses (DIAMSA).
d. Dictionary-based selection: ELISSA selects words based on their existence in the DA side of our DA-MSA dictionaries.
e. All: ELISSA selects every word in an input sentence.

Phrase-based selection. This selection type uses hand-written rules to identify dialectal multi-word constructions that are mappable to single or multi-word MSA constructions. The current count of these rules is 25. Table 2 presents some rule categories and related examples.

In the current version of ELISSA, words can be selected using either the phrase-based selection method or a word-based selection method, but not both. Phrase-based selection has precedence. We evaluate different settings for the selection step in Section 6.

5.2 Translation

In this step, ELISSA translates the selected words and phrases to their MSA equivalent paraphrases. The specific type of selection determines the type of the translation, e.g., phrase-based selected words are translated using phrase-based translation rules. The MSA paraphrases are then used to form an MSA lattice.

Word-based translation. This category has two types of translation techniques: surface translation that uses DA-to-MSA surface-to-surface (S2S) transfer rules (TRs) and deep (morphological) translation that uses the classic rule-based machine translation flow: analysis, transfer and generation.
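
The preprocessing, word-based selection, and surface translation steps described in Section 5 can be illustrated with a minimal Python sketch. It is not ELISSA's implementation, and several details are assumptions: the paper names Alif/Ya and word-lengthening normalization but does not list the exact mappings, so the character maps and the three-or-more repetition threshold below are hypothetical; the function names are invented for illustration; and the toy dictionary uses two word correspondences visible in Table 1's human MSA translation rather than ELISSA's actual DA-MSA dictionaries. Only user token-based and dictionary-based selection are covered, and the language model ranking step is omitted.

```python
import re
from itertools import product

# Assumed mappings: hamzated/madda Alif forms -> bare Alif, Alif Maqsura -> Ya.
ALIF_VARIANTS = "[\u0622\u0623\u0625]"
ALIF = "\u0627"
ALIF_MAQSURA, YA = "\u0649", "\u064A"


def normalize(text):
    """Apply Alif/Ya normalization and word-lengthening normalization."""
    text = re.sub(ALIF_VARIANTS, ALIF, text)
    text = text.replace(ALIF_MAQSURA, YA)
    # Assumed rule: a character repeated three or more times is collapsed to one.
    return re.sub(r"(.)\1{2,}", r"\1", text)


def select(tokens, da_dictionary, tag="/DIA"):
    """Return indices of tokens that are user-tagged or found in the dictionary."""
    return [i for i, tok in enumerate(tokens)
            if tok.endswith(tag) or tok in da_dictionary]


def build_lattice(tokens, selected, da_dictionary, tag="/DIA"):
    """Map each token to its list of MSA alternatives (or to itself)."""
    lattice = []
    for i, tok in enumerate(tokens):
        word = tok[:-len(tag)] if tok.endswith(tag) else tok
        if i in selected and word in da_dictionary:
            lattice.append(list(da_dictionary[word]))
        else:
            lattice.append([word])  # unselected or unknown words pass through
    return lattice


def paths(lattice):
    """Enumerate every sentence encoded by the lattice."""
    return [" ".join(p) for p in product(*lattice)]


# Toy surface-to-surface dictionary (hypothetical, based on Table 1).
toy_dictionary = {"kwmyntAt": ["tςlyqAt"], "AymtA": ["mtý"]}
tokens = ["ybςtln", "kwmyntAt", "AymtA/DIA"]
selected = select(tokens, toy_dictionary)
candidates = paths(build_lattice(tokens, selected, toy_dictionary))
```

In a fuller system, the candidate sentences in `candidates` would be scored by an MSA language model to produce the top-1 or n-best output; here they are simply enumerated.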