Dialectal Arabic to English Machine Translation:
Pivoting through Modern Standard Arabic

Wael Salloum and Nizar Habash
Center for Computational Learning Systems
Columbia University
{wael,habash}@ccls.columbia.edu

Proceedings of NAACL-HLT 2013, pages 348–358, Atlanta, Georgia, 9–14 June 2013. © 2013 Association for Computational Linguistics

Abstract

Modern Standard Arabic (MSA) has a wealth of natural language processing (NLP) tools and resources. In comparison, resources for dialectal Arabic (DA), the unstandardized spoken varieties of Arabic, are still lacking. We present ELISSA, a machine translation (MT) system for DA to MSA. ELISSA employs a rule-based approach that relies on morphological analysis, transfer rules and dictionaries in addition to language models to produce MSA paraphrases of DA sentences. ELISSA can be employed as a general preprocessor for DA when using MSA NLP tools. A manual error analysis of ELISSA’s output shows that it produces correct MSA translations over 93% of the time. Using ELISSA to produce MSA versions of DA sentences as part of an MSA-pivoting DA-to-English MT solution improves BLEU scores on multiple blind test sets between 0.6% and 1.4%.

1 Introduction

Much work has been done on Modern Standard Arabic (MSA) natural language processing (NLP) and machine translation (MT), especially Statistical MT (SMT). MSA has a wealth of resources in terms of morphological analyzers, disambiguation systems, and parallel corpora. In comparison, research on dialectal Arabic (DA), the unstandardized spoken varieties of Arabic, is still lacking in NLP in general and MT in particular. In this paper we present ELISSA, our DA-to-MSA MT system, and show how it can help improve the translation of highly dialectal Arabic text into English by pivoting on MSA.

The ELISSA approach can be summarized as follows. First, ELISSA uses different techniques to identify dialectal words and multi-word constructions (phrases) in a source sentence. Then, ELISSA produces MSA paraphrases for the selected words and phrases using a rule-based component that depends on the existence of a dialectal morphological analyzer, a list of morphosyntactic transfer rules, and DA-MSA dictionaries. The resulting MSA is in a lattice form that we pass to a language model for n-best decoding. The output of ELISSA, whether a top-1 choice sentence or n-best sentences, is passed to an MSA-English SMT system to produce the English translation sentence. ELISSA-based MSA-pivoting for DA-to-English SMT improves BLEU scores (Papineni et al., 2002) on three blind test sets between 0.6% and 1.4%. A manual error analysis of translated words shows that ELISSA produces correct MSA translations over 93% of the time.

The rest of this paper is structured as follows: Section 2 motivates the use of ELISSA to improve DA-English SMT with an example. Section 3 discusses some of the challenges associated with processing Arabic and its dialects. Section 4 presents related work. Section 5 details ELISSA and its approach and Section 6 presents results evaluating ELISSA under a variety of conditions.

2 Motivating Example

Table 1 shows a motivating example of how pivoting on MSA can dramatically improve the translation quality of a statistical MT system that is trained on mostly MSA-to-English parallel corpora. In this example, we use Google Translate’s online Arabic-English SMT system.¹ The table is divided into two parts. The top part shows a dialectal (Levantine) sentence, its reference translation to English, and its Google Translate translation. The Google Translate translation clearly struggles with most of the DA words, which were probably unseen in the training data (i.e., out-of-vocabulary, OOV) and were considered proper nouns (transliterated and capitalized). The lack of DA-English parallel corpora suggests pivoting on MSA can improve the translation quality. In the bottom part of the table, we show a human MSA translation of the DA sentence above and its Google translation. We see that the results are quite promising. The goal of ELISSA is to model this DA-MSA translation automatically. In Section 5.4, we revisit this example to discuss ELISSA’s performance on it. We show its output and its corresponding Google translation in Table 3.

¹ The system was used on February 21, 2013.

  DA source           بهالحالة هاي ما حيكتبولو عحيط الصفحه الشخصية تبعو ولا بدن ياه يبعتلن كومينتات لأنو ماخبرهون ايمتا رح يروح عالبلد.
                      bhAlHAlħ hAy mA Hyktbwlw ςHyT AlSfHh AlšxSyħ tbςw wlA bdn yAh ybςtln kwmyntAt lÂnw mAxbrhwn AymtA rH yrwH ςAlbld.
  Human Reference     In this case, they will not write on his profile wall and they do not want him to send them comments because he did not tell them when he will go to the country.
  Google Translate    Bhalhalh Hi Hictpoulo Ahat Profile Tbau not hull Weah Abatln Comintat Anu Mabarhun Oamta welcomed calls them Aalbuld.

  Human DA-to-MSA     في هذه الحالة لن يكتبوا له على حائط صفحته الشخصية ولا يريدونه أن يرسل لهم تعليقات لأنه لم يخبرهم متى سيذهب إلى البلد.
                      fy hðh AlHAlħ ln yktbwA lh ςlý HAŷT SfHth AlšxSyħ wlA yrydwnh Ân yrsl lhm tςlyqAt lÂnh lm yxbrhm mtý syðhb Ǎlý Albld.
  Google Translate    In this case it would not write to him on the wall of his own and do not want to send their comments because he did not tell them when going to the country.

Table 1: A motivating example for DA-to-English MT by pivoting (bridging) on MSA. The top half of the table displays a DA sentence, its human reference translation and the output of Google Translate. The bottom half of the table shows the result of human translation into MSA of the DA sentence before sending it to Google Translate.

3 Challenges for Processing Arabic and its Dialects

Contemporary Arabic is in fact a collection of varieties: MSA, the official language of the Arab World, which has a standard orthography and is used in formal settings; and DAs, the commonly used informal native varieties, which have no standard orthographies but have an increasing presence on the web. Arabic, in general, is a morphologically complex language which has rich inflectional morphology, expressed both templatically and affixationally, and several classes of attachable clitics. For example, the Arabic word وسيكتبونها w+s+y-ktb-wn+hA² ‘and they will write it’ has two proclitics (و+ w+ ‘and’ and س+ s+ ‘will’), one prefix ي- y- ‘3rd person’, one suffix -ون -wn ‘masculine plural’ and one pronominal enclitic +ها +hA ‘it/her’. DAs differ from MSA phonologically, morphologically and to a lesser degree syntactically. The morphological differences are most noticeably expressed in the use of clitics and affixes that do not exist in MSA. For instance, the Levantine Arabic equivalent of the MSA example above is وحيكتبوها w+H+y-ktb-w+hA ‘and they will write it’. The optionality of vocalic diacritics helps hide some of the differences resulting from vowel changes; compare the diacritized forms: Levantine wHayikitbuwhA and MSA wasayaktubuwnahA.

² Arabic transliteration throughout the paper is presented in the Habash-Soudi-Buckwalter scheme (Habash et al., 2007): (in alphabetical order) AbtθjHxdðrzsšSDTĎςγfqklmnhwy and the additional symbols: ’ ء, Â أ, Ǎ إ, Ā آ, ŵ ؤ, ŷ ئ, ħ ة, ý ى.

All of the NLP challenges of MSA (e.g., optional diacritics and spelling inconsistency) are shared by DA. However, the lack of standard orthographies for the dialects and their numerous varieties pose new challenges. Additionally, DAs are rather impoverished in terms of available tools and resources compared to MSA, e.g., there are very few parallel DA-English corpora and almost no MSA-DA parallel corpora. The number and sophistication of morphological analysis and disambiguation tools in DA is very limited in comparison to MSA (Duh and Kirchhoff, 2005; Habash and Rambow, 2006; Abo Bakr et al., 2008; Habash, 2010; Salloum and Habash, 2011; Habash et al., 2012; Habash et al., 2013). MSA tools cannot be effectively used to handle DA, e.g., Habash and Rambow (2006) report that over one-third of Levantine verbs cannot be analyzed using an MSA morphological analyzer.
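The segmentation notation used in the examples of Section 3 can be unpacked mechanically. The Python sketch below is an illustration only, not part of ELISSA or of any analyzer cited here. It assumes that "+" separates clitics from the inflected core and that the core always has the shape prefix-stem-suffix, which holds for the two examples above but not for Arabic words in general. The names Segmentation, parse_segmentation and differing_slots are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Tuple


@dataclass
class Segmentation:
    proclitics: List[str]
    prefix: str
    stem: str
    suffix: str
    enclitics: List[str]


def parse_segmentation(analysis: str) -> Segmentation:
    """Split a string such as 'w+s+y-ktb-wn+hA' into its morphological slots.

    Assumption: '+' marks clitic boundaries and exactly one '+'-delimited
    chunk holds the inflected core, written as prefix-stem-suffix.
    """
    chunks = analysis.split("+")
    core_positions = [i for i, chunk in enumerate(chunks) if "-" in chunk]
    if len(core_positions) != 1:
        raise ValueError(f"expected exactly one inflected core in {analysis!r}")
    core = core_positions[0]
    parts = chunks[core].split("-")
    if len(parts) != 3:
        raise ValueError(f"expected prefix-stem-suffix in {chunks[core]!r}")
    prefix, stem, suffix = parts
    return Segmentation(chunks[:core], prefix, stem, suffix, chunks[core + 1:])


def differing_slots(a: Segmentation, b: Segmentation) -> Dict[str, Tuple[object, object]]:
    """Return the slots whose values differ between two segmentations."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


msa = parse_segmentation("w+s+y-ktb-wn+hA")
levantine = parse_segmentation("w+H+y-ktb-w+hA")
print(differing_slots(msa, levantine))
```

For the two forms quoted in the text, the only differing slots are the proclitics (the future marker s+ versus H+) and the suffix (-wn versus -w); the stem, prefix and enclitic are shared.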
4 Related Work

Dialectal Arabic NLP. Several researchers have explored the idea of exploiting existing MSA rich resources to build tools for DA NLP (Chiang et al., 2006). Such approaches typically expect the presence of tools/resources to relate DA words to their MSA variants or translations. Given that DA and MSA do not have much in terms of parallel corpora, rule-based methods to translate DA-to-MSA or other methods to collect word-pair lists have been explored. For example, Abo Bakr et al. (2008) introduced a hybrid approach to transfer a sentence from Egyptian Arabic into MSA. This hybrid system consisted of a statistical system for tokenizing and tagging, and a rule-based system for constructing diacritized MSA sentences. Moreover, Al-Sabbagh and Girju (2010) described an approach of mining the web to build a DA-to-MSA lexicon. In the context of DA-to-English SMT, Riesa and Yarowsky (2006) presented a supervised algorithm for online morpheme segmentation on DA that cut the OOV words by half.

Machine Translation for Closely Related Languages. Using closely related languages has been shown to improve MT quality when resources are limited. Hajič et al. (2000) argued that for very close languages, e.g., Czech and Slovak, it is possible to obtain a better translation quality by using simple methods such as morphological disambiguation, transfer-based MT and word-for-word MT. Zhang (1998) introduced a Cantonese-Mandarin MT that uses transformational grammar rules. In the context of Arabic dialect translation, Sawaf (2010) built a hybrid MT system that uses both statistical and rule-based approaches for DA-to-English MT. In his approach, DA is normalized into MSA using a dialectal morphological analyzer. In previous work, we presented a rule-based DA-MSA system to improve DA-to-English MT (Salloum and Habash, 2011; Salloum and Habash, 2012). Our approach used a DA morphological analyzer (ADAM) and a list of hand-written morphosyntactic transfer rules. This use of “resource-rich” related languages is a specific variant of the more general approach of using pivot/bridge languages (Utiyama and Isahara, 2007; Kumar et al., 2007). In the case of MSA and DA variants, it is plausible to consider the MSA variants of a DA phrase as monolingual paraphrases (Callison-Burch et al., 2006; Du et al., 2010). Also related is the work by Nakov and Ng (2011), who use morphological knowledge to generate paraphrases for a morphologically rich language, Malay, to extend the phrase table in a Malay-to-English SMT system.

Pivoting on MSA or acquiring more DA-English data? Zbib et al. (2012) demonstrated an approach to cheaply obtaining DA-English data. They used Amazon’s Mechanical Turk (MTurk) to create a DA-English parallel corpus of 1.5M words and added it to a 150M MSA-English parallel corpus to create the training corpus of their SMT system. They also used MTurk to translate their dialectal test set to MSA in order to compare the MSA-pivoting approach to the direct translation from DA to English approach. They showed that even though pivoting on MSA (produced by human translators in an oracle experiment) can reduce OOV rate to 0.98% from 2.27% for direct translation (without pivoting), it improves by 4.91% BLEU while direct translation improves by 6.81% BLEU over their 12.29% BLEU baseline (direct translation using the 150M MSA system). They concluded that simple vocabulary coverage is not sufficient and the domain mismatch is a more important problem. The approach we take in this paper is orthogonal to such efforts to build parallel data. We plan to study interactions between the two types of solutions in the future.

Our work is most similar to Sawaf (2010)’s MSA-pivoting approach. In his approach, DA is normalized into MSA using character-based DA normalization rules, a DA morphological analyzer, a DA normalization decoder that relies on language models, and a lexicon. Similarly, we use some character normalization rules, a DA morphological analyzer, and DA-MSA dictionaries. In contrast, we use hand-written morphosyntactic transfer rules that focus on translating DA morphemes and lemmas to their MSA equivalents.

In our previous work (Salloum and Habash, 2011; Salloum and Habash, 2012), we applied our approach to tokenized Arabic and our DA-MSA transfer component used feature transfer rules only. We did not use a language model to pick the best path; instead we kept the ambiguity in the lattice and passed it to our SMT system. In contrast, in this paper, we run ELISSA on untokenized Arabic, we use feature, lemma, and surface form transfer rules, and we pick the best path of the generated MSA lattice through a language model.

Certain aspects of our approach are similar to Riesa and Yarowsky (2006)’s, in that we use morphological analysis for DA to help DA-English MT; but unlike them, we use a rule-based approach to model DA morphology.

5 ELISSA

ELISSA is a DA-to-MSA MT system. ELISSA uses a rule-based approach (with some statistical components) that relies on the existence of a DA morphological analyzer, a list of hand-written transfer rules, and DA-MSA dictionaries to create a mapping of DA to MSA words and construct a lattice of possible sentences. ELISSA uses a language model to rank and select the generated sentences.

ELISSA supports untokenized (raw) input only. ELISSA supports three types of output: top-1 choice, an n-best list or a map file that maps source words/phrases to target phrases. The top-1 and n-best lists are determined using an untokenized MSA language model to rank the paths in the MSA translation output lattice. This variety of output types makes it easy to plug ELISSA with other systems and to use it as a DA preprocessing tool for other MSA systems, e.g., MADA (Habash and Rambow, 2005) or AMIRA (Diab et al., 2007).

ELISSA’s approach consists of three major steps preceded by a preprocessing and normalization step, which prepares the input text to be handled (e.g., UTF-8 cleaning, Alif/Ya normalization, word-lengthening normalization), and followed by a post-processing step, which produces the output in the desired form (e.g., encoding choice). The three major steps are Selection, Translation, and Language Modeling.

5.1 Selection

In the first step, ELISSA identifies which words or phrases to paraphrase and which words or phrases to leave as is. ELISSA provides different methods (techniques) for selection, and can be configured to use different subsets of them. In Section 6 we use the term "selection mode" to denote a subset of selection methods. Selection methods are classified into Word-based selection and Phrase-based selection.

Word-based selection. Methods of this type fall in the following categories:
  a. User token-based selection: The user can mark specific words for selection using the tag ‘/DIA’ (stands for ‘dialect’) after each word to select.
  b. User type-based selection: The user can specify a list of words to select from, e.g., OOVs. Also the user can provide a list of words and their frequencies and specify a cut-off threshold to prevent selecting a frequent word.
  c. Morphology-based word selection: ELISSA uses ADAM (Salloum and Habash, 2011) to select words that have DA analyses only (DIAONLY) or DA/MSA analyses (DIAMSA).
  d. Dictionary-based selection: ELISSA selects words based on their existence in the DA side of our DA-MSA dictionaries.
  e. All: ELISSA selects every word in an input sentence.

Phrase-based selection. This selection type uses hand-written rules to identify dialectal multi-word constructions that are mappable to single or multi-word MSA constructions. The current count of these rules is 25. Table 2 presents some rule categories and related examples.

In the current version of ELISSA, words can be selected using either the phrase-based selection method or a word-based selection method, but not both. Phrase-based selection has precedence. We evaluate different settings for the selection step in Section 6.

5.2 Translation

In this step, ELISSA translates the selected words and phrases to their MSA equivalent paraphrases. The specific type of selection determines the type of the translation, e.g., phrase-based selected words are translated using phrase-based translation rules. The MSA paraphrases are then used to form an MSA lattice.

Word-based translation. This category has two types of translation techniques: surface translation that uses DA-to-MSA surface-to-surface (S2S) transfer rules (TRs) and deep (morphological) translation that uses the classic rule-based machine translation flow: analysis, transfer and generation.
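Some of the word-based selection methods listed in Section 5.1 are simple enough to sketch in a few lines. The Python below is a minimal illustration under stated assumptions, not ELISSA's implementation: all function names are hypothetical, the frequency values are invented toy numbers, and the choice to skip a word whose frequency is greater than or equal to the cut-off is an assumption, since the text does not say whether the threshold is inclusive. Morphology-based selection is omitted because it depends on the ADAM analyzer.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

DIA_TAG = "/DIA"  # tag the user appends to a word to mark it as dialectal


def select_user_tokens(tokens: Iterable[str]) -> Tuple[List[str], Set[int]]:
    """User token-based selection: strip '/DIA' tags and record tagged positions."""
    clean: List[str] = []
    selected: Set[int] = set()
    for index, token in enumerate(tokens):
        if token.endswith(DIA_TAG) and len(token) > len(DIA_TAG):
            clean.append(token[: -len(DIA_TAG)])
            selected.add(index)
        else:
            clean.append(token)
    return clean, selected


def select_user_types(
    tokens: List[str],
    word_list: Set[str],
    frequencies: Optional[Dict[str, int]] = None,
    cutoff: Optional[int] = None,
) -> Set[int]:
    """User type-based selection with an optional frequency cut-off.

    Assumption: a listed word is skipped when its frequency is >= cutoff.
    """
    selected: Set[int] = set()
    for index, token in enumerate(tokens):
        if token not in word_list:
            continue
        if frequencies is not None and cutoff is not None:
            if frequencies.get(token, 0) >= cutoff:
                continue  # too frequent to be worth selecting
        selected.add(index)
    return selected


def select_by_dictionary(tokens: List[str], da_side: Set[str]) -> Set[int]:
    """Dictionary-based selection: words found on the DA side of a dictionary."""
    return {index for index, token in enumerate(tokens) if token in da_side}


def select_all(tokens: List[str]) -> Set[int]:
    """'All' selection: every word in the sentence."""
    return set(range(len(tokens)))


# Words taken from the transliterated DA sentence in Table 1.
tagged = "mA Hyktbwlw/DIA ςHyT/DIA AlSfHh".split()
clean, by_tag = select_user_tokens(tagged)
by_type = select_user_types(
    clean,
    word_list={"mA", "Hyktbwlw"},
    frequencies={"mA": 5000, "Hyktbwlw": 2},  # invented counts
    cutoff=100,
)
print(clean, sorted(by_tag), sorted(by_type))
```

Each function returns token positions rather than word strings, so that a later translation step could replace exactly the selected occurrences and leave the rest of the sentence untouched.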
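Section 5 states that the MSA paraphrases form a lattice whose paths are ranked by a language model to produce a top-1 or n-best output. The sketch below illustrates that idea in Python under strong simplifying assumptions: the lattice is reduced to a sequence of slots with alternative paraphrases, every path is enumerated exhaustively, and the language model is a toy bigram model with add-one style smoothing trained on two invented sentences. The text does not say which language modeling toolkit or lattice format ELISSA uses; BigramLM, n_best, the training sentences and the alternatives "mA" and "syktbwn lh" are all hypothetical.

```python
import itertools
import math
from collections import Counter
from typing import List, Sequence, Tuple


class BigramLM:
    """Toy bigram model with add-one style smoothing (illustration only)."""

    def __init__(self, sentences: Sequence[Sequence[str]]) -> None:
        self.contexts: Counter = Counter()
        self.bigrams: Counter = Counter()
        vocab = {"<s>", "</s>"}
        for sentence in sentences:
            padded = ["<s>", *sentence, "</s>"]
            vocab.update(sentence)
            self.contexts.update(padded[:-1])
            self.bigrams.update(zip(padded, padded[1:]))
        # Unseen words are not added to the vocabulary, so the scores are
        # not a properly normalized distribution; they only rank paths.
        self.vocab_size = len(vocab)

    def logprob(self, sentence: Sequence[str]) -> float:
        padded = ["<s>", *sentence, "</s>"]
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.contexts[a] + self.vocab_size))
            for a, b in zip(padded, padded[1:])
        )


def n_best(lattice: List[List[str]], lm: BigramLM, n: int = 1) -> List[Tuple[float, str]]:
    """Score every path through a slot lattice and return the n best.

    Each slot lists alternative paraphrases; an alternative may span
    several words. Exhaustive enumeration is exponential in the number
    of slots and is only reasonable for a small example like this one.
    """
    scored = []
    for choice in itertools.product(*lattice):
        words = " ".join(choice).split()
        scored.append((lm.logprob(words), " ".join(words)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n]


# Invented training data in the paper's transliteration scheme.
lm = BigramLM([s.split() for s in ["ln yktbwA lh", "lm yxbrhm"]])

# Invented lattice for the DA fragment 'mA Hyktbwlw' from Table 1.
lattice = [["ln", "mA"], ["yktbwA lh", "syktbwn lh"]]

top_1 = n_best(lattice, lm, n=1)[0][1]
print(top_1)  # ln yktbwA lh
```

With this toy data the highest scoring path is the paraphrase that matches the human MSA translation in Table 1, because all of its bigrams were seen in training. A practical system would use a dynamic programming search over the lattice rather than enumerating every path.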