An Arabic to English Example-Based Translation System

K. Bar, Y. Choueka, and N. Dershowitz

Manuscript received January 7, 2007.
K. Bar, Dept. of Computer Science, Tel Aviv University, Ramat Aviv, Israel (email: kfirbar@post.tau.ac.il).
Y. Choueka, Dept. of Computer Science, Bar-Ilan University, Ramat Gan, Israel (email: ycsarah@netvision.net.il).
N. Dershowitz, Dept. of Computer Science, Tel Aviv University, Ramat Aviv, Israel (email: nachumd@post.tau.ac.il).

I. INTRODUCTION

The example-based (or “memory-based”) paradigm has become a fairly common technique for natural language processing (NLP) and especially for machine-translation applications, ever since it was first proposed by Nagao in [1]. That paper expressed the main idea behind an example-based machine translation (EBMT) paradigm, namely to emulate the way a human translator operates in some cases. Such a system exploits a large bilingual corpus to find similar examples for fragments of the input source-language (Arabic, in our case) text, and imitates their translations [2]. Searching for similar fragments is called matching. Given a group of matched fragments, the next step is to extract possible translations from the target-language (English, in our case) version of the corpus. This is the transfer step. The last step is recombination, which is the generation of a complete target-language text, pasting together translated fragments. Fig. 1 outlines an example-based system for Arabic-to-English. The reader may refer to the comprehensive survey of example-based machine translation systems by Somers [3].

Fig. 1. Main steps of example-based translation system.

We describe an implementation of the major components of an EBMT system that translates short Modern Standard Arabic (MSA) sentences into English. It is a non-structural system, so it stores the translation examples as textual strings, with some additional morphology and part-of-speech information. Our work is still in progress. Currently, the system fragments any newly introduced input sentence and translates each fragment separately. Recombining those translations into a final coherent form is left for future work.

Our final goal is to develop an automated assistant for Arabic-to-English machine translation systems that work within a rule-based or statistical paradigm, so as to better handle complicated cases and especially to improve the fluency of the generated translations.

The following section is a general description of our system. In Section III, we give some experimental results using common automatic metrics. Conclusions are presented in Section IV.

II. SYSTEM DESCRIPTION

The translation examples we need were extracted from a collection of parallel unvocalized Arabic-English documents taken from the United Nations document inventory available under the Official Document System (ODS) [4]. We automatically aligned each parallel document on the paragraph level, and each parallel paragraph was taken as a translation example. These examples were morphologically analyzed using the well-known Buckwalter morphological analyzer (version 1.0) [5], and part-of-speech tagged using SVM-POS [6], in such a way that, for each word, we considered only the relevant Buckwalter analyses with the corresponding SVM-POS part-of-speech tag. A special lookup table that maps Arabic words to their corresponding English words in each parallel paragraph was also created. Actually, for each Arabic word in the translation example, we look up its English equivalents in the lexicon and expand that with synonyms from WordNet. Then we search the English version of the translation example for all instances on the lemma level and insert them in the table.

The Arabic version of the corpus was indexed on word, stem and lemma levels (stem and lemma as defined by the Buckwalter analyzer), so, for each given word, we are able to retrieve all translation examples that contain that word on any of the three levels.
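The three-level indexing described above can be pictured as three inverted indexes that are queried together. The following is only a minimal sketch under stated assumptions: the token layout (keys "word", "stem", "lemmas") and the function names are hypothetical and are not taken from the system itself, and the toy strings stand in for real Buckwalter analyses.

```python
from collections import defaultdict


def build_index(examples):
    """Index translation examples on the word, stem and lemma levels.

    `examples` maps an example id to a list of analyzed Arabic tokens.
    Each token is a dict with hypothetical keys "word", "stem" and
    "lemmas" (a list, because an unvocalized word may have several
    candidate lemmas).
    """
    index = {
        "word": defaultdict(set),
        "stem": defaultdict(set),
        "lemma": defaultdict(set),
    }
    for example_id, tokens in examples.items():
        for token in tokens:
            index["word"][token["word"]].add(example_id)
            index["stem"][token["stem"]].add(example_id)
            for lemma in token["lemmas"]:
                index["lemma"][lemma].add(example_id)
    return index


def lookup(index, token):
    """Return ids of all examples containing `token` on any of the three levels."""
    hits = set(index["word"].get(token["word"], ()))
    hits |= index["stem"].get(token["stem"], set())
    for lemma in token["lemmas"]:
        hits |= index["lemma"].get(lemma, set())
    return hits


# Toy data: the strings are placeholders, not real morphological analyses.
examples = {
    1: [{"word": "w1", "stem": "s1", "lemmas": ["l1"]}],
    2: [{"word": "w2", "stem": "s1", "lemmas": ["l2", "l3"]}],
    3: [{"word": "w3", "stem": "s3", "lemmas": ["l3"]}],
}
index = build_index(examples)
query = {"word": "w2", "stem": "s1", "lemmas": ["l3"]}
print(sorted(lookup(index, query)))
```

The query word matches example 2 on the word level, examples 1 and 2 on the stem level, and examples 2 and 3 on the lemma level, so all three examples are retrieved as candidates for matching.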
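Given a new input sentence, the system begins by searching the corpus for translation examples for which the Arabic version matches fragments of the input sentence. A matched fragment must contain at least two adjacent words in the same input sentence. The same fragment can be found in more than one translation example. Therefore, a special match-score is assigned to each fragment-translation pair, representing the quality of the matched fragment in the specific translation example. Fragments are matched word by word, so the score for a fragment is the average of the individual word match scores.

Words are matched on text, stem, lemma, and part-of-speech levels, with each level assigned a different score. Text (exact string) and stem matches credit the words with the maximum possible; a lemma match credits them with less, and a part-of-speech match credits the fragment match-score with a minimal amount. Table I summarizes the several match levels we used in our experiments.

TABLE I
WORD MATCHING LEVELS

Match level | Match score | Description
Text | 1 | Exact match of the words.
Stem | 0.9 | Words match in their stems but not in their surface form. For instance, the words الدستورية (“the constitutionality”) and دستوريتي (“my constitutional”) share the stem دستوري.
Lemma | Dynamic score | Words share a lemma. For instance, the following words match in their lemmas: مارق (“apostate”) and مراق (“apostates”). Note that the stems of these words are not the same.
Content | 0.8 | This level is planned but not yet implemented. The idea is that, for example, two location names would get a higher score than two dissimilar proper nouns.
Part of Speech | 0.3 | Words match only in their part of speech. For instance, both are nouns. Actually, we require that both also have the same tags for their affixes. For example, if a word is tagged as a noun and has the definite article prefix ال (“the”), the matched word must agree on both features: it must be a noun and also have the definite article prefix.
Common Word Match | 1 | This level is relevant only for common words and affixes, taken from a predefined list. These words/affixes are organized in groups that represent the same meaning. Clearly, a word/affix may be a member of more than one group. Words/affixes that are members of the same group are also matched on this level. For example, the prefix ب (“with”, “by”, “in”) is in the same group as the preposition في (“in”).

Text and stem matches receive almost the same score since, currently, we do not yet handle the translation modification needed. When dealing with unvocalized text, there are, of course, complicated situations when both words have the same stem but different lemmas, for example, the words كتب (“wrote”) and كتب (“books”). Such cases are not yet handled, since we have not worked with a context-sensitive Arabic lemmatizer and so cannot derive the correct lemma of an Arabic word. Still, the combination of the Buckwalter morphological analyzer and the SVM-POS tagger allows us to reduce the number of possible lemmas for every Arabic word so as to reduce the amount of ambiguity. Actually, by lemma match, we mean that words match on any one of their possible lemmas. The match score in such a case is the ratio between the number of equal lemmas and the total number of lemma pairs (one per word). Further investigation, as well as developing and working with a context-sensitive Arabic lemmatizer, is needed to better handle all such situations.

Fragments with a score below some predefined threshold are discarded, since passing low-score fragments to the next step dramatically increases total running time. Note that a larger corpus, with the concomitant increase in the number of potential fragments, would require raising the threshold.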
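The word-level scoring of Table I and the averaging into a fragment match-score can be sketched as follows. This is an illustrative reading, not the system's code: the token layout and function names are hypothetical, the lemma score uses one possible interpretation of the "equal lemmas over lemma pairs" ratio (shared lemmas divided by the number of lemma pairs formed from the two words), the affix-agreement requirement of the part-of-speech level is omitted, and the Content and Common Word Match levels are left out.

```python
# Fixed scores taken from Table I.
TEXT_SCORE = 1.0
STEM_SCORE = 0.9
POS_SCORE = 0.3


def word_match_score(a, b):
    """Score one input word against one example word.

    Tokens are dicts with hypothetical keys "word", "stem", "lemmas", "pos".
    Levels are tried from the most to the least specific.
    """
    if a["word"] == b["word"]:
        return TEXT_SCORE
    if a["stem"] == b["stem"]:
        return STEM_SCORE
    lemmas_a, lemmas_b = set(a["lemmas"]), set(b["lemmas"])
    shared = lemmas_a & lemmas_b
    if shared:
        # Assumed interpretation of the dynamic lemma score:
        # equal lemma pairs divided by all lemma pairs (one lemma per word).
        return len(shared) / (len(lemmas_a) * len(lemmas_b))
    if a["pos"] == b["pos"]:
        return POS_SCORE
    return 0.0


def fragment_match_score(input_tokens, example_tokens):
    """Average of the word scores for two fragments of equal length."""
    scores = [word_match_score(a, b) for a, b in zip(input_tokens, example_tokens)]
    return sum(scores) / len(scores)


def keep_fragment(score, threshold=0.5):
    """Discard low-score fragments; the threshold value here is made up."""
    return score >= threshold


# Toy tokens with placeholder analyses.
same = {"word": "w1", "stem": "s1", "lemmas": ["l1"], "pos": "NN"}
pos_only_a = {"word": "w2", "stem": "s2", "lemmas": ["l2"], "pos": "NN"}
pos_only_b = {"word": "w3", "stem": "s3", "lemmas": ["l3"], "pos": "NN"}
score = fragment_match_score([same, pos_only_a], [same, pos_only_b])
print(score, keep_fragment(score))
```

Under this reading, an ambiguous word with many candidate lemmas can score lower on the lemma level than a plain part-of-speech match; whether the original system guards against that is not stated in the text.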
Fragments are stored in a structure comprising the following: (1) the source pattern: the fragment's Arabic text, taken from the input sentence; (2) the example pattern: the fragment's Arabic text, taken from the matched translation example; (3) the English translation of the example pattern; (4) the match score of the fragment and its example translation.

For efficiency, fragments sharing the same example pattern are collected and stored in a higher-level, general-fragment structure. (Note that a general-fragment consisting of only one fragment is also possible.)

The input to the transfer step consists of all the collected general-fragments that were found in the matching step, and its output is the translations of those general-fragments. The translation of a general-fragment is taken to be the best generated translation among the comprised fragments. Translating a fragment is done in two main steps: (1) extracting the translation of the example pattern from the English version of the translation example; (2) fixing the extracted translation so that it will be the translation of the fragment's source pattern.

The first step is to extract the translation of the fragment's example pattern from the English version of the translation example. Here we use the lookup table prepared for every translation example within our corpus. For every Arabic word in the pattern, we look up its English equivalents in the table and mark them in the English version of the translation example. Then, we extract the shortest English segment that contains the maximum number of equivalent words. Usually a word in some Arabic example pattern has several English equivalents, which makes the translation extraction process complicated and error prone. For this reason, we also restrict the ratio between the number of Arabic words in the example pattern and the number of English words in the extracted translation, bounding it by a function of the ratio between the total number of words in the Arabic and English versions of the translation example.

For example, take the following translation example:

A: الخدمات الاستشارية والتعاون التقني في ميدان حقوق الإنسان
E: “Advisory services and technical cooperation in the field of human rights.”

Table II is the corresponding lookup table. Now, suppose the example pattern is ميدان حقوق الإنسان (“the field of human rights”), so we want to extract its translation from the English version of the translation example. Using the extracted lookup table, we mark the English equivalents of the pattern words in the translation example (marked words are shown in brackets): “Advisory services and technical cooperation in the [field] of [human] [rights]”, and then we extract the shortest English segment that contains the maximum number of equivalent words, viz. “field of human rights”.

TABLE II
ALIGNMENT LOOKUP TABLE

English | Arabic
Services | الخدمات
Advisory | الاستشارية
Cooperation | والتعاون
Technical | التقني
In | في
Field | ميدان
Rights | حقوق
Human | الإنسان
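The extraction step illustrated by the worked example (mark the English equivalents of the pattern words, then take the shortest segment covering the most of them) can be sketched as below. It is a simplified sketch under stated assumptions: the function name and data layout are hypothetical, "maximum number of equivalent words" is read as the number of distinct Arabic pattern words covered, and the length-ratio restriction described in the text is not implemented.

```python
def extract_translation(pattern_words, english_tokens, lookup_table):
    """Return the shortest English segment covering the most pattern words.

    `lookup_table` maps an Arabic word to a set of lower-case English
    equivalents found in this translation example (as in Table II).
    """
    # For every English position, record which pattern words it translates.
    marked = [
        {ar for ar in pattern_words if tok.lower() in lookup_table.get(ar, set())}
        for tok in english_tokens
    ]
    best = None  # ((covered_count, -segment_length), start, end)
    for start in range(len(english_tokens)):
        if not marked[start]:
            continue  # a segment must start on a marked word
        covered = set()
        for end in range(start, len(english_tokens)):
            covered |= marked[end]
            if not marked[end]:
                continue  # ...and end on a marked word
            key = (len(covered), -(end - start + 1))
            if best is None or key > best[0]:
                best = (key, start, end)
    if best is None:
        return []
    _, start, end = best
    return english_tokens[start:end + 1]


# Data from the worked example and Table II.
lookup_table = {
    "ميدان": {"field"},
    "حقوق": {"rights"},
    "الإنسان": {"human"},
}
english = "Advisory services and technical cooperation in the field of human rights".split()
pattern = ["ميدان", "حقوق", "الإنسان"]
print(" ".join(extract_translation(pattern, english, lookup_table)))
```

With the Table II entries this yields "field of human rights", the segment given in the text. When a pattern word has several English equivalents in the example, more positions are marked and the choice becomes more error prone, which is the situation the ratio restriction is meant to limit.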
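This is of course a simple example. More complicated ones would have more than one equivalent for each Arabic word. Sometimes it is hard to find the corresponding English equivalents for a specific Arabic word. Usually this happens when the Arabic word is part of some phrase, whereas its translation does not follow word for word, as in, for example, the Arabic example pattern غير رسمي, meaning “not formal”. In many cases, we might find “informal” in the English version instead. The problem is that neither the synonym list of the word رسمي (“formal”), nor the list of the word غير (“not”), contains the word “informal”. Such a situation is handled by a manually defined rule that is triggered whenever the word غير (“not”) appears. The system checks the following word, and instead of building a synonym list builds an antonym list, using WordNet. In this example, the word “informal” appears as an antonym of the word “formal” in WordNet.

There are more complicated structures that are not handled yet, but capturing and writing rules for such cases seems quite feasible.

Recall that the match of a corpus fragment to the input fragment can be inexact: words may be matched on several levels. Exactly matched words are assumed to have the same translation, but stem or lemma matched words may require modifications (mostly inflection and preposition issues) to the extracted translation. These issues were left for future work. Words matched on the part-of-speech level require a complete change of meaning. For example, take the input fragment مجلس الأمن (“the Security Council”), matched to the fragment مسؤولية الأمن (“the security responsibility”) in some translation example. The words مجلس (“council”) and مسؤولية (“responsibility”) are matched on the part-of-speech level (both are nouns). Assume that the extracted translation from the translation example is “the security responsibility”, which is actually a translation of مسؤولية الأمن (“the security responsibility”) and is not the translation of the input pattern at all. But, by replacing the word “responsibility” from the translation example with the translation of مجلس (“council”) from the lexicon, we get the correct phrase: “the security council”. The lexicon is implemented using the glossaries extracted from the Buckwalter morphological analyzer and expanded with WordNet synonyms, as was explained above.

Sometimes the extracted translation contains some extra unnecessary words in the middle. Those words appear mostly because of the different structure of a noun-phrase in the two languages. For example, consider the example موضوع الأمن الإقليمي and its translation: “the subject of regional security”. By extracting the translation of the pattern موضوع الأمن, we obtain: “the subject of regional security” (since it is the shortest segment that contains maximum word alignments). Clearly, the word “regional” is unnecessary in the translation because it is the translation of the word الإقليمي (“the regional”), which does not appear in the pattern. So by removing that word from the translation we obtain the correct translation of the pattern. The word “regional” appears in the extracted translation due to the fact that Arabic adjectives come after the nouns they qualify, which is the opposite of English syntax. Here, the noun-phrase الأمن الإقليمي (“the regional security”) is translated so that the translation of الإقليمي (“the regional”) appears before the translation of الأمن (AlAmn, security). Currently, identifying such situations is done by searching for the translation of the word “regional” in a fixed number of Arabic words that come immediately after the pattern in the translation example. However, this method is insufficient for more complex situations and is also very time consuming. Our plan is to apply an Arabic chunker to extract the boundaries of the noun-phrase and in that way delimit the search area.

Removing unnecessary words from the extracted translation must preserve the correct English syntax of the remaining translation, which in some cases seems to be a difficult task. For that purpose, we have compiled several rules to deal with different situations. These rules are based on the syntax of the English extracted translation and identify cases that need special care. First, we chunk the translation to discover its basic noun-phrases, using the BaseNP [7] chunker. To do that, we first apply Brill's part-of-speech tagger [8] to the translation. Then, by looking at the chunked English text, we can ascertain the effect of removing the unnecessary word. In the previous example, removing the word “regional” from the text, “the subject of regional security”, may be done without any further modification, since by tagging and chunking the segment we get

[the/DT subject/NN] of/IN [regional/JJ security/NN]

(the phrases in brackets are noun-phrases) and “regional” is simply an adjective within a noun-phrase, which still has the same head. Prepositions and other function words that relate to the phrase are still necessary, so we keep them.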
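The current heuristic for spotting an extra word (look for English words that translate one of the few Arabic words immediately following the pattern in the translation example) can be sketched as follows. This is a rough sketch under stated assumptions: the function name, the window size and the data layout are hypothetical, and the chunker-based check that the removal keeps the English well formed is not modeled.

```python
def remove_extra_words(extracted_tokens, pattern_words, following_words,
                       lookup_table, window=2):
    """Drop English words that translate Arabic words just after the pattern.

    `following_words` are the Arabic words that come immediately after the
    example pattern in the translation example; only the first `window` of
    them are considered. `lookup_table` maps an Arabic word to a set of
    lower-case English equivalents.
    """
    pattern_equivalents = set()
    for word in pattern_words:
        pattern_equivalents |= lookup_table.get(word, set())
    extra_equivalents = set()
    for word in following_words[:window]:
        extra_equivalents |= lookup_table.get(word, set())
    # Never remove a word that could also translate a pattern word.
    extra_equivalents -= pattern_equivalents
    return [tok for tok in extracted_tokens if tok.lower() not in extra_equivalents]


# Data from the "regional security" example in the text.
lookup_table = {
    "موضوع": {"subject"},
    "الأمن": {"security"},
    "الإقليمي": {"regional"},
}
extracted = "the subject of regional security".split()
fixed = remove_extra_words(extracted, ["موضوع", "الأمن"], ["الإقليمي"], lookup_table)
print(" ".join(fixed))
```

For the example this prints "the subject of security". In the system as described, the removal would additionally be validated against the tagged and chunked English text before being accepted.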
As already mentioned, a general-fragment may contain several fragments sharing the same Arabic example pattern. Among the extracted translations of the comprised fragments, which are all translations of the same Arabic pattern, we choose the translation that covers the maximum number of Arabic words to represent the general-fragment. The translation-score calculated for the chosen translation is the ratio between the number of covered words and the total number of words in the Arabic pattern. The total-score of a general-fragment is the multiplication of its match-score and its translation-score.

In the recombination step, we paste together the extracted translations to form a complete translation of the input sentence. This is generally composed of two subtasks. The first is finding the N best recombinations of the extracted translations that cover the entire input sentence, and the second is smoothing out the recombined translations to make a fully grammatical English sentence. Currently, we handle only the first subtask; the second is left for future work. By multiplying the total-scores of the comprised general-fragments, we calculate a final-translation-score for each generated recombination. The N best (where N is configurable) recombinations are reported.
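III. EXPERIMENTAL RESULTS

Experiments were conducted on a corpus containing 13,500 translation examples. The following results are based on 400 short Arabic sentences (5.5 words per sentence on average) that were taken from unseen documents of the United Nations inventory. The ten best results were evaluated by some of the common automatic criteria for machine translation evaluation (BLEU [9], NIST, and METEOR [10]), although our system is still under construction. Also, we used only two different translation references for the evaluation. Table III shows some preliminary experimental results. The first row contains the results of evaluating the system's highest ranked translation for each input sentence. The second is the same, but on the best translation from the viewpoint of a human referee. In most cases, the best translation chosen by the referee had a close (or even the same) final-translation-score as the system's best translation.

TABLE III
EVALUATION RESULTS

Translation evaluated | BLEU (4-gram) | NIST | METEOR
Best translation chosen by the system | 0.1849 | 4.1792 | 0.4851
Best translation chosen by a human referee | 0.2488 | 5.1281 | 0.5363

IV. CONCLUSION

We believe we have demonstrated the potential of the example-based approach for Arabic, with only minimum investment in Arabic syntactical and linguistic issues. We found that matching fragments on the level of lemma and stem, as well as part-of-speech, enabled the system to better exploit the small number of examples in the corpus we used. More work is needed to enlarge and enrich the corpus, as well as to formulate rules to deal with various problematic situations that are not yet handled. This all appears quite feasible. Finally, we do not claim that the example-based method is sufficient to handle the complete translation process. It seems that, for Arabic, it should work together with some kind of rule-based engine, as part of a multi-engine system, so as to better handle more complicated situations.

REFERENCES

[1] M. Nagao, “A Framework of Mechanical Translation between Japanese and English by Analogy Principle”, in A. Elithorn and R. Banerji, eds., Artificial and Human Intelligence. North-Holland, 1984.
[2] S. Sato and M. Nagao, “Toward memory-based translation”, in Proceedings of COLING-90, vol. 3, pp. 247-252, 1990.
[3] H. L. Somers, “Review article: Example-based machine translation”, Machine Translation, pp. 113-157, 1999.
[4] United Nations Official Document System (ODS), URL http://www.ods.un.org (viewed on 29/11/06).
[5] T. Buckwalter, “Buckwalter Arabic Morphological Analyzer Version 1.0”. Linguistic Data Consortium, Philadelphia, 2002. URL http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49 (viewed on 21/11/2006).
[6] M. Diab, K. Hacioglu and D. Jurafsky, “Automatic tagging of Arabic text: from raw text to base phrase chunks”, The National Science Foundation, USA, 2004.
[7] L. Ramshaw and M. Marcus, “Text chunking using transformation-based learning”, in Proceedings of the Third Workshop on Very Large Corpora, MIT, 1995.
[8] E. Brill, “A simple rule-based part of speech tagger”, in Proceedings of the DARPA Speech and Natural Language Workshop, pp. 112-116. Morgan Kaufmann, San Mateo, California, 1992.
[9] K. Papineni, S. Roukos, T. Ward and W. J. Zhu, “Bleu: a method for automatic evaluation of machine translation”, in Proceedings of the 40th Annual Meeting of the ACL, pp. 311-318, Philadelphia, PA, July, 2002.
[10] S. Banerjee and A. Lavie, “Meteor: an automatic metric for MT evaluation with improved correlation with human judgments”, in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72, Ann Arbor, MI, June, 2005.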