               325                                                                                                                                                     1
               
An Arabic to English Example-Based Translation System

K. Bar, Y. Choueka, and N. Dershowitz

Manuscript received January 7, 2007.
K. Bar, Dept. of Computer Science, Tel Aviv University, Ramat Aviv, Israel (email: kfirbar@post.tau.ac.il).
Y. Choueka, Dept. of Computer Science, Bar-Ilan University, Ramat Gan, Israel (email: ycsarah@netvision.net.il).
N. Dershowitz, Dept. of Computer Science, Tel Aviv University, Ramat Aviv, Israel (email: nachumd@post.tau.ac.il).

I. INTRODUCTION

The example-based (or “memory-based”) paradigm has become a fairly common technique for natural language processing (NLP), and especially for machine-translation applications, ever since it was first proposed by Nagao in [1]. That paper expressed the main idea behind the example-based machine translation (EBMT) paradigm, namely to emulate the way a human translator operates in some cases. Such a system exploits a large bilingual corpus to find similar examples for fragments of the input source-language (Arabic, in our case) text, and imitates their translations [2]. Searching for similar fragments is called matching. Given a group of matched fragments, the next step is to extract possible translations from the target-language (English, in our case) version of the corpus. This is the transfer step. The last step is recombination, which is the generation of a complete target-language text by pasting together translated fragments. Fig. 1 outlines an example-based system for Arabic to English. The reader may refer to the comprehensive survey of example-based machine translation systems by Somers [3].

Fig. 1. Main steps of example-based translation system.

We describe an implementation of the major components of an EBMT system that translates short Modern Standard Arabic (MSA) sentences into English. It is a non-structural system, so it stores the translation examples as textual strings, with some additional morphology and part-of-speech information. Our work is still in progress. Currently, the system fragments any newly introduced input sentence and translates each fragment separately. Recombining those translations into a final coherent form is left for future work.

Our final goal is to develop an automated assistant for Arabic-to-English machine translation systems that work within a rule-based or statistical paradigm, so as to better handle complicated cases and especially to improve the fluency of the generated translations.

The following section is a general description of our system. In Section III, we give some experimental results using common automatic metrics. Conclusions are presented in Section IV.

II. SYSTEM DESCRIPTION

The translation examples we need were extracted from a collection of parallel unvocalized Arabic-English documents taken from the United Nations document inventory available under the Official Document System (ODS) [4]. We automatically aligned each parallel document on the paragraph level, and each parallel paragraph was taken as a translation example. These examples were morphologically analyzed using the well-known Buckwalter morphological analyzer (version 1.0) [5], and part-of-speech tagged using SVM-POS [6], in such a way that, for each word, we considered only the relevant Buckwalter analyses with the corresponding SVM-POS part-of-speech tag. A special lookup table that maps Arabic words to their corresponding English words in each parallel paragraph was also created. Actually, for each Arabic word in the translation example, we look up its English equivalents in the lexicon and expand that with synonyms from WordNet. Then we search the English version of the
translation example for all instances on the lemma level and insert them in the table.

The Arabic version of the corpus was indexed on word, stem and lemma levels (stem and lemma as defined by the Buckwalter analyzer), so, for each given word, we are able to retrieve all translation examples that contain that word on any of the three levels.

Given a new input sentence, the system begins by searching the corpus for translation examples for which the Arabic version matches fragments of the input sentence. A matched fragment must contain at least two adjacent words in the same input sentence. The same fragment can be found in more than one translation example. Therefore, a special match-score is assigned to each fragment-translation pair, representing the quality of the matched fragment in the specific translation example. Fragments are matched word by word, so the score for a fragment is the average of the individual word match-scores.

Words are matched on text, stem, lemma, and part-of-speech levels, with each level assigned a different score. Text (exact string) and stem matches credit the words with the maximum possible; a lemma match credits them with less, and part-of-speech credits the fragment match-score with a minimal amount. Table I summarizes the several match levels we used in our experiments.

Text and stem match receive almost the same score since, currently, we do not yet handle the translation modification needed. When dealing with unvocalized text, there are, of course, complicated situations when both words have the same stem but different lemmas, for example, the words كتب (“wrote”) and كتب (“books”). Such cases are not yet handled, since we have not worked with a context-sensitive Arabic lemmatizer and so cannot derive the correct lemma of an Arabic word. Still, the combination of the Buckwalter morphological analyzer and the SVM-POS tagger allows us to reduce the number of possible lemmas for every Arabic word so as to reduce the amount of ambiguity. Actually, by lemma match, we mean that words match on any one of their possible lemmas. The match score in such a case is the ratio between the number of equal lemmas and the total number of lemma
pairs (one per word). Further investigation, as well as developing and working with a context-sensitive Arabic lemmatizer, is needed to better handle all such situations.

TABLE I
WORD MATCHING LEVELS

Match Level | Description | Match Score
Text | Exact match of the words. | 1
Stem | Words match in their stems but not in their surface form. For instance, the words الدستورية (“the constitutionality”) and دستوريتي (“my constitutional”) share the stem دستوري. | 0.9
Lemma | Words share a lemma. For instance, the following words match in their lemmas: مارق (“apostate”) and مراق (“apostates”). Note that the stems of these words are not the same. | Dynamic score
Content | This level is planned but not yet implemented. The idea is that, for example, two location names would get a higher score than two dissimilar proper nouns. | 0.8
Part of Speech | Words match only in their part-of-speech. For instance, both are nouns. Actually, we require that both also have the same tags for their affixes. For example, if a word is tagged as a noun and has the definite-article prefix ال (“the”), the matched word must agree on both features: it must be a noun and also have the definite-article prefix. | 0.3
Common Word Match | This level is relevant only for common words and affixes, taken from a predefined list. These words/affixes are organized in groups that represent the same meaning. Clearly, a word/affix may be a member of more than one group. Words/affixes that are members of the same group are also matched on this level. For example, the prefix ب (“with”, “by”, “in”) is in the same group as the preposition في (“in”). | 1

Fragments with a score below some predefined threshold are discarded, since passing low-score fragments to the next step dramatically increases total running time. Note that a larger corpus, with the concomitant increase in the number of potential fragments, would require raising the threshold.

Fragments are stored in a structure comprising the following: (1) the source pattern: the fragment’s Arabic text, taken from the input sentence; (2) the example pattern: the fragment’s Arabic text, taken from the matched translation example; (3) the English translation of the example pattern; (4) the match-score of the fragment and its example translation. For efficiency, fragments sharing the same example pattern are collected and stored in a higher-level, general-fragment structure. (Note that a general fragment consisting of only one fragment is also possible.)

The input to the transfer step consists of all the collected general-fragments that were found in the matching step, and its output is the translations of those general-fragments. The translation of a general-fragment is taken to be the best generated translation among the comprised fragments. Translating a fragment is done in two main steps: (1) extracting the translation of the example pattern from the English version of the translation example; (2) fixing the extracted translation so that it will be the translation of the fragment’s source pattern.

The first step is to extract the translation of the fragment’s example pattern from the English version of the translation example. Here we use the prepared lookup table for every translation example within our corpus. For every Arabic word in the pattern, we look up its English equivalents in the table
and mark them in the English version of the translation example. Then, we extract the shortest English segment that contains the maximum number of equivalence words. Usually a word in some Arabic example pattern has several English equivalents, which makes the translation extraction process complicated and error prone. For this reason, we also restrict the ratio between the number of Arabic words in the example pattern and the number of English words in the extracted translation, bounding it by a function of the ratio between the total number of words in the Arabic and English versions of the translation example.

For example, take the following translation example:
A: الخدمات الاستشارية والتعاون التقني في ميدان حقوق الإنسان
E: “Advisory services and technical cooperation in the field of human rights.”
Table II is the corresponding lookup table. Now, suppose the example pattern is ميدان حقوق الإنسان (“the field of human rights”), so we want to extract its translation from the English version of the translation example. Using the extracted lookup, we mark the English equivalences of the pattern words in the translation example (marked words are shown in brackets): “Advisory services and technical cooperation in the [field] of [human] [rights]”, and then we extract the shortest English segment that contains the maximum number of equivalent words, viz. “field of human rights”.

TABLE II
ALIGNMENT LOOKUP TABLE

English | Arabic
Services | الخدمات
Advisory | الاستشارية
Cooperation | والتعاون
Technical | التقني
In | في
Field | ميدان
Rights | حقوق
Human | الإنسان

This is of course a simple example. More complicated ones would have more than one equivalent for each Arabic word.

Sometimes it is hard to find the corresponding English equivalents for a specific Arabic word. Usually this happens when the Arabic word is part of some phrase whose translation does not follow word for word, as in, for example, the Arabic example pattern غير رسمي, meaning “not formal”. In many cases, we might find “informal” in the English version instead. The problem is that neither the synonym list of the word رسمي (“formal”), nor the list of the word غير (“not”), contains the word “informal”. Such a situation is handled by a manually defined rule that is triggered whenever the word غير (“not”) appears. The system checks the following word and, instead of building a synonym list, builds an antonym list, using WordNet. In this example, the word “informal” appears as an antonym of the word “formal” in WordNet.

There are more complicated structures that are not handled yet, but capturing and writing rules for such cases seems quite feasible.

Recall that the match of a corpus fragment to the input fragment can be inexact: words may be matched on several levels. Exactly matched words are assumed to have the same translation, but stem or lemma matched words may require modifications (mostly inflection and preposition issues) to the extracted translation. These issues were left for future work. Words matched on the part-of-speech level require a complete change of meaning. For example, take the input fragment مجلس الامن (“the Security Council”), matched to the fragment مسؤولية الامن (“the security responsibility”) in some translation example. The words مجلس (“council”) and مسؤولية (“responsibility”) are matched on the part-of-speech level (both are nouns). Assume that the extracted translation from the translation example is “the security responsibility”, which is actually a translation of مسؤولية الامن (“the security responsibility”) and is not the translation of the input pattern at all. But, by replacing the word “responsibility” from the translation example with the translation of مجلس (“council”) from the lexicon, we get the correct phrase: “the security council”. The lexicon is implemented using the glossaries extracted from the Buckwalter morphological analyzer and expanded with WordNet synonyms, as was explained above.

Sometimes the extracted translation contains some extra unnecessary words in the middle. Those words appear mostly because of the different structure of a noun-phrase in the two languages. For example, consider the example موضوع الامن الاقليمي and its translation: “the subject of regional security”. By extracting the translation of the pattern موضوع الامن, we obtain: “the subject of regional security” (since it is the shortest segment that contains maximum word alignments). Clearly, the word “regional” is unnecessary in the translation because it is the translation of the word الاقليمي (“the regional”), which does not appear in the pattern. So by removing that word from the translation we obtain the correct translation of the pattern. The word “regional” appears in the extracted translation due to the fact that Arabic adjectives come after the nouns they qualify, which is the opposite of English syntax. Here, the noun-phrase الامن الاقليمي (“the regional security”) is translated so that the translation of الاقليمي (“the regional”) appears before the translation of الامن (AlAmn, “security”). Currently, identifying such situations is done by searching for the translation of the word “regional” in a fixed number of Arabic words that come immediately after the pattern in the translation example. However, this method is insufficient for more complex situations and is also very time consuming. Our plan is to apply an Arabic chunker to extract the boundaries of the noun-phrase and in that way delimit the search area.

Removing unnecessary words from the extracted translation must preserve the correct English syntax of the remaining translation, which in some cases seems to be a difficult task.
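
The extraction step used in the examples of this subsection, taking the shortest English segment that contains the maximum number of aligned words, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `shortest_covering_segment`, the representation of the lookup table as one set of English equivalents per Arabic pattern word, and the reading of “maximum number” as the number of pattern words covered are all assumptions made here. The ratio restriction on the length of the extracted segment is omitted.

```python
def shortest_covering_segment(english_tokens, pattern_equivalents):
    """Return the shortest contiguous run of English tokens that covers as
    many Arabic pattern words as the whole sentence can cover.

    english_tokens      -- tokens of the English side of the example
    pattern_equivalents -- one set of lowercase English equivalents per
                           Arabic pattern word (hypothetical layout)
    """
    tokens = [t.lower() for t in english_tokens]

    def covered(lo, hi):
        # Number of pattern words with an equivalent inside tokens[lo:hi].
        window = set(tokens[lo:hi])
        return sum(1 for eq in pattern_equivalents if eq & window)

    target = covered(0, len(tokens))  # best coverage any segment can reach
    if target == 0:
        return []

    best = None
    for lo in range(len(tokens)):
        for hi in range(lo + 1, len(tokens) + 1):
            if covered(lo, hi) == target:
                if best is None or hi - lo < best[1] - best[0]:
                    best = (lo, hi)
                break  # extending this window further only makes it longer
    return english_tokens[best[0]:best[1]]


sentence = ("Advisory services and technical cooperation "
            "in the field of human rights").split()
# Equivalents of the three pattern words, as listed in Table II.
equivalents = [{"field"}, {"rights"}, {"human"}]
print(" ".join(shortest_covering_segment(sentence, equivalents)))
# prints: field of human rights
```

The quadratic scan is chosen for readability; translation examples here are single paragraphs, so a sliding-window variant would only matter for much longer inputs.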
To preserve the syntax in such cases, we have compiled several rules to deal with different situations. These rules are based on the syntax of the English extracted translation and identify cases that need special care. First, we chunk the translation to discover its basic noun-phrases, using the BaseNP [7] chunker. To do that, we first apply Brill’s part-of-speech tagger [8] to the translation. Then, by looking at the chunked English text, we can ascertain the effect of removing the unnecessary word. In the previous example, removing the word “regional” from the text, “the subject of regional security”, may be done without any further modification, since by tagging and chunking the segment we get
   [the/DT subject/NN] of/IN [regional/JJ security/NN]
(the phrases in brackets are noun-phrases) and “regional” is simply an adjective within a noun-phrase, which still has the same head. Prepositions and other function words that relate to the phrase are still necessary, so we keep them.

As already mentioned, a general-fragment may contain several fragments sharing the same Arabic example-pattern. Among the extracted translations of the comprised fragments, which are all translations of the same Arabic pattern, we choose the translation that covers the maximum number of Arabic words to represent the general-fragment. The translation-score calculated for the chosen translation is the ratio between the number of covered words and the total number of words in the Arabic pattern. The total-score of a general-fragment is the product of its match-score and its translation-score.

In the recombination step, we paste together the extracted translations to form a complete translation of the input sentence. This is generally composed of two subtasks. The first is finding the n best recombinations of the extracted translations that cover the entire input sentence, and the second is smoothing out the recombined translations to make a fully grammatical English sentence. Currently, we handle only the first subtask; the second is left for future work. By multiplying the total-scores of the comprised general-fragments, we calculate a final-translation-score for each generated recombination. The n best (where n is configurable) recombinations are reported.

III. EXPERIMENTAL RESULTS

Experiments were conducted on a corpus containing 13,500 translation examples. The following results are based on 400 short Arabic sentences (5.5 words per sentence on average) that were taken from unseen documents of the United Nations inventory. The ten best results were evaluated by some of the common automatic criteria for machine translation evaluation (BLEU [9], NIST, and METEOR [10]), although our system is still under construction. Also, we used only two different translation references for the evaluation. Table III shows some preliminary experimental results. The first row contains the results of evaluating the system’s highest ranked translation for each input sentence. The second is the same, but on the best translation from the viewpoint of a human referee. In most cases, the best translation chosen by the referee had a close (or even the same) final-translation score as the system’s best translation.

TABLE III
EVALUATION RESULTS

 | BLEU (4-gram) | NIST | METEOR
Best translation chosen by the system | 0.1849 | 4.1792 | 0.4851
Best translation chosen by a human referee | 0.2488 | 5.1281 | 0.5363

IV. CONCLUSION

We believe we have demonstrated the potential of the example-based approach for Arabic, with only minimum investment in Arabic syntactical and linguistic issues. We found that matching fragments on the level of lemma and stem, as well as part-of-speech, enabled the system to better exploit the small number of examples in the corpus we used. More work is needed to enlarge and enrich the corpus, as well as to formulate rules to deal with various problematic situations that are not yet handled. This all appears quite feasible. Finally, we do not claim that the example-based method is sufficient to handle the complete translation process. It seems that, for Arabic, it should work together with some kind of rule-based engine, as part of a multi-engine system, so as to better handle more complicated situations.

REFERENCES

[1] M. Nagao, “A Framework of Mechanical Translation between Japanese and English by Analogy Principle”, In A. Elithorn and R. Banerji, eds., Artificial and Human Intelligence. North-Holland, 1984.
[2] S. Sato and M. Nagao, “Toward memory-based translation,” COLING-90, vol. 3, pp. 247-252, 1990.
[3] H. L. Somers, “Review article: Example-based machine translation”, Machine Translation, vol. 14, pp. 113-157, 1999.
[4] United Nations Official Document System (ODS), URL http://www.ods.un.org (viewed on 29/11/06).
[5] T. Buckwalter, “Buckwalter Arabic Morphological Analyzer Version 1.0”. Linguistic Data Consortium, Philadelphia, 2002. URL http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49 (viewed on 21/11/2006).
[6] M. Diab, K. Hacioglu and D. Jurafsky, “Automatic tagging of Arabic text: from raw text to base phrase chunks”, The National Science Foundation, USA, 2004.
[7] L. Ramshaw and M. Marcus, “Text chunking using transformation-based learning”, In Proceedings of the Third Workshop on Very Large Corpora, MIT, 1995.
[8] E. Brill, “A simple rule-based part of speech tagger”, In Proceedings of the DARPA Speech and Natural Language Workshop, pp. 112-116. Morgan Kaufmann, San Mateo, California, 1992.
[9] K. Papineni, S. Roukos, T. Ward and W. J. Zhu, “Bleu: a method for automatic evaluation of machine translation”, Proceedings of the 40th Annual Meeting of the ACL, pp. 311-318, Philadelphia, PA, July, 2002.
[10] S. Banerjee and A. Lavie, “Meteor: an automatic metric for MT evaluation with improved correlation with human judgments”, In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72, Ann Arbor, MI, June, 2005.
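
The word-level and fragment-level match-scores of Section II can be illustrated with a short sketch. It is written under several assumptions that go beyond the text: the `Word` record and both function names are hypothetical; levels are tried in the fixed order text, stem, lemma, part-of-speech; the dynamic lemma score is read as the number of shared lemmas divided by the number of lemma pairs between the two words; the part-of-speech tag is assumed to already encode the affix tags; and the Content and Common Word levels of Table I are left out. The morphological values in the example are illustrative, not real Buckwalter output.

```python
from dataclasses import dataclass

# Fixed scores taken from Table I; the lemma score is computed dynamically.
TEXT_SCORE, STEM_SCORE, POS_SCORE = 1.0, 0.9, 0.3


@dataclass(frozen=True)
class Word:
    text: str          # surface form
    stem: str          # stem (illustrative value)
    lemmas: frozenset  # lemmas still possible after POS filtering
    pos: str           # POS tag, assumed to include affix tags


def word_match_score(a, b):
    """Score one input word against one example word, best level first."""
    if a.text == b.text:
        return TEXT_SCORE
    if a.stem == b.stem:
        return STEM_SCORE
    shared = a.lemmas & b.lemmas
    if shared:
        # Assumed reading: equal lemmas over all lemma pairs of the two words.
        return len(shared) / (len(a.lemmas) * len(b.lemmas))
    if a.pos == b.pos:
        return POS_SCORE
    return 0.0


def fragment_match_score(input_words, example_words):
    """Average of the word-level scores of two equally long fragments."""
    if not input_words or len(input_words) != len(example_words):
        return 0.0
    scores = [word_match_score(a, b) for a, b in zip(input_words, example_words)]
    return sum(scores) / len(scores)


council = Word("مجلس", "مجلس", frozenset({"مجلس"}), "NOUN")
responsibility = Word("مسؤولية", "مسؤولي", frozenset({"مسؤولية"}), "NOUN")
security = Word("الامن", "امن", frozenset({"امن"}), "DET+NOUN")

# "the Security Council" matched against "the security responsibility":
# one part-of-speech match (0.3) and one exact match (1.0).
score = fragment_match_score([council, security], [responsibility, security])
print(round(score, 2))
```

A fragment would then be kept only if this score reaches the predefined threshold; the threshold value itself is not given in the text.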
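
The score combination used in the transfer and recombination steps of Section II can be sketched as follows. The function names, the representation of a recombination as a pair of text and total-scores, and all numeric values are hypothetical; only the three formulas (translation-score as a coverage ratio, total-score as a product of match-score and translation-score, final-translation-score as a product of total-scores) come from the text.

```python
from math import prod


def translation_score(covered_words, pattern_length):
    """Share of the Arabic pattern words covered by the chosen translation."""
    if pattern_length <= 0:
        raise ValueError("pattern_length must be positive")
    return covered_words / pattern_length


def total_score(match_score, covered_words, pattern_length):
    """Total-score of a general-fragment: match-score times translation-score."""
    return match_score * translation_score(covered_words, pattern_length)


def final_translation_score(total_scores):
    """Score of one recombination: product of its fragments' total-scores."""
    return prod(total_scores)


def n_best(recombinations, n):
    """Return the n highest-scoring (text, total_scores) recombinations."""
    ranked = sorted(recombinations,
                    key=lambda r: final_translation_score(r[1]),
                    reverse=True)
    return ranked[:n]


# Hypothetical recombinations, each built from two general-fragments.
candidates = [
    ("translation A", [0.9, 0.5]),
    ("translation B", [0.8, 0.8]),
    ("translation C", [1.0, 0.3]),
]
for text, scores in n_best(candidates, 2):
    print(text, round(final_translation_score(scores), 2))
```

Because the final score is a product, a single weak fragment lowers the whole recombination, which is consistent with discarding low-scoring fragments before this step.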