Source: aclanthology.org


A Comprehensive NLP System for Modern Standard Arabic and Modern Hebrew
Morphological analysis, lemmatization, vocalization, disambiguation and text-to-speech

Dror Kamir, Naama Soreq, Yoni Neeman
Melingo Ltd., 16 Totseret Haaretz st., Tel-Aviv, Israel
drork@melingo.com, naamas@melingo.com, yonin@melingo.com

Abstract

This paper presents a comprehensive NLP system by Melingo that has been recently developed for Arabic, based on Morfix™ – an operational formerly developed highly successful comprehensive Hebrew NLP system.

The system discussed includes modules for morphological analysis, context sensitive lemmatization, vocalization, text-to-phoneme conversion, and syntactic-analysis-based prosody (intonation) model. It is employed in applications such as full text search, information retrieval, text categorization, textual data mining, online contextual dictionaries, filtering, and text-to-speech applications in the fields of telephony and accessibility and could serve as a handy accessory for non-fluent Arabic or Hebrew speakers.

Modern Hebrew and Modern Standard Arabic share some unique Semitic linguistic characteristics. Yet up to now, the two languages have been handled separately in Natural Language Processing circles, both on the academic and on the applicative levels. This paper reviews the major similarities and the minor dissimilarities between Modern Hebrew and Modern Standard Arabic from the NLP standpoint, and emphasizes the benefit of developing and maintaining a unified system for both languages.

1 Introduction

1.1 The common Semitic basis from an NLP standpoint

Modern Standard Arabic (MSA) and Modern Hebrew (MH) share the basic Semitic traits: rich morphology, based on consonantal roots (Jiðr / Šoreš)¹, which depends on vowel changes and in some cases consonantal insertions and deletions to create inflections and derivations.²

¹ A remark about the notation: Phonetic transcriptions always appear in Italics, and follow the IPA convention, except the following: ? – glottal stop, ¿ – voiced pharyngeal fricative (‘Ayn), đ – velarized d, ś – velarized s. Orthographic transliterations appear in curly brackets. Bound morphemes (affixes, clitics, consonantal roots) are written between two slashes. Arabic and Hebrew linguistic terms are written in phonetic spelling beginning with a capital letter. The Arabic term comes first.
² For a review on the different approaches to Semitic inflections see Beesley (2001), p. 2.

For example, in MSA: the consonantal root /ktb/ combined with the vocalic pattern CaCaCa derives the verb kataba ‘to write’. This derivation is further inflected into forms that indicate semantic features, such as number, gender, tense etc.: katab-tu ‘I wrote’, katab-ta ‘you (sing. masc.) wrote’, katab-ti ‘you (sing. fem.) wrote’, ?a-ktubu ‘I write/will write’, etc.

Similarly in MH: the consonantal root /ktv/ combined with the vocalic pattern CaCaC derives the verb katav ‘to write’, and its inflections are: katav-ti ‘I wrote’, katav-ta ‘you (sing. masc.)
wrote’, katav-t ‘you (sing. fem.) wrote’, e-xtov ‘I will write’ etc.

In fact, morphological similarity extends much further than this general observation, and includes very specific similarities in terms of the NLP systems, such as usage of nominal forms to mark tenses and moods of verbs; usage of pronominal enclitics to convey direct objects, and usage of proclitics to convey some prepositions. Moreover, the inflectional patterns and clitics are quite similar in form in most cases. Both languages exhibit construct formation (Iđa:fa / Smixut), which is similar in its structure and in its role. The suffix marking feminine gender is also similar, and similarity goes as far as peculiarities in the numbering system, where the female gender suffix marks the masculine. Some of these phenomena will be demonstrated below.

1.2 Lemmatization of Semitic Languages

A consistent definition of lemma is crucial for a data retrieval system. A lemma can be said to be the equivalent of a lexical entry: the basic grammatical unit of natural language that is semantically closed. In applications such as search engines, usually it is the lemma that is sought, while additional information including tense, number, and person is dispensable.

In MSA and MH a lemma is actually the common denominator of a set of forms (hundreds or thousands of forms in each set) that share the same meaning and some morphological and syntactic features. Thus, in MSA, the forms: ?awla:d, walada:ni, despite their remarkable difference in appearance, share the same lemma WALAD ‘a boy’. This is even more noticeable in verbs, where forms like kataba, yaktubu, kutiba, yuktabu, kita:ba and many more are all part of the same lemma: KATABA ‘to write’.

The rather large number of inflections and complex forms (forms that include clitics, see below 1.5) possible for each lemma results in a high total number of forms, which, in fact, is estimated to be the same for both languages: around 70 million³. The mapping of these forms into lemmas is inconclusive (see Dichy (2001), p. 24). Hence the question arises: what should be defined as a lemma in MSA and MH?

³ For Arabic - see Beesley (2001), p. 7. For Hebrew - our own sources.

The fact that MSA and MH morphology is root-based might promote the notion of identifying the lemma with the root. But this solution is not satisfactory: in most cases there is indeed a diachronic relation in meaning among words and forms of the same consonantal root. However, semantic shifts which occur over the years rule out this method in synchronic analysis. Moreover, some diachronic processes result in totally coincidental “sharing” of a root by two or more completely different semantic domains. For example, in MSA, the words fajr ‘dawn’ and infija:r ‘explosion’ share the same root /fjr/ (the latter might have originally been a metaphor). Similarly, in MH the verbs pasal ‘to ban, disqualify’ and pisel ‘to sculpt’ share the same root /psl/ (the former is an old loan from Aramaic).

In Morfix, as described below (2.1), a lemma is defined not as the root, but as the manifestation of this root, most commonly as the lesser marked form of a noun, adjective or verb. There is no escape from some arbitrariness in the implementation of this definition, due to the fine line between inflectional morphology and derivational morphology. However, Morfix generally follows the tradition set by dictionaries, especially bilingual dictionaries. Thus, for example, difference in part of speech entails different lemmas, even if the morphological process is partially predictable. Similarly each verb pattern (Wazn / Binyan) is treated as a different lemma.

Even so, the roots should not be overlooked, as they are a good basis for forming groups of lemmas; in other words, the root can often serve as a “super-lemma”, joining together several lemmas, provided they all share a semantic field.

1.3 The Issue of Nominal Inflections of Verbs

The inconclusive selection of lemmas in MSA and MH can be demonstrated by looking into an interesting phenomenon: the nominal inflections of verbs (roughly parallel to the Latin participle, see below). Since this issue is a good example both for a characteristic of Semitic NLP and for the similarities between MSA and MH, it is worthwhile to further elaborate on it.

Both MSA and MH use the nominal inflections of verbs to convey tenses, moods and aspects. These inflections are derived directly from the verb according to strict rules, and their forms are predictable in most cases. Nonetheless, grammatically, these forms behave as nouns or adjectives. This means that they bear case marking in MSA, nominal marking for number and gender (in both languages) and they can be definite or indefinite (in both languages). Moreover, these inflections often serve as nouns or adjectives in their own right. This, in fact, causes the crucial problem for data retrieval, since the system has to determine whether the user refers to the noun/adjective or rather to the verb for which it serves as inflection.

Nominal inflections of verbs exist in non-Semitic languages as well; in most European languages participles and infinitives have nominal features. However, two Semitic traits make this phenomenon more challenging in our case: the rich morphology which creates a large set of inflections for each base form (i.e. the verb is inflected to create nominal forms and then each form is inflected again for case, gender and number). Furthermore, Semitic languages allow nominal clauses, namely verbless sentences, which increase ambiguity. For example, in English it is easy to recognize the form ‘drunk’ in ‘he has drunk’ as related to the lemma DRINK (V) (and not as an adjective). This is done by spotting the auxiliary ‘has’ which precedes this form. However in MH, the clause axi šomer could mean ‘my brother is a guard’ or ‘my brother guards/is guarding’. The syntactical cues for the final decision are subtle and elusive. Similarly in MSA: axi ka:tibun could mean ‘my brother is writing’ or ‘my brother is a writer’.

1.4 Orthography

From the viewpoint of NLP, especially commercially applicable NLP, it is important to note that the writing systems of both MSA and MH follow the same conventions, in which most vowels are not marked. Therefore, in MSA the form yaktubu ‘he writes/will write’ is written {yktb}. Similarly in MH, the form yilmad ‘he will learn’ is written {ylmd}. Both languages have a supplementary marking system for vocalization (written above, under and beside the text), but it is not used in the overwhelming majority of texts. In both languages, when vowels do appear as letters, letters of consonantal origin are used, consequently turning these letters ambiguous (between their consonantal and vocalic readings).

It is easy to see the additional difficulty that this writing convention presents for NLP. The string {yktb} in MSA can be interpreted as yaktubu (future tense), yaktuba (subjunctive), yaktub (jussive), yuktabu (future tense passive) and even yuktibu ‘he dictates/will dictate’, a form that is considered by Morfix to be a different lemma altogether (see above 1.2). Furthermore, ambiguity can occur between totally unrelated words, as will be shown in section 1.7. A trained MSA reader can distinguish between these forms by using contextual cues (both syntactic and semantic). A similar contextual sensitivity must be programmed into the NLP system in order to meet this challenge.

Each language also has some orthographic peculiarities of its own. The most striking in MH is the multiple spelling conventions that are used simultaneously. The classical convention has been replaced in most texts with some kind of spelling system that partially indicates vowels, and thus reduces ambiguities. An NLP system has to take into account the various spelling systems and the fact that the classic convention is still occasionally used. Thus, each word often has more than one spelling. For example: the word shi?ur ‘a lesson’ can be written {š¿wr} or {šy¿wr}. The word kiven ‘to direct’ can be written {kwn} or {kywwn}; the former is the classical spelling (Ktiv Xaser) while the latter is the standard semi-vocalized system (Ktiv Male), but some non-standard spellings can also appear: {kywn}, {kwwn}.

MSA spelling is much more standardized and follows classic conventions. Nonetheless, some of these conventions may seem confusing at first sight. The Hamza sign, which represents the glottal stop phoneme, can be written in 5 different ways, depending on its phonological environment. Therefore, any change in vowels (a very regular phenomenon in MSA inflectional paradigms) results in a different shape of Hamza. This occurs even when the vowels themselves are not marked. Moreover, there is often more than one shape possible per form, without any mandatory convention. One could argue that all Hamza shapes should be encoded as one for our purposes. This may solve some problems, but then again it would deprive us of crucial information about the vowels in the word. Since the Hamza changes according to the vowels around it, it is a good cue for retrieving the vocalization of the word and for reducing ambiguity.
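
The ambiguity of the unvocalized string {yktb} discussed in this section can be illustrated with a minimal Python sketch. It is not the Morfix implementation: the function names `skeleton` and `group_by_spelling` are hypothetical, and the rule of simply deleting the vowels a, e, i, o, u from a phonetic transcription is a simplifying assumption that ignores long vowels, vowel letters and Hamza shapes. The forms and glosses are the ones quoted above.

```python
from collections import defaultdict

# Assumption: short vowels are never written, so dropping them from the
# phonetic transcription approximates the orthographic transliteration.
VOWELS = set("aeiou")


def skeleton(transcription):
    """Return the consonantal skeleton of a transcribed form."""
    return "".join(ch for ch in transcription if ch not in VOWELS and ch != "-")


def group_by_spelling(readings):
    """Group (form, gloss) pairs by the spelling they collapse to."""
    groups = defaultdict(list)
    for form, gloss in readings:
        groups[skeleton(form)].append((form, gloss))
    return dict(groups)


# Forms and glosses taken from the text of section 1.4.
READINGS = [
    ("yaktubu", "future tense"),
    ("yaktuba", "subjunctive"),
    ("yaktub", "jussive"),
    ("yuktabu", "future tense passive"),
    ("yuktibu", "he dictates/will dictate (different lemma)"),
    ("yilmad", "MH: he will learn"),
]

if __name__ == "__main__":
    for spelling, forms in group_by_spelling(READINGS).items():
        print("{" + spelling + "}", [form for form, _ in forms])
```

All five MSA forms fall into the single group {yktb}, which is the set of candidates a context-sensitive component would then have to choose from.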
1.5 Clitics and Complex Forms

The phenomenon which will be described in this section is related both to the morphological structure of MSA and MH, and to the orthographical conventions shared by these languages. Both languages use a diverse system of clitics⁴ that are appended to the inflectional forms, creating complex forms and further complications in proper lemmatization and data retrieval.

For example, in MSA, the form: ?awla:dun ‘boys (nom.)’, a part of the lemma WALAD ‘boy’, can take the genitive pronominal enclitic /-ha/ ‘her’ and create the complex form: ?awla:d-u-ha ‘boys-nom.-her (=her boys)’. This complex form is orthographically represented as follows: {?wladha}. Similarly in Hebrew, the form yeladim ‘children’ (of the lemma YELED ‘child’), combined with the genitive pronominal enclitic /-ha/ ‘her’, yields the complex form yelade-ha ‘children-her (=her children)’. The orthographical representation is: {yldyh}.

Enclitics usually denote genitive pronouns for nouns (as demonstrated above) and accusative pronouns for verbs. For example, in MSA, ?akaltu-hu ‘I ate it’ {?klth}, or in MH axalti-v ‘I ate it’ {?kltyw}. It is easy to see how this phenomenon, especially the orthographic convention which conjoins these enclitics to the basic form, may create confusion in lemmatizing and data retrieval. However, the nature of clitics, which limits their position and possible combinations, helps to locate them and trace the basic form from which the complex one was created.

There are also several proclitics denoting prepositions and other particles, attached to the following form by orthographic convention. The most common are the conjunctions /w, f/, the prepositions /b, l, k/ and the definite article /al/ in MSA, and the conjunction /w/, the prepositions /b, k, l, m/ (often referred to as Otiyot Baxlam), the relative pronoun /š/ and the definite article /h/ in MH. Therefore, in MSA, the phrase: wa-li-l-?awla:di ‘and to the boys’ will have the following orthographical representation: {wll?wlad}. In MH the phrase ve-la-yeladim ‘and to the children’ will be represented orthographically as: {wlyldym}. Once again, when scanning a written text, these proclitics must be taken into account in the lemmatization process.

1.6 Syntax

The syntactic structure of MSA and MH is very similar. In fact, the list of major syntactic rules is almost identical, though the actual application of these rules may differ between the languages.

A good demonstration of that is the agreement rule. Both languages demand a strict noun-adjective-verb agreement. The agreement includes features such as number, gender, definiteness and in MSA also case marking (in noun-adjective agreement). The MH agreement rule is more straightforward than the MSA one. For example: ha-yeladim ha-gdolim halxu ‘the-child-pl. the-big-pl. go-past-pl. (=The big children went)’. Note that all elements in the sentence are marked as plural, and the noun and the adjective also agree in definiteness.

The case of MSA is slightly different. MSA has incomplete agreement in verb-subject sentences, which are the vast majority. In this case the agreement of the verb will only be in gender but not in number, e.g. ðahaba l-?awla:du ‘go-past-masc.-sing. boy-pl. (=The boys went)’. MSA also distinguishes between human plural forms and non-human plural forms, i.e. if the plural form does not have a human referent, the verb or the adjective will be marked as feminine rather than plural, e.g. ðahabat el-kila:bu l-kabi:ratu ‘go-past-fem.-sing. the-dog-masc.-pl. the-big-fem.-sing. (=The big dogs went)’.

The example of the agreement rule demonstrates both the similarities and the differences between MSA and MH. Furthermore, it demonstrates how minor the differences are as far as our purposes go. As long as the agreement rule is taken into account, its actual implementation has hardly any consequences at the level of the system. This example also demonstrates a very useful cue to reduce ambiguity among forms. This cue is probably used intuitively by trained readers of MSA and MH, and encoding it into the Morfix NLP system turns out to be quite useful.

1.7 Ambiguity

Perhaps the major challenge for NLP analysis in MSA and MH is overcoming the ambiguity of

⁴ The term “clitics” is employed here as the closest term which can describe this phenomenon without committing to any linguistic theory.
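
The observation in section 1.5 that clitics are limited in position and in possible combinations can be illustrated with a short Python sketch that enumerates candidate proclitic splits for an MSA token. This is only an illustration under stated assumptions, not the Morfix algorithm: the function names `segmentations` and `analyses` and the one-word lexicon are hypothetical, the slot order (conjunction, preposition, article) is assumed, and the article is assumed to be written {al} except after the preposition {l}, where only {l} remains, as suggested by the example {wll?wlad}.

```python
# Proclitic inventories for MSA as listed in section 1.5,
# in the paper's orthographic transliteration.
CONJUNCTIONS = ("", "w", "f")
PREPOSITIONS = ("", "b", "l", "k")


def article_spellings(preposition):
    """Possible spellings of the article slot after a given preposition.

    Assumption: the article is written {al}, but only {l} after the
    preposition {l} (compare {wll?wlad} in the text).
    """
    return ("", "l") if preposition == "l" else ("", "al")


def segmentations(token):
    """Return every (proclitics, stem) split allowed by the slot order."""
    results = []
    for conj in CONJUNCTIONS:
        for prep in PREPOSITIONS:
            for art in article_spellings(prep):
                prefix = conj + prep + art
                stem = token[len(prefix):]
                if token.startswith(prefix) and stem:
                    clitics = tuple(c for c in (conj, prep, art) if c)
                    results.append((clitics, stem))
    return results


def analyses(token, lexicon):
    """Keep only the splits whose stem is a known form (hypothetical lexicon)."""
    return [(clitics, stem) for clitics, stem in segmentations(token) if stem in lexicon]


if __name__ == "__main__":
    toy_lexicon = {"?wlad"}  # hypothetical one-entry lexicon, for illustration only
    for clitics, stem in segmentations("wll?wlad"):
        print(clitics, stem)
    print(analyses("wll?wlad", toy_lexicon))
```

Because the slots are ordered and each holds at most one clitic, the number of candidate splits stays small, and a lexicon lookup is enough here to single out the split whose stem is {?wlad}. Enclitics and the MH inventory are left out of the sketch.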