A Comprehensive NLP System for Modern Standard Arabic and
Modern Hebrew
Morphological analysis, lemmatization, vocalization, disambiguation and text-to-speech

Dror Kamir, Naama Soreq, Yoni Neeman
Melingo Ltd., 16 Totseret Haaretz st., Tel-Aviv, Israel
drork@melingo.com, naamas@melingo.com, yonin@melingo.com

Abstract

This paper presents a comprehensive NLP system by Melingo that has been recently developed for Arabic, based on Morfix™, an operational, formerly developed and highly successful comprehensive Hebrew NLP system.

The system discussed includes modules for morphological analysis, context-sensitive lemmatization, vocalization, text-to-phoneme conversion, and a syntactic-analysis-based prosody (intonation) model. It is employed in applications such as full text search, information retrieval, text categorization, textual data mining, online contextual dictionaries, filtering, and text-to-speech applications in the fields of telephony and accessibility, and could serve as a handy accessory for non-fluent Arabic or Hebrew speakers.

Modern Hebrew and Modern Standard Arabic share some unique Semitic linguistic characteristics. Yet up to now, the two languages have been handled separately in Natural Language Processing circles, both on the academic and on the applicative levels. This paper reviews the major similarities and the minor dissimilarities between Modern Hebrew and Modern Standard Arabic from the NLP standpoint, and emphasizes the benefit of developing and maintaining a unified system for both languages.

1 Introduction

1.1 The common Semitic basis from an NLP standpoint

Modern Standard Arabic (MSA) and Modern Hebrew (MH) share the basic Semitic traits: rich morphology, based on consonantal roots (Jiðr / ʃoreʃ)[1], which depends on vowel changes and in some cases consonantal insertions and deletions to create inflections and derivations.[2]

[1] A remark about the notation: Phonetic transcriptions always appear in Italics, and follow the IPA convention, except the following: ? glottal stop, ¿ voiced pharyngeal fricative (Ayn), đ velarized d, ś velarized s. Orthographic transliterations appear in curly brackets. Bound morphemes (affixes, clitics, consonantal roots) are written between two slashes. Arabic and Hebrew linguistic terms are written in phonetic spelling beginning with a capital letter. The Arabic term comes first.

[2] For a review on the different approaches to Semitic inflections see Beesley (2001), p. 2.

For example, in MSA: the consonantal root /ktb/ combined with the vocalic pattern CaCaCa derives the verb kataba 'to write'. This derivation is further inflected into forms that indicate semantic features, such as number, gender, tense etc.: katab-tu 'I wrote', katab-ta 'you (sing. masc.) wrote', katab-ti 'you (sing. fem.) wrote', ?a-ktubu 'I write/will write', etc.

Similarly in MH: the consonantal root /ktv/ combined with the vocalic pattern CaCaC derives the verb katav 'to write', and its inflections are: katav-ti 'I wrote', katav-ta 'you (sing. masc.) wrote', katav-t 'you (sing. fem.) wrote', e-xtov 'I will write' etc.

In fact, morphological similarity extends much further than this general observation, and includes very specific similarities in terms of the NLP systems, such as usage of nominal forms to mark tenses and moods of verbs; usage of pronominal enclitics to convey direct objects, and usage of proclitics to convey some prepositions. Moreover, the inflectional patterns and clitics are quite similar in form in most cases. Both languages exhibit construct formation (Iđa:fa / Smixut), which is similar in its structure and in its role. The suffix marking feminine gender is also similar, and similarity goes as far as peculiarities in the numbering system, where the feminine gender suffix marks the masculine. Some of these phenomena will be demonstrated below.

1.2 Lemmatization of Semitic Languages

A consistent definition of lemma is crucial for a data retrieval system. A lemma can be said to be the equivalent of a lexical entry: the basic grammatical unit of natural language that is semantically closed. In applications such as search engines, usually it is the lemma that is sought, while additional information including tense, number, and person is dispensable.

In MSA and MH a lemma is actually the common denominator of a set of forms (hundreds or thousands of forms in each set) that share the same meaning and some morphological and syntactic features. Thus, in MSA, the forms ?awla:d and walada:ni, despite their remarkable difference in appearance, share the same lemma WALAD 'a boy'. This is even more noticeable in verbs, where forms like kataba, yaktubu, kutiba, yuktabu, kita:ba and many more are all part of the same lemma: KATABA 'to write'.

The rather large number of inflections and complex forms (forms that include clitics, see below 1.5) possible for each lemma results in a high total number of forms, which, in fact, is estimated to be the same for both languages: around 70 million.[3] The mapping of these forms into lemmas is inconclusive (see Dichy (2001), p. 24). Hence the question arises: what should be defined as a lemma in MSA and MH?

[3] For Arabic, see Beesley (2001), p. 7. For Hebrew, our own sources.

The fact that MSA and MH morphology is root-based might promote the notion of identifying the lemma with the root. But this solution is not satisfactory: in most cases there is indeed a diachronic relation in meaning among words and forms of the same consonantal root. However, semantic shifts which occur over the years rule out this method in synchronic analysis. Moreover, some diachronic processes result in totally coincidental sharing of a root by two or more completely different semantic domains. For example, in MSA, the words fajr 'dawn' and infija:r 'explosion' share the same root /fjr/ (the latter might have originally been a metaphor). Similarly, in MH the verbs pasal 'to ban, disqualify' and pisel 'to sculpt' share the same root /psl/ (the former is an old loan from Aramaic).

In Morfix, as described below (2.1), a lemma is defined not as the root, but as the manifestation of this root, most commonly as the less marked form of a noun, adjective or verb. There is no escape from some arbitrariness in the implementation of this definition, due to the fine line between inflectional morphology and derivational morphology. However, Morfix generally follows the tradition set by dictionaries, especially bilingual dictionaries. Thus, for example, difference in part of speech entails different lemmas, even if the morphological process is partially predictable. Similarly, each verb pattern (Wazn / Binyan) is treated as a different lemma.

Even so, the roots should not be overlooked, as they are a good basis for forming groups of lemmas; in other words, the root can often serve as a super-lemma, joining together several lemmas, provided they all share a semantic field.

1.3 The Issue of Nominal Inflections of Verbs

The inconclusive selection of lemmas in MSA and MH can be demonstrated by looking into an interesting phenomenon: the nominal inflections of verbs (roughly parallel to the Latin participle, see below). Since this issue is a good example both for a characteristic of Semitic NLP and for the similarities between MSA and MH, it is worthwhile to further elaborate on it.

Both MSA and MH use the nominal inflections of verbs to convey tenses, moods and aspects. These inflections are derived directly from the verb according to strict rules, and their forms are predictable in most cases. Nonetheless, grammatically, these forms behave as nouns or adjectives. This means that they bear case marking in MSA, nominal marking for number and gender (in both languages) and they can be definite or indefinite (in both languages). Moreover, these inflections often serve as nouns or adjectives in their own right. This, in fact, causes the crucial problem for data retrieval, since the system has to determine whether the user refers to the noun/adjective or rather to the verb for which it serves as inflection.

Nominal inflections of verbs exist in non-Semitic languages as well; in most European languages participles and infinitives have nominal features. However, two Semitic traits make this phenomenon more challenging in our case: the rich morphology, which creates a large set of inflections for each base form (i.e. the verb is inflected to create nominal forms and then each form is inflected again for case, gender and number). Furthermore, Semitic languages allow nominal clauses, namely verbless sentences, which increase ambiguity. For example, in English it is easy to recognize the form "drunk" in "he has drunk" as related to the lemma DRINK (V) (and not as an adjective). This is done by spotting the auxiliary "has" which precedes this form. However in MH, the clause axi ʃomer could mean 'my brother is a guard' or 'my brother guards/is guarding'. The syntactical cues for the final decision are subtle and elusive. Similarly in MSA: axi ka:tibun could mean 'my brother is writing' or 'my brother is a writer'.

1.4 Orthography

From the viewpoint of NLP, especially commercially applicable NLP, it is important to note that the writing systems of both MSA and MH follow the same conventions, in which most vowels are not marked. Therefore, in MSA the form yaktubu 'he writes/will write' is written {yktb}. Similarly in MH, the form yilmad 'he will learn' is written {ylmd}. Both languages have a supplementary marking system for vocalization (written above, under and beside the text), but it is not used in the overwhelming majority of texts. In both languages, when vowels do appear as letters, letters of consonantal origin are used, consequently turning these letters ambiguous (between their consonantal and vocalic readings).

It is easy to see the additional difficulty that this writing convention presents for NLP. The string {yktb} in MSA can be interpreted as yaktubu (future tense), yaktuba (subjunctive), yaktub (jussive), yuktabu (future tense passive) and even yuktibu 'he dictates/will dictate', a form that is considered by Morfix to be a different lemma altogether (see above 1.2). Furthermore, ambiguity can occur between totally unrelated words, as will be shown in section 1.7. A trained MSA reader can distinguish between these forms by using contextual cues (both syntactic and semantic). A similar contextual sensitivity must be programmed into the NLP system in order to meet this challenge.

Each language also has some orthographic peculiarities of its own. The most striking in MH is the multiple spelling conventions that are used simultaneously. The classical convention has been replaced in most texts with some kind of spelling system that partially indicates vowels, and thus reduces ambiguities. An NLP system has to take into account the various spelling systems and the fact that the classic convention is still occasionally used. Thus, each word often has more than one spelling. For example: the word shi?ur 'a lesson' can be written {ʃ¿wr} or {ʃy¿wr}. The word kiven 'to direct' can be written {kwn} or {kywwn}; the former is the classical spelling (Ktiv Xaser) while the latter is the standard semi-vocalized system (Ktiv Male), but some non-standard spellings can also appear: {kywn}, {kwwn}.

MSA spelling is much more standardized and follows classic conventions. Nonetheless, some of these conventions may seem confusing at first sight. The Hamza sign, which represents the glottal stop phoneme, can be written in 5 different ways, depending on its phonological environment. Therefore, any change in vowels (a very regular phenomenon in MSA inflectional paradigms) results in a different shape of Hamza. This occurs even when the vowels themselves are not marked. Moreover, there is often more than one shape possible per form, without any mandatory convention. One could argue that all Hamza shapes should be encoded as one for our purposes. This may solve some problems, but then again it would deny us crucial information about the vowels in the word. Since the Hamza changes according to the vowels around it, it is a good cue for retrieving the vocalization of the word, and for reducing ambiguity.
1.5 Clitics and Complex Forms

The phenomenon which will be described in this section is related both to the morphological structure of MSA and MH, and to the orthographical conventions shared by these languages. Both languages use a diverse system of clitics[4] that are appended to the inflectional forms, creating complex forms and further complications in proper lemmatization and data retrieval.

For example, in MSA, the form ?awla:dun 'boys (nom.)', a part of the lemma WALAD 'boy', can take the genitive pronominal enclitic /-ha/ 'her' and create the complex form ?awla:d-u-ha 'boys-nom.-her' (=her boys). This complex form is orthographically represented as follows: {?wladha}. Similarly in Hebrew, the form yeladim 'children' (of the lemma YELED 'child'), combined with the genitive pronominal enclitic /-ha/ 'her', yields the complex form yelade-ha 'children-her' (=her children). The orthographical representation is: {yldyh}.

Enclitics usually denote genitive pronouns for nouns (as demonstrated above) and accusative pronouns for verbs. For example, in MSA, ?akaltu-hu 'I ate it' {?klth}, or in MH axalti-v 'I ate it' {?kltyw}. It is easy to see how this phenomenon, especially the orthographic convention which conjoins these enclitics to the basic form, may create confusion in lemmatizing and data retrieval. However, the nature of clitics, which limits their position and possible combinations, helps to locate them and trace the basic form from which the complex one was created.

There are also several proclitics denoting prepositions and other particles, attached by orthographic convention to the form that follows them. The most common are the conjunctions /w, f/, the prepositions /b, l, k/ and the definite article /al/ in MSA, and the conjunction /w/, the prepositions /b, k, l, m/ (often referred to as Otiyot Baxlam), the relative pronoun /ʃ/ and the definite article /h/ in MH. Therefore, in MSA, the phrase wa-li-l-?awla:di 'and to the boys' will have the following orthographical representation: {wll?wlad}. In MH the phrase ve-la-yeladim 'and to the children' will be represented orthographically as: {wlyldym}. Once again, when scanning a written text, these proclitics must be taken into account in the lemmatization process.

1.6 Syntax

The syntactic structure of MSA and MH is very similar. In fact, the list of major syntactic rules is almost identical, though the actual application of these rules may differ between the languages.

A good demonstration of that is the agreement rule. Both languages demand a strict noun-adjective-verb agreement. The agreement includes features such as number, gender, definiteness and in MSA also case marking (in noun-adjective agreement). The MH agreement rule is more straightforward than the MSA one. For example: ha-yeladim ha-gdolim halxu 'the-child-pl. the-big-pl. go-past-pl.' (=The big children went). Note that all elements in the sentence are marked as plural, and the noun and the adjective also agree in definiteness.

The case of MSA is slightly different. MSA has incomplete agreement in verb-subject sentences, which are the vast majority. In this case the agreement of the verb will only be in gender but not in number, e.g. ðahaba l-?awla:du 'go-past-masc.-sing. boy-pl.' (=The boys went). MSA also distinguishes between human plural forms and non-human plural forms, i.e. if the plural form does not have a human referent, the verb or the adjective will be marked as feminine rather than plural, e.g. ðahabat el-kila:bu l-kabi:ratu 'go-past-fem.-sing. the-dog-masc.-pl. the-big-fem.-sing.' (=The big dogs went).

The example of the agreement rule demonstrates both the similarities and the differences between MSA and MH. Furthermore, it demonstrates how minor the differences are as far as our purposes go. As long as the agreement rule is taken into account, its actual implementation has hardly any consequences at the level of the system. This example also demonstrates a very useful cue to reduce ambiguity among forms. This cue is probably used intuitively by trained readers of MSA and MH, and encoding it into the Morfix NLP system turns out to be quite useful.

1.7 Ambiguity

Perhaps the major challenge for NLP analysis in MSA and MH is overcoming the ambiguity of

[4] The term clitics is employed here as the closest term which can describe this phenomenon without committing to any linguistic theory.
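
As a rough illustration of how the restricted position of proclitics (section 1.5) helps trace the base form, the following Python sketch enumerates the ways a written MH string can be split into leading proclitic letters and a remaining base, then keeps the splits whose base is a known form. It is a sketch under stated assumptions, not the Morfix algorithm: `candidate_splits`, `segment` and the one-entry lexicon are hypothetical, only a subset of the single-letter proclitics named in the text is used, and no ordering constraints between proclitics are modelled.

```python
# Assumption: a subset of the MH proclitic letters listed in section 1.5,
# in the paper's transliteration; ordering constraints are not modelled.
MH_PROCLITICS = frozenset({"w", "b", "k", "l", "m", "h"})


def candidate_splits(word: str, proclitics=MH_PROCLITICS):
    """List every (proclitics, base) split that leaves a non-empty base."""
    splits = [((), word)]
    prefix = []
    for i, ch in enumerate(word[:-1]):
        if ch not in proclitics:
            break  # proclitics can only form an unbroken run at the start
        prefix.append(ch)
        splits.append((tuple(prefix), word[i + 1:]))
    return splits


def segment(word: str, lexicon):
    """Keep only the splits whose base is a known form."""
    return [(p, base) for p, base in candidate_splits(word) if base in lexicon]


# Hypothetical one-entry lexicon: {yldym} = yeladim 'children'.
LEXICON = {"yldym"}
print(candidate_splits("wlyldym"))
print(segment("wlyldym", LEXICON))
```

For {wlyldym} the sketch proposes three splits and the toy lexicon leaves only the one with the proclitics w and l and the base {yldym}. A real system would need a full lexicon and would still face cases where several splits are valid, which is where the contextual disambiguation discussed in the text comes in.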
