Automatic Induction of a CCG Grammar for Turkish

Ruken Çakıcı
Institute for Communicating and Collaborative Systems
University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, United Kingdom
r.cakici@sms.ed.ac.uk

Abstract

This paper presents the results of automatically inducing a Combinatory Categorial Grammar (CCG) lexicon from a Turkish dependency treebank. The fact that Turkish is an agglutinating free word-order language presents a challenge for language theories. We explored possible ways to obtain a compact lexicon, consistent with CCG principles, from a treebank which is an order of magnitude smaller than the Penn WSJ.

1 Introduction

Turkish is an agglutinating language: a single word can be a sentence, with tense, modality, polarity, and voice. It has free word order, subject to discourse restrictions. All these properties make it a challenge for language theories like CCG (Steedman, 2000). Several studies have been made into building a CCG for Turkish (Bozşahin, 2002; Hoffman, 1995). Bozşahin builds a morphemic lexicon to model the phrasal scope of morphemes, which cannot be acquired with a classical lexemic approach. He handles scrambling with type raising and composition. Hoffman proposes a generalisation of CCG (Multiset-CCG) for argument scrambling. She underspecifies directionality, which results in an undesirable increase in the generative power of the grammar. However, Baldridge (2002) gives a more restrictive form of free word-order CCG. Both Hoffman and Baldridge ignore morphology and treat inflected forms as different words.

The rest of this section contains an overview of the underlying formalism (1.1), followed by a review of the relevant work (1.2). In Section 2, the properties of the data are explained. Section 3 then gives a brief sketch of the algorithm used to induce a CCG lexicon, with some examples of how certain phenomena in Turkish are handled. As is likely to be the case for most languages for the foreseeable future, the Turkish treebank is quite small (less than 60K words). A major emphasis in the project is therefore on generalising the induced lexicon to improve coverage. Results and future work are discussed in the last two sections.

1.1 Combinatory Categorial Grammar

Combinatory Categorial Grammar (Ades and Steedman, 1982; Steedman, 2000) is an extension of the classical Categorial Grammar (CG) of Ajdukiewicz (1935) and Bar-Hillel (1953). CG, and extensions to it, are lexicalist approaches which deny the need for movement or deletion rules in syntax. Transparent composition of syntactic structures and semantic interpretations, together with flexible constituency, make CCG a preferred formalism for long-range dependencies and non-constituent coordination in many languages, e.g. English, Turkish, Japanese, Irish, Dutch and Tagalog (Steedman, 2000; Baldridge, 2002).

The categories in categorial grammars can be atomic, or functions which specify the directionality of their arguments. A lexical item in a CG can be represented as the triplet π := σ : μ, where π is the phonological form, σ its syntactic type, and μ its semantic type. Some examples are:

(1) a. adam := NP : man′
    b. okudu := (S\NP)\NP : λx.λy.read′xy

In classical CG, there are two kinds of application rules, which are presented below:

(2) Forward Application (>):
        X/Y : f    Y : a    ⇒    X : f a
    Backward Application (<):
        Y : a    X\Y : f    ⇒    X : f a

In addition to the functional application rules, CCG has combinatory operators for composition (B), type raising (T), and substitution (S).¹ These operators increase the expressiveness to mildly context-sensitive while preserving the transparency of syntax and semantics during derivations, in contrast to classical CG, which is context-free (Bar-Hillel et al., 1964).

(3) Forward Composition (>B):
        X/Y : f    Y/Z : g    ⇒    X/Z : λx.f(g x)
    Backward Composition (<B):
        Y\Z : g    X\Y : f    ⇒    X\Z : λx.f(g x)

(4) Forward Type Raising (>T):
        X : a    ⇒    T/(T\X) : λf.f a
    Backward Type Raising (<T):
        X : a    ⇒    T\(T/X) : λf.f a

Composition and type raising are used to handle syntactic coordination and extraction in languages by providing a means to construct constituents that are not accepted as constituents in other theories.

¹ Substitution and the other combinators will not be discussed here; the interested reader should refer to Steedman (2000).
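To make the category notation concrete, the following is a minimal, self-contained Python sketch of the application and composition rules in (2) and (3). It is only an illustration of the formalism, not part of the induction system described in this paper, and the class and function names (Category, forward_apply, and so on) are our own.

    class Category:
        """An atomic category (e.g. NP) or a function category (e.g. (S\\NP)\\NP)."""
        def __init__(self, result=None, slash=None, arg=None, atom=None):
            self.atom, self.result, self.slash, self.arg = atom, result, slash, arg

        @staticmethod
        def atomic(name):
            return Category(atom=name)

        def __eq__(self, other):
            return str(self) == str(other)

        def __str__(self):
            return self.atom if self.atom else "(%s%s%s)" % (self.result, self.slash, self.arg)

    def forward_apply(x_over_y, y):
        """(2) Forward application: X/Y : f   Y : a   =>   X : f a."""
        if x_over_y.slash == "/" and x_over_y.arg == y:
            return x_over_y.result
        return None

    def backward_apply(y, x_under_y):
        """(2) Backward application: Y : a   X\\Y : f   =>   X : f a."""
        if x_under_y.slash == "\\" and x_under_y.arg == y:
            return x_under_y.result
        return None

    def forward_compose(x_over_y, y_over_z):
        """(3) Forward composition: X/Y : f   Y/Z : g   =>   X/Z."""
        if x_over_y.slash == "/" and y_over_z.slash == "/" and x_over_y.arg == y_over_z.result:
            return Category(result=x_over_y.result, slash="/", arg=y_over_z.arg)
        return None

    S, NP = Category.atomic("S"), Category.atomic("NP")
    tv = Category(result=Category(result=S, slash="\\", arg=NP), slash="\\", arg=NP)  # (S\NP)\NP
    adj = Category(result=S, slash="/", arg=S)                                        # S/S
    print(backward_apply(NP, backward_apply(NP, tv)))  # S: the verb consumes both NPs to its left
    print(forward_apply(adj, S))                       # S: a sentential adjunct applies forward to S
    print(forward_compose(adj, adj))                   # (S/S): adjacent adjuncts compose (cf. Section 3.2)

Type raising (4) could be added in the same style; it is omitted here only to keep the sketch short.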
1.2 Relevant Work

Julia Hockenmaier's robust CCG parser builds a CCG lexicon for English that is then used by a statistical model, with the Penn Treebank as data (Hockenmaier, 2003). She extracts the lexical categories by translating the treebank trees to CCG derivation trees. As a result, the leaf nodes carry the CCG categories of the lexical entities. The head-complement distinction is not transparent in the Penn Treebank, so Hockenmaier uses an algorithm to find the heads (Collins, 1999). There are some inherent advantages to our use of a dependency treebank that only represents surface dependencies. For example, the head is always known, because dependency links go from dependant to head. However, some problems are caused by the fact that only surface dependencies are included. These are discussed in Section 3.5.

2 Data

The METU-Sabancı Treebank is a subcorpus of the METU Turkish Corpus (Atalay et al., 2003; Oflazer et al., 2003). The samples in the corpus are taken from 3 daily newspapers, 87 journal issues and 201 books. The treebank has 5635 sentences, with a total of 53993 tokens. The average sentence length is about 8 words; however, a Turkish word may correspond to several English words, since the morphological information in the treebank encodes additional information including part of speech, modality, tense, person, case, etc. The list of the syntactic relations used to model the dependency relations is the following:

    1. Subject       2. Object         3. Modifier      4. Possessor
    5. Classifier    6. Determiner     7. Adjunct       8. Coordination
    9. Relativiser   10. Particles     11. S.Modifier   12. Intensifier
    13. Vocative     14. Collocation   15. Sentence     16. ETOL

ETOL is used for constructions very similar to phrasal verbs in English. "Collocation" is used for idiomatic usages and word sequences with certain patterns. Punctuation marks do not play a role in the dependency structure unless they participate in a relation, such as the use of the comma in coordination. The label "Sentence" links the head of the sentence to the punctuation mark, or to a conjunct in case of coordination. So the head of the sentence is always known, which is helpful in case of scrambling. Figure 1 shows how (5) is represented in the treebank.

(5) Kapının kenarındaki duvara dayanıp bize baktı bir an.
    '(He) looked at us leaning on the wall next to the door, for a moment.'

    Kapının   kenarındaki    duvara    dayanıp  bize  baktı   bir  an      .
    Door+GEN  side+LOC+REL   wall+DAT  lean     us    looked  one  moment

    Figure 1: The graphical representation of the dependencies in (5), drawn from dependants to the heads, with the labels Sentence, Possessor, Modifier, Object, Modifier, Modifier, Object and Determiner.
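As a concrete illustration of this representation, the following short Python sketch encodes (5) as a list of tokens, each carrying a single labelled link to its head. The encoding is ours: the Token class is not the treebank's file format, and the particular head indices are our own illustrative reading of Figure 1, not values copied from the treebank.

    from typing import List, NamedTuple, Optional

    class Token(NamedTuple):
        form: str            # surface form
        head: Optional[int]  # 0-based index of the head token; None for the final punctuation
        deprel: str          # dependency label from the inventory above

    def sentence_head(tokens: List[Token]) -> Token:
        """The 'Sentence' link goes from the head of the sentence to the punctuation."""
        heads = [t for t in tokens if t.deprel == "Sentence"]
        assert len(heads) == 1, "expected exactly one Sentence link per clause"
        return heads[0]

    def dependants(tokens: List[Token], i: int) -> List[str]:
        """A word has one head, but may have any number of dependants."""
        return [t.form for t in tokens if t.head == i]

    # Example (5), with guessed head attachments.
    sent = [
        Token("Kapının",     1, "Possessor"),   # Door+GEN       -> kenarındaki
        Token("kenarındaki", 2, "Modifier"),    # side+LOC+REL   -> duvara
        Token("duvara",      3, "Object"),      # wall+DAT       -> dayanıp
        Token("dayanıp",     5, "Modifier"),    # lean (converb) -> baktı
        Token("bize",        5, "Object"),      # us             -> baktı
        Token("baktı",       8, "Sentence"),    # looked         -> "."
        Token("bir",         7, "Determiner"),  # one            -> an
        Token("an",          5, "Modifier"),    # moment         -> baktı
        Token(".",        None, ""),            # final punctuation, no outgoing link
    ]

    print(sentence_head(sent).form)   # baktı
    print(dependants(sent, 5))        # dependants of the verb: ['dayanıp', 'bize', 'an']

Storing the link on the dependant, rather than on the head, mirrors the treebank's convention that links go from dependant to head.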
The dependencies in the Turkish treebank are surface dependencies. Phenomena such as traces and pro-drop are not modelled in the treebank. A word can be dependent on only one word, but words can have more than one dependant. The fact that the dependencies go from the head of one constituent to the head of another (Figure 2) makes it easier to recover the constituency information, compared to some other treebanks, e.g. the Penn Treebank, where no clue is given regarding the heads of the constituents.

    Figure 2: The structure of a word.

Two principles of CCG, Head Categorial Uniqueness and Lexical Head Government, mean that both extracted and in situ arguments depend on the same category. This means that long-range dependencies must be recovered and added to the trees to be used in the lexicon induction process, to avoid wrong predicate-argument structures (Section 3.5).

3 Algorithm

The lexicon induction procedure is recursive on the arguments of the head of the main clause. It is called for every sentence and gives a list of the words with their categories. This procedure is called in a loop to account for all sentential conjuncts in case of coordination (Figure 3).

    recursiveFunction(index i, Sentence s)
        headcat = findheadscat(i)
        // base case
        if myrel is "MODIFIER"
            handleMod(headcat)
        elseif "COORDINATION"
            handleCoor(headcat)
        elseif "OBJECT"
            cat = NP
        elseif "SUBJECT"
            cat = NP[nom]
        elseif "SENTENCE"
            cat = S
        ...
        if hasObject(i)
            combCat(cat, "NP")
        if hasSubject(i)
            combCat(cat, "NP[nom]")
        // recursive case
        forall arguments in arglist
            recursiveFunction(argument, s)

    Figure 3: The lexicon induction algorithm

Long-range dependencies, which are crucial for natural language understanding, are not modelled in the Turkish data. Hockenmaier handles them by making use of traces in the Penn Treebank (Hockenmaier, 2003, Section 3.9). Since the Turkish data do not have traces, this information needs to be recovered from morphological and syntactic clues. There are no relative pronouns in Turkish. Subject and object extraction, control and many other phenomena are marked by morphological processes on the subordinate verb. However, the relative morphemes behave in a similar manner to relative pronouns in English (Çakıcı, 2002). This provides the basis for a heuristic method for recovering long-range dependencies in extractions of this type, described in Section 3.5.
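The following is a runnable Python sketch of the recursion in Figure 3, under simplifying assumptions of our own: categories are plain strings, only the Modifier, Object, Subject and Sentence labels are handled, and the helper names (Node, combine, induce) as well as the two miniature example trees are ours, not the paper's.

    from typing import Dict, List, Optional

    class Node:
        """A word in the dependency tree, with its label and its dependants."""
        def __init__(self, form: str, deprel: str, children: Optional[List["Node"]] = None):
            self.form, self.deprel = form, deprel
            self.children = children or []

    def combine(cat: str, arg: str) -> str:
        """combCat: turn cat into a function looking for arg on its left."""
        if "\\" in cat or "/" in cat:
            cat = "(%s)" % cat
        return "%s\\%s" % (cat, arg)

    def induce(node: Node, headcat: str, lexicon: Dict[str, str]) -> None:
        # base case: the category is determined by the word's own dependency label
        if node.deprel == "MODIFIER":
            cat = "%s/%s" % (headcat, headcat)   # handleMod: a pre-head modifier gets X/X
        elif node.deprel == "SENTENCE":
            cat = "S"
        elif node.deprel == "OBJECT":
            cat = "NP"
        elif node.deprel == "SUBJECT":
            cat = "NP[nom]"
        else:
            cat = "X"                            # the remaining labels are elided, as in Figure 3
        # a head with subject or object dependants becomes a function over them
        if any(c.deprel == "SUBJECT" for c in node.children):
            cat = combine(cat, "NP[nom]")
        if any(c.deprel == "OBJECT" for c in node.children):
            cat = combine(cat, "NP")
        lexicon[node.form] = cat
        # recursive case: visit the dependants of this head
        for child in node.children:
            induce(child, cat, lexicon)

    lexicon: Dict[str, str] = {}
    # "adam uyudu" (the man slept): overt subject, no object
    induce(Node("uyudu", "SENTENCE", [Node("adam", "SUBJECT")]), "S", lexicon)
    # "kitabı okudu" ((he) read the book): object present, subject pro-dropped
    induce(Node("okudu", "SENTENCE", [Node("kitabı", "OBJECT")]), "S", lexicon)
    print(lexicon)   # {'uyudu': 'S\\NP[nom]', 'adam': 'NP[nom]', 'okudu': 'S\\NP', 'kitabı': 'NP'}

On these two toy inputs, the verb with an overt subject receives S\NP[nom], while the verb whose subject is pro-dropped receives S\NP, matching the category pairs discussed in Section 3.1 below.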
3.1 Pro-drop

The subject of a sentence and the genitive pronoun in possessive constructions can drop if there are morphological cues on the verb or the possessee. There is no pro-drop information in the treebank, which is consistent with the surface dependency approach. We add a [nom] (for nominative case) feature to the NPs to remove the ambiguity in verb categories. All sentences must have a nominative subject.² Thus, a verb with the category S\NP is assumed to be transitive. This information will be useful in generalising the lexicon in future work (Section 5).

                   original           pro-drop
    transitive     (S\NP[nom])\NP     S\NP
    intransitive   S\NP[nom]          S

² This includes the passive sentences in the treebank.

3.2 Adjuncts

Adjuncts can be given CCG categories like S/S when they modify sentence heads. However, adjuncts can modify other adjuncts, too. In this case we may end up with categories like (6), and even more complex ones. CCG's composition rule (3) means that as long as adjuncts are adjacent they can all have S/S categories, and they will compose to a single S/S at the end without compromising the semantics. This method eliminates many gigantic adjunct categories with sparse counts from the lexicon, following Hockenmaier (2003).

(6) daha := (((S/S)/(S/S))/((S/S)/(S/S))) / (((S/S)/(S/S))/((S/S)/(S/S)))
    'more'

3.3 Coordination

The treebank annotation for a typical coordination example is shown in (7). The constituent which is directly dependent on the head of the sentence, "zıplayarak" in this case, takes its category according to the algorithm. Then the conjunctive operator is given the category (X\X)/X, where X is the category of "zıplayarak" (or whatever the category of the last conjunct is), and the first conjunct takes the same category as X. The information in the treebank is not enough to distinguish sentential coordination from VP coordination. There are about 800 sentences of this type. We decided to leave them out, to be annotated appropriately in the future.

(7) Koşarak   ve      zıplayarak   geldi      .
    Mod.      Coor.   Mod.         Sentence
    'He came running and jumping.'

3.4 NPs

Object heads are given NP categories. Subject heads are given NP[nom]. The category for a modifier of a subject NP is NP[nom]/NP[nom], and the modifier for an object NP is NP/NP, since NPs are almost always head-final.

3.5 Subordination and Relativisation

The treebank does not have traces or null elements. There is no explicit evidence of extraction in the treebank; for example, the heads of relative clauses are represented as modifiers. In order to have the same category type for all occurrences of a verb, and so satisfy the Principle of Head Categorial Uniqueness, heuristics to detect subordination and extraction play an important role.

(8) Kitabı okuyan adam uyudu.
    Book+ACC read+PRESPART man slept.
    'The man who read the book slept.'

These heuristics consist of morphological information, such as the existence of a "PRESPART" morpheme in (8), and the part of speech of the word. However, there is still a problem in cases like (9a) and (9b). Since case information is lost in Turkish extractions, surface dependencies are not enough to differentiate between an adjunct extraction (9a) and an object extraction (9b). A T.LOCATIVE.ADJUNCT dependency link is added from "araba" to "uyuduğum" to emphasise that the predicate is intransitive and may have a locative adjunct. Similarly, a T.OBJECT link is added from "kitap" to "okuduğum". Similar labels were added to the treebank manually for approximately 800 sentences.

(9) a. Uyuduğum araba yandı.
       Sleep+PASTPART car burn+PAST.
       'The car I slept in burned.'
    b. Okuduğum kitap yandı.
       Read+PASTPART book burn+PAST.
       'The book I read burned.'

The relativised verb in (9b) is given a transitive verb category with pro-drop, (S\NP), instead of (NP/NP)\NP, as the Principle of Head Categorial Uniqueness requires. However, to complete the process we need the relative pronoun equivalent in Turkish, -dHk+AGR. A lexical entry with