Hindi Compound Verbs and their Automatic Extraction

Debasri Chakrabarti, Humanities and Social Sciences Department, IIT Bombay, debasri@iitb.ac.in
Hemang Mandalia, Computer Science and Engineering Department, IIT Bombay, hemang.rm@gmail.com
Ritwik Priya, Computer Science and Engineering Department, IIT Bombay, ritwik@cse.iitb.ac.in
Vaijayanthi Sarma, Humanities and Social Sciences Department, IIT Bombay, vsarma@iitb.ac.in
Pushpak Bhattacharyya, Computer Science and Engineering Department, IIT Bombay, pb@cse.iitb.ac.in

Coling 2008: Companion volume – Posters and Demonstrations, pages 27–30, Manchester, August 2008

© 2008. Licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported license (http://creativecommons.org/licenses/by-nc-sa/3.0/). Some rights reserved.

Abstract

We analyse Hindi complex predicates and propose linguistic tests for their detection. This analysis enables us to identify a category of V+V complex predicates called lexical compound verbs (LCpdVs) which need to be stored in the dictionary. Based on the linguistic analysis, a simple automatic method has been devised for extracting LCpdVs from corpora. We achieve an accuracy of around 98% in this task. The LCpdVs thus extracted may be used to automatically augment lexical resources like wordnets, an otherwise time-consuming and labour-intensive process.

1 Introduction

Complex predicates (CPs) abound in South Asian languages (Butt, 1995; Hook, 1974), primarily as either noun+verb combinations (conjunct verbs) or verb+verb (V+V) combinations (compound verbs). This paper discusses the latter.

Of the many V+V sequences in Hindi, only a subset constitutes true CPs. Thus, we first need diagnostic tests to differentiate between CP and non-CP V+V sequences. Of the CPs thus isolated, we need to distinguish between those CPs that are formed in the syntax (derivationally) and those that are formed in the lexicon (LCpdVs) in order to include only the latter in lexical knowledge bases. Further, automatic extraction of LCpdVs from electronic corpora and their inclusion in lexical knowledge bases is a desirable goal for languages like Hindi, which liberally use CPs.

This paper discusses Hindi Verb+Verb (V+V) CPs and their automatic extraction from a corpus.

1.1 Related work

Alsina (1996) discusses the general theory of complex predicates. Early work on conjunct and compound verbs in Hindi appears in Burton-Page (1957) and Arora (1979). Our work on diagnostic tests for CPs, as reported here, has been inspired by Butt (1993, 1995, for Urdu) and Paul (2004, for Bengali). The analysis of lexical derivation of LCpdVs derives from the work on compound verbs by Abbi (1991, 1992) and Gopalkrishnan and Abbi (1992).

This work is motivated primarily by the need to automatically augment lexical networks such as the Princeton Wordnet (Miller et al., 1990) and the Hindi Wordnet (Narayan et al., 2002). Pasca (2005) and Snow et al. (2006) report work on such augmentations by processing web documents.

To the best of our knowledge, ours is the first attempt at automatic extraction of LCpdVs from Hindi corpora.

1.2 Organization of the paper

Section 2 discusses CPs in Hindi and the ways to distinguish them from other, similar-looking constructions. Section 2.3 discusses the automatic extraction of CPs from corpora. Section 3 concludes the paper.

2 V+V Complex Predicates in Hindi

We have identified five different types of V+V sequences in Hindi. These are:

1. V1stem+V2: maar Daalnaa (kill-put) ‘kill’.
2. V1inf-e+lagnaa: rone lagnaa (cry-feel) ‘start crying’.
3. V1inf+paRnaa: bolnaa paRaa (say-lie) ‘had to say’.
4. V1inf-e+V2: likhne ko/ke lie kahaa ‘asked to write’.
5. V1-kar+V2: lekar gayaa ‘took and went’.

2.1 Identification of CPs

Following Butt (1993) and Paul (2004), we use the following diagnostic tests to identify CPs in Hindi:

1. Scope of adverbs
2. Scope of negation
3. Nominalization
4. Passivization
5. Causativization
6. Movement

(See Appendix A for an example of these tests.)

The tests above have been exhaustively applied to varied data. The results of these tests show that some V+V sequences function as single semantic units and others do not. They also show that the V1stem+V2, V1inf-e+lagnaa and V1inf+paRnaa sequences show similar properties, and that the V1inf-e+V2 stem and the V1-kar+V2 sequences behave similarly. We call these Group 1 and Group 2 respectively.

Group 1 sequences are true CPs in Hindi. These V+V sequences are simple (monoclausal) predicates with one subject. Group 2 constructions are not CPs. They show clausal embedding, and each verb behaves as if it were an independent syntactic entity. In the next section we summarize the semantic properties of CPs (Group 1).

2.2 Semantic Properties of V2 in Group 1

After identifying the CPs from among different V+V sequences, the next step was to determine how they are formed. To accomplish this we examined the semantic properties of the second verbs (V2) in Group 1:

(1) V1inf+paRnaa:
Examples include karnaa paRaa ‘do-lie (had to do)’, bolnaa paRaa ‘say-lie (had to say)’ etc. The second verb is always paRnaa ‘to lie (lay)’. It appears in its stem form and bears all the inflections. As V2, paRnaa has the meaning of compulsion/force. paRnaa ‘lie’ as a V2 can be combined with any V1 irrespective of the latter’s semantic properties. Since there are no syntactic or semantic restrictions on the selection of V1, this construction should be treated in the syntax as a combination of a V1 and a modal auxiliary.

(2) V1inf-e+lagnaa:
Examples include karne lagaa ‘do-feel (start to do)’, bolne lagaa ‘say-feel (start to say)’ etc. The V2 in this sequence is always lagnaa ‘feel’ in the bare form and carries all the inflections. The core meaning of lagnaa ‘feel’ is lost when it is combined with a V1. As a V2 it always has the meaning of the beginning or happening of an event. lagnaa ‘feel’ as a V2 can be combined with any V1 irrespective of the latter’s semantic properties. Thus, this is also an instance of a modal auxiliary and should be derived in the syntax.

(3) V1stem+V2:
In the formation of V1stem+V2, the V2 may be any one of ten verbs, as shown in Figure 1.

1. Daalnaa ‘put’
2. lenaa ‘take’
3. denaa ‘give’
4. uThnaa ‘wake’
5. jaanaa ‘go’
6. paRnaa ‘lie’
7. baiThnaa ‘sit’
8. maarnaa ‘kill’
9. dhamaknaa ‘throb’
10. girnaa ‘fall’

Figure 1: The 10 vector verbs

All these V2s also occur as main verbs. As V2, the core meaning of these verbs is lost (bleached), but they acquire some new semantic properties which are otherwise not seen (Abbi, 1991, 1992; Gopalkrishnan and Abbi, 1992). The semantic properties of V2s include finality, definiteness, negative value, manner of the action, attitude of the speaker etc.

The combination of V1 and V2 is subject to the semantic compatibility between the two verbs. The argument structure of the CP is determined by V1, as is the case-marking on the internal arguments, but the case-marking on the external argument (subject) is determined by both verbs.

From this analysis we conclude that V+V CPs are formed both lexically and syntactically in Hindi. Detailed investigation shows us that the V2 in the V1inf-e+lagnaa and the V1inf+paRnaa constructions is a type of modal auxiliary and its semantic features are predictable and unvarying. We propose to deal with these verbs in the syntax and call them syntactic compound verbs (SCpdVs). The V2 choice in the V1stem+V2 construction is not predictable, and the CPs function as a single complex of syntactic and semantic features. We call these verbs lexical compound verbs (LCpdVs) and we propose to include them in the lexical knowledge base. In the next section we provide a heuristic for automatic extraction of LCpdVs for storage in the lexicon.

2.3 The Extraction Process

By scanning the corpus, V1stem+V2 sequences were found using the heuristic H* specified in Figure 2.

(Heuristic H*)
If a verb V1 is in the stem form and is followed by a verb V2 from a pre-stored list of verbs that can form the second component of the CP (Section 2.2, Figure 1), i.e., the ‘vector’, then this verb along with the V2 is taken to be an instance of an LCpdV.

Figure 2: Main heuristic for identifying LCpdVs

Ten native speakers of Hindi were consulted. They were asked to construct sentences with the extracted sequences. If they were able to do so, that sequence was registered as a true LCpdV.

The precision of the heuristic is calculated as the ratio of the actual LCpdVs arrived at through manual validation to the total number of anticipated LCpdVs identified by the heuristic.

The results of these calculations are shown in Table 1, with a precision rate of 70% for the BBC corpus and 79% for the CIIL one.

Corpus | Total detections | POS ambiguities | Passive forms | LCpdVs (manually detected) | Precision
-------|------------------|-----------------|---------------|----------------------------|---------------
BBC    | 40               | 8               | 4             | 28                         | 0.7 (28/40)
CIIL   | 174              | 32              | 7             | 135                        | 0.79 (135/174)

Table 1: Precision of LCpdV extraction

The loss in precision was caused by (i) part-of-speech ambiguity, (ii) passivisation and (iii) idiomatic usages. For lack of space, we do not discuss this here.

When measures were taken to remedy these errors, we reached an accuracy of close to 98% (see Table 2).

                                                   | BBC            | CIIL
---------------------------------------------------|----------------|----------------
Confirmed LCpdVs (A)                               | 423            | 953
Not LCpdVs (B)                                     | 13             | 12
Different POS (C)                                  | 65             | 179
Possible LCpdVs but contexts insufficient (D)      | 44             | 36
Minimum Precision (A/(A+B+D))                      | 0.88 (423/480) | 0.95 (953/1001)
Maximum Precision ((A+D)/(A+B+D))                  | 0.97 (467/480) | 0.99 (989/1001)
Total V1stem+V2 constructions in the corpus        | 10,145         | 36,115

Table 2: Final results of LCpdV extraction

A partial list of LCpdVs extracted from a test run on the CIIL corpus is presented in Table 3.

baandh denaa ‘tie’    | kar lenaa ‘do’    | bhar denaa ‘fill’         | le jaanaa ‘take’                 | banaa denaa ‘make’
jaan lenaa ‘know’     | kaaT denaa ‘cut’  | kar denaa ‘do’            | badal jaanaa ‘change’            | bhuul jaanaa ‘forget’
jalaa denaa ‘burn’    | gir jaanaa ‘fall’ | samajh lenaa ‘understand’ | samjhaa denaa ‘make understand’  | khod lenaa ‘dig’
lauTaa denaa ‘return’ | rah jaanaa ‘stay’ | le lenaa ‘take’           | de denaa ‘give’                  | ghusaa denaa ‘enter’

Table 3: Examples of LCpdV extraction

3 Conclusions and Future Work

In this paper, we have presented a study of Hindi compound verbs, proposed diagnostic tests for their detection and given automatic methods for their extraction from a corpus. Native speakers verify that the accuracy of our method is close to 98% on representative corpora.

Future work will consist of inserting the extracted LCpdVs into lexical resources such as the Hindi wordnet (developed by the wordnet team at IIT Bombay, www.cfilt.iitb.ac.in/webhwn) at the right places with the right links.

References

Abbi, Anvita. 1991. Semantics of explicator compound verbs. In South Asian Languages, Language Sciences, 13:2, 161-180.

Abbi, Anvita. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53, 27-46.

Alsina, Alex. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Arora, H. 1979. Aspects of Compound Verbs in Hindi. M.Litt. dissertation, Delhi University.

Burton-Page, J. 1957. Compound and conjunct verbs in Hindi. BSOAS, 19, 469-78.

Butt, M. 1993. Conscious choice and some light verbs in Urdu. In M. K. Verma, ed. (1993), Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Butt, M. 1995. The Structure of Complex Predicates in Urdu. Doctoral dissertation, Stanford University.

Cruys, Tim Van de and B. V. Moiron. 2007. Semantics-based multiword expression extraction. ACL-2007 Workshop on Multiword Expressions.

Gopalkrishnan, D. and Abbi, A. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53, 27-46.

Miller, G., R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. 1990. Five Papers on WordNet. CSL Report 43, Cognitive Science Laboratory, Princeton University, Princeton. http://www.cogsci.princeton.edu/~wn

Narayan, D., D. Chakrabarty, P. Pande, and P. Bhattacharyya. 2002. An experience in building the Indo WordNet - a WordNet for Hindi. International Conference on Global WordNet (GWC 02), Mysore, India, January.

Pasca, Marius. 2005. Finding instance names and alternative glosses on the web: WordNet reloaded. Proceedings of CICLing, Mexico City.

Snow, Rion, Dan Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. Proceedings of COLING/ACL, Sydney.

Appendix A. Example of a diagnostic test for LCpdVs: scope of adverbs

Verb Type      | Example                                                                            | Comment                                                                  | CP?
---------------|------------------------------------------------------------------------------------|--------------------------------------------------------------------------|----
V1stem+V2      | us-ne jaldii jaldii khaa li-aa ‘(S)he ate quickly.’                                | Scope over the whole sequence                                            | Yes
V1inf-e+lagnaa | vah jaldii se khaan-e lag-aa ‘He started eating immediately.’                      | Scope over the whole sequence                                            | Yes
V1inf+paRnaa   | mujhe yah kaam jaldii karnaa paR-aa ‘I had to do the work quickly.’                | Scope over the whole sequence                                            | Yes
V1inf-e+V2     | us-ne mujhe khat jaldii se likhn-e kah-aa ‘He asked me to write the letter quickly.’ | Either over V1 or V2, depending upon the syntactic position of the adverb | No
V1-kar+V2      | vah jaldii se nahaa-kar aa-yeg-aa ‘He will take bath quickly and come.’            | Either over V1 or V2, depending upon the syntactic position of the adverb | No
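As an illustration of heuristic H* (Figure 2), the following is a minimal sketch and not the authors' implementation. It assumes the corpus has already been run through a morphological analyser that supplies a lemma and a tag for every token; the `Token` type, the tag names `V_STEM`, `V` and `V_INF_E`, and the function name are all hypothetical. The vector-verb list is the one in Figure 1, and the two test sentences are taken from Appendix A.

```python
from typing import List, NamedTuple, Tuple

# The ten vector verbs of Figure 1, in the paper's transliteration.
VECTOR_VERBS = frozenset({
    "Daalnaa", "lenaa", "denaa", "uThnaa", "jaanaa",
    "paRnaa", "baiThnaa", "maarnaa", "dhamaknaa", "girnaa",
})


class Token(NamedTuple):
    surface: str  # word as it appears in the corpus
    lemma: str    # citation form from the (assumed) analyser
    tag: str      # hypothetical tag set: "V_STEM", "V", "V_INF_E", "N", ...


def extract_lcpdv_candidates(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Return (V1 stem, V2 lemma) pairs that satisfy heuristic H*."""
    candidates = []
    # Slide over adjacent token pairs; H* only looks at V1 immediately followed by V2.
    for v1, v2 in zip(tokens, tokens[1:]):
        v1_is_stem = v1.tag == "V_STEM"
        v2_is_vector = v2.tag.startswith("V") and v2.lemma in VECTOR_VERBS
        if v1_is_stem and v2_is_vector:
            candidates.append((v1.surface, v2.lemma))
    return candidates


# 'us-ne jaldii jaldii khaa li-aa': V1stem+V2, expected to match.
lcpdv_sentence = [
    Token("us-ne", "vah", "PRON"),
    Token("jaldii", "jaldii", "ADV"),
    Token("jaldii", "jaldii", "ADV"),
    Token("khaa", "khaanaa", "V_STEM"),
    Token("li-aa", "lenaa", "V"),
]

# 'vah jaldii se khaan-e lag-aa': V1inf-e+lagnaa, expected not to match.
scpdv_sentence = [
    Token("vah", "vah", "PRON"),
    Token("jaldii", "jaldii", "ADV"),
    Token("se", "se", "POSTP"),
    Token("khaan-e", "khaanaa", "V_INF_E"),
    Token("lag-aa", "lagnaa", "V"),
]

print(extract_lcpdv_candidates(lcpdv_sentence))  # [('khaa', 'lenaa')]
print(extract_lcpdv_candidates(scpdv_sentence))  # []
```

A purely positional check like this would also accept sequences in which V1 is mis-tagged or the construction is passive or idiomatic, which is consistent with the error sources the paper lists for Table 1; the sketch makes no attempt to filter them.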
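The precision bounds in Table 2 can be recomputed from the raw counts. The sketch below uses only the counts A, B and D reported in the table; the class and attribute names are illustrative. It takes the maximum precision to count the undecided sequences (D) as correct, since the reported numerators 467 and 989 equal A+D for the BBC and CIIL corpora respectively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationCounts:
    confirmed: int  # A: confirmed LCpdVs
    rejected: int   # B: not LCpdVs
    undecided: int  # D: possible LCpdVs, context insufficient

    @property
    def judged(self) -> int:
        # Denominator A+B+D; sequences with a different POS (C) are excluded.
        return self.confirmed + self.rejected + self.undecided

    def min_precision(self) -> float:
        # Treat every undecided sequence as an error: A / (A+B+D).
        return self.confirmed / self.judged

    def max_precision(self) -> float:
        # Treat every undecided sequence as correct: (A+D) / (A+B+D).
        return (self.confirmed + self.undecided) / self.judged


TABLE_2 = {
    "BBC": ValidationCounts(confirmed=423, rejected=13, undecided=44),
    "CIIL": ValidationCounts(confirmed=953, rejected=12, undecided=36),
}

for corpus, counts in TABLE_2.items():
    print(f"{corpus}: min {counts.min_precision():.2f}, "
          f"max {counts.max_precision():.2f} (n = {counts.judged})")
# BBC: min 0.88, max 0.97 (n = 480)
# CIIL: min 0.95, max 0.99 (n = 1001)
```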