Cross Linguistic Variations in Discourse Relations among Indian Languages

Sindhuja Gopalan, Lakshmi S, Sobha Lalitha Devi
AU-KBC Research Centre, MIT Campus of Anna University, Chennai, India
sindhujagopalan@au-kbc.org, lakssreedhar@gmail.com, sobha@au-kbc.org

S Bandyopadhyay, D S Sharma and R Sangal. Proc. of the 14th Intl. Conference on Natural Language Processing, pages 402–407, Kolkata, India. December 2017. ©2016 NLP Association of India (NLPAI)

Abstract

This paper summarizes our work on the analysis of cross linguistic variations in discourse relations for the Indo-Aryan language Hindi and the Dravidian languages Malayalam and Tamil. We also present an automatic discourse relation identifier, which gave encouraging results. Analysis of the results showed that some complex structural inter-dependencies exist in these three languages. We describe in detail the structural inter-dependencies that occurred. Discourse relations in the three languages thus exhibit a complex nature due to the structural inter-dependencies.

1 Introduction

Discourse relations link clauses in text and compose the overall text structure. Discourse relations are used in natural language processing (NLP), including text summarization and natural language generation. The analysis and modeling of discourse structure has been an important area of linguistic research and is necessary for building efficient NLP applications. Hence the automatic detection of discourse relations is also important. The Indo-Aryan (Hindi) and Dravidian languages (Malayalam and Tamil) share certain similarities: they are verb final, have free word order and have morphologically rich inflections. Due to the influence of Sanskrit on these languages they are similar at the lexical level, but structurally they are very different. In this work we present an analysis of the cross linguistic variations in the discourse relations among three languages: Hindi, Malayalam and Tamil. Instead of identifying all possible discourse relations, we considered the analysis of explicit discourse relations and developed an automatic discourse relation identification system. During error analysis various structural interdependencies were also noted.

Discourse tagging for the Indian languages Hindi, Malayalam and Tamil has been done by Sobha et al. (2014). Other published works on discourse relation annotation in Indian languages are in Hindi (Kolachina et al. (2012); Oza et al. (2009)) and Tamil (Rachakonda and Sharma (2011)). Menaka et al. (2011) automatically identified causal relations and described the structural interdependencies that exist between the relations. Similarly, we observed structural interdependencies between the discourse relations in the three languages, which we explain in detail. From the previous works on discourse relation annotation for various Indian languages, we can observe that the study of discourse relations has been carried out for one specific Indian language at a time, and hence we attempted to discuss the cross linguistic variations among Hindi, Tamil and Malayalam.

Researchers have performed identification and extraction of discourse relations using cue based or statistical methods. The Penn Discourse Tree Bank (PDTB) is a large scale annotated corpus of linguistic phenomena in English (Prasad et al., 2008). The PDTB is the first to follow the lexically grounded approach to annotation of discourse relations. Marcu and Echihabi (2012) focused on recognition of discourse relations using cue phrases, but not on extraction of arguments. Wellner and Pustejovsky (2007) considered the problem of automatically identifying the arguments of discourse connectives in PDTB. They recast the problem as that of identifying the argument heads, instead of identifying the full extents of the arguments as annotated in PDTB. To address the problem of identifying the arguments of discourse connectives they incorporated a variety of lexical and syntactic features in a discriminative log-linear re-ranking model to select the best argument pair from a set of N best argument pairs provided by independent argument models. They obtained 74.2% accuracy using a gold standard parser and 64.6% accuracy using an automatic parser for both arguments. Elwell and Baldridge (2008) used models tuned to specific connectives and connective types. Their study showed that using models for specific connectives and types of connectives and interpolating them with a general model improves the performance. The features used to improve performance include the morphological properties of connectives and their arguments, additional syntactic configuration and the wider context of preceding and following connectives. The system was developed on PDTB. They used a Maximum entropy ranker. Models were trained for arg1 and arg2 selection separately. They achieved 77.8% accuracy for identifying both arguments of a connective using a gold standard parser and 73.6% accuracy using an automatic parser. Ramesh and Yu (2010) developed a system for identification of discourse connectives in the bio-medical domain. They developed the system on the BioDRB corpus using the CRFs algorithm. For PDTB data they obtained an F-score of 84%. They obtained an F-score of 69% for BioDRB data. For the PDTB based classifier on BioDRB data, they obtained an F-score of 55%. In this work they did not focus on identification of arguments. Versley (2010) presented his work on tagging German discourse connectives using a German–English parallel corpus. AlSaif (2012) used machine learning algorithms for automatically identifying explicit discourse connectives and their relations in Arabic. Wang et al. (2012) used sub-trees as features and identified explicit and implicit connectives and their arguments. Zhou et al. (2012) presented the first effort towards cross lingual identification of the ambiguities of discourse connectives. Faiz et al. (2013) identified explicit discourse connectives in the PDTB and the Biomedical Discourse Relation Bank (BDRB) by combining certain aspects of the surface level and syntactic feature sets. In this study we tried to develop a discourse parser for all three languages for identification of connectives and their arguments.

The following sections are organized as follows. Corpus collection and annotation is described in section 2, cross linguistic variations in discourse relations among the three languages are given in section 3, the method used for the automatic identification of discourse relations and the results are described in section 4 and the various structural interdependencies that occur in the three languages are described in section 5. The paper ends with the conclusion section.

2 Corpus collection and Annotation

Health related articles were chosen from the web and, after removing inconsistencies like hyperlinks, a total corpus of 5000 sentences was obtained. Then we annotated the corpus for connectives and their arguments. The discourse relation annotation was purely syntactic. The arguments were labeled as arg1 and arg2, and arg2 was chosen to be the one following arg1. When connectives occur as free words, we tag them separately and the discourse units between which the relation is inferred are marked as arg1 and arg2. When the connectives exist as bound morphemes we keep them along with the word to which they are attached and include them under arg1. The annotated corpus contains 1332 explicit connectives in Hindi, 1853 in Malayalam and 1341 in Tamil. From the data statistics we can observe that Malayalam has a larger number of connectives than Tamil and Hindi. The annotated corpus is used to train the system, and models are built for the identification of connectives and arguments.

3 Cross Linguistic variations in Discourse Relations

The discourse relation in Indian languages can be expressed in many ways. It can be syntactic (a suffix) or lexical. It can be within a clause, inter-clausal or inter-sentential. The various cross linguistic variations in discourse relations among the three languages are analyzed and described below.

3.1 Discourse Connectives

Discourse relations can be inferred using Explicit or Implicit connectives. Explicit connectives connect two discourse units and trigger a discourse relation. The explicit connectives can be realized in any of the following ways.

- Subordinators that connect the main clause with the subordinate or dependent clause. (For example: “agar-to”, “jabkI” in Hindi, “appoL”, “-aal” in Malayalam and “-aal”, “ataal” in Tamil).
- Coordinators which connect two or more items of equal syntactic importance. They connect two independent clauses. (For example: “aur”, “lekin” in Hindi, “-um”, “ennaal” in Malayalam and “anaal”, “athanaal” in Tamil).
- Conjunct adverbs that connect two independent clauses and modify the clauses or sentences in which they occur. (For example: “isliye”, “halaanki” in Hindi, “athinaal”, “aakayaal” in Malayalam and “enninum”, “aakaiyaal” in Tamil).
- Correlative conjunctions which are paired conjunctions. They link words or groups of words of equal weight in a sentence. (For example: “na keval balki” in Hindi, “maathramalla-pakshe” in Malayalam and “mattumalla-aanaal” in Tamil).

3.2 Position of Connectives

In our approach we have done a syntactic based tagging. In Hindi, Malayalam and Tamil discourse connectives can occur within a sentence or between sentences. In all the three languages inter sentence connectives occupy the sentence initial position. Example 1 shows the inter sentence discourse relation in Malayalam.

Example 1:
[chila aaLukaL mukhsoundaryam koottaan
Some people facial-beauty increase
kreemukaL upayogikkaaruNt.]/arg1
creams use
ennaal [athu guNathekkaaLeRe doshamaaN
But that goodness-more than harm-is
cheyyuka.]/arg2
do
(Some people use creams to increase their facial beauty. But that will do more harm than good.)

We found that there exists a difference in the position of the conjunct adverb “although” among the three languages. As in Example 2, in Hindi this connective occurs in the sentence initial position whereas in Tamil and Malayalam this connective occurs in the middle position and remains agglutinated with the verb.

Example 2:
haalaaMki [yoga pakshaaGaath kii samasyaa kaa
although yoga paralysis problem's
sTaayii samaaDhaan karthaa hai]/arg2,
permanent solution do is
[yah samay lethaa hai evaM shramsaaDya
This time take is and painstaking
hai]/arg1
is
(Although yoga gives a permanent solution for paralysis, this is time taking and painstaking.)

In Tamil and Malayalam the connective “and” exists in the form shown in Example 3. In Hindi the single lexical item “aur” serves this purpose.

Example 3:
[muuttukaLiluLLa kuRuththelumpu vaLaraamal
in knee cartilage without
theymaanam atainthaalum]/arg1,
growing wear if get-and
[angkuLLa vazhuvazhuppaana thiravam
there smooth fluid
kuRainthupoonaalum]/arg2 muuttukaLil uraayvu
get less-and knee friction
eRpatum.
will develop
(If cartilage in the knee gets wear without growing and if the smooth fluid present there becomes less, friction will develop in the knee.)

3.3 Agglutinated and intra sentence

In Malayalam and Tamil connectives can occur as free words or bound morphemes. But in Hindi only free word connectives exist, as in Example 2.

Example 4:
[vayiRRil kutalpun irunthaal]/arg1 [vayiRu
In stomach ulcer is there-if stomach
valikkum]/arg2.
will pain
(If there is ulcer in stomach, stomach will pain.)

3.4 Paired connectives

In Hindi some discourse connectives were seen as paired connectives. This type of connective is not noticed in Malayalam and Tamil.

Example 5:
yadhii [lagaathaar buKaar aa rahaa hai]/arg1 tho
if constantly fever coming is then
[uskii jaaNca avashaya karaaye]/arg2.
its check sure do
(If fever is coming constantly, then check it for sure.)

In Example 5 above, “yadhii-tho” is the paired connective whose parts occur at the start of arg1 and arg2. In Tamil and Malayalam, by contrast, it occurs as a single connective, as in Example 4, and is agglutinated with the verb.

3.5 Arguments of Relations

In our approach the label assignment is syntactic. Sometimes, the arguments can be in the same sentence as the connective. Sometimes, one of the preceding sentences acts as an argument. The argument can also be a non-adjacent sentence. But the text span follows the minimality principle. In Example 1 the connective “ennaal” in Malayalam connects two discourse units inter sententially. The discourse unit that follows the connective is arg2 and the preceding unit is arg1. In Example 4 the arguments for the connective “-aal” in Tamil occur in the same sentence.

4 Automatic identification of discourse relation

4.1 Method Used

We have used the method adopted by Menaka et al. (2011) for the identification of discourse relations. We preprocessed the text with morphological analysis (Ram et al, 2010), part-of-speech (PoS) tagging (Sobha et al, 2016), chunking (Sobha and Ram, 2006) and clause tagging (Ram et al, 2012). The implementation is based on the machine learning technique CRFs.

4.2 Conditional Random Fields

A CRF is an undirected graphical model, where the conditional probabilities of the output are maximized for a given input sequence. We chose CRFs because they allow linguistic rules or conditions to be incorporated into the machine learning algorithm. Here, we have used CRF++ (Kudo, 2005), an open source toolkit for linear chain CRFs.

4.3 Features Used

For the identification of connectives, we used PoS tagging information, morphological suffixes and clause information as features for Malayalam and Tamil. Morphological suffixes such as conditional markers, causal markers, relative participle (RP) marker followed by postposition (PSP) and coordination markers were used. For connective identification in Hindi, word, PoS tagging information and chunk information were used. For argument identification we took PoS tagging information, chunk information, morphological suffixes, clause information, the combination of PoS and chunk information, and connectives as features.

4.4 Training and Testing

For identifying the discourse connectives, we trained the system using the features for connectives. In the next stage we train the system to identify the arguments and their text spans. Here we built 4 language models, one for each of the 4 boundaries: Arg2-START, Arg1-END, Arg1-START and Arg2-END, motivated by the work of Menaka et al. (2011). The system was trained in 4 phases to develop 4 models. We used 4000 sentences from the corpus for training and 1000 sentences for testing. For testing, the sentences are pre-processed in the same way as the training data. The system identifies the discourse markers in stage 1 and this output becomes the input to stage 2. In both stages we used CRFs as the machine learning algorithm.

The performance of our system is measured in terms of Precision, Recall and F-score. Precision is the number of discourse relations correctly identified by the system divided by the total number of discourse relations it identified, Recall is the number of discourse relations correctly identified by the system divided by the total number of discourse relations contained in the input text, and F-score is the harmonic mean of precision and recall.

The results for connective identification are tabulated in Table 1.

            Precision  Recall  F-score
Hindi       96.33      92.3    94.27
Malayalam   96.3       91.6    93.89
Tamil       95.35      94.18   94.76
Table 1: Results for Connective Identification

The argument identification results are given in Table 2, Table 3, Table 4 and Table 5.

            Precision  Recall  F-score
Hindi       76         72.2    74.05
Malayalam   78.5       72      75.1
Tamil       81.53      73.6    77.36
Table 2: Results for ARG1 Start

            Precision  Recall  F-score
Hindi       75.9       72.2    74
Malayalam   78.8       72      75.23
Tamil       82         72.6    77
Table 3: Results for ARG1 End

            Precision  Recall  F-score
Hindi       77.4       73.2    75.24
Malayalam   79.2       73      75.97
Tamil       81.5       72.6    76.79
Table 4: Results for ARG2 Start

            Precision  Recall  F-score
Hindi       76.3       71.2    73.66
Malayalam   78.7       72.4    75.42
Tamil       82         72.7    77
Table 5: Results for ARG2 End
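The evaluation measures defined in Section 4.4 can be illustrated with a short sketch. It is not the authors' evaluation script: the function names `precision_recall` and `f_score` are hypothetical, and the only data used are the precision and recall values reported in Table 1. Under those assumptions, it recomputes the F-score column as the harmonic mean of precision and recall.

```python
def precision_recall(correct: int, identified: int, gold: int) -> tuple[float, float]:
    """Precision and recall (as percentages) from raw counts.

    correct    -- relations the system identified correctly
    identified -- all relations the system identified
    gold       -- all relations present in the input text
    """
    precision = 100.0 * correct / identified if identified else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    return precision, recall


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for connective identification, copied from Table 1.
table1 = {
    "Hindi": (96.33, 92.3),
    "Malayalam": (96.3, 91.6),
    "Tamil": (95.35, 94.18),
}

# Recompute the F-score column, rounded to two decimals as in the paper.
scores = {lang: round(f_score(p, r), 2) for lang, (p, r) in table1.items()}

for lang, score in scores.items():
    print(f"{lang}: F-score = {score}")
```

The recomputed values agree with the F-score column of Table 1 (94.27, 93.89 and 94.76), which suggests the reported figures are balanced F1 scores rather than a weighted variant.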