Cross Linguistic Variations in Discourse Relations among Indian Languages

Sindhuja Gopalan, Lakshmi S, Sobha Lalitha Devi
AU-KBC Research Centre, MIT Campus of Anna University, Chennai, India
sindhujagopalan@au-kbc.org, lakssreedhar@gmail.com, sobha@au-kbc.org

S Bandyopadhyay, D S Sharma and R Sangal. Proc. of the 14th Intl. Conference
on Natural Language Processing, pages 402–407, Kolkata, India. December 2017.
© 2016 NLP Association of India (NLPAI)

Abstract

This paper summarizes our work on the analysis of cross linguistic variations
in discourse relations for the Indo-Aryan language Hindi and the Dravidian
languages Malayalam and Tamil. We also present an automatic discourse relation
identifier, which gave encouraging results. Analysis of the results showed
that some complex structural inter-dependencies exist in these three
languages. We describe in detail the structural inter-dependencies that
occurred. Discourse relations in the three languages thus exhibit a complex
nature due to these structural inter-dependencies.

1 Introduction

Discourse relations link clauses in text and compose the overall text
structure. Discourse relations are used in natural language processing (NLP)
applications, including text summarization and natural language generation.
The analysis and modeling of discourse structure has been an important area of
linguistic research and is necessary for building efficient NLP applications.
Hence the automatic detection of discourse relations is also important. The
Indo-Aryan (Hindi) and Dravidian (Malayalam and Tamil) languages share certain
similarities: they are verb final, have free word order and have
morphologically rich inflections. Due to the influence of Sanskrit on these
languages they are similar at the lexical level, but structurally they are
very different. In this work we present an analysis of the cross linguistic
variations in discourse relations among the three languages Hindi, Malayalam
and Tamil. Instead of identifying all possible discourse relations, we have
considered the analysis of explicit discourse relations and developed an
automatic discourse relation identification system. During error analysis
various structural interdependencies were also noted.

Discourse tagging for the Indian languages Hindi, Malayalam and Tamil has been
done by Sobha et al. (2014). Other published works on discourse relation
annotation in Indian languages are in Hindi (Kolachina et al., 2012; Oza et
al., 2009) and Tamil (Rachakonda and Sharma, 2011). Menaka et al. (2011)
automatically identified causal relations and described the structural
interdependencies that exist between the relations. Similarly, we observed the
existence of structural interdependencies between the discourse relations in
the three languages, which we explain in detail. From the previous works on
discourse relation annotation for various Indian languages, we can observe
that each study of discourse relations was carried out for one specific Indian
language, and hence we attempt to discuss the cross linguistic variations
among Hindi, Tamil and Malayalam.

Researchers have performed identification and extraction of discourse
relations using cue based or statistical methods. The Penn Discourse Tree Bank
(PDTB) is a large scale annotated corpus of linguistic phenomena in English
(Prasad et al., 2008). The PDTB is the first to follow the lexically grounded
approach to annotation of discourse relations. Marcu and Echihabi (2012)
focused on recognition of discourse relations using cue phrases, but not on
extraction of arguments. Wellner and Pustejovsky (2007) considered the problem
of automatically identifying the arguments of discourse connectives in PDTB.
They recast the problem as that of identifying the argument heads, instead of
identifying the full extents of the arguments as annotated in PDTB. To address
the problem of identifying the arguments of discourse connectives they
incorporated a variety of lexical and syntactic features in a discriminative
log-linear re-ranking model to select the best argument pair from a set of N
best argument pairs provided by independent argument models. They obtained
74.2% accuracy using a gold standard parser and 64.6% accuracy using an
automatic parser for both arguments. Elwell and Baldridge (2008) used models
tuned to specific connectives and connective types. Their study showed that
using models for specific connectives and types of connectives and
interpolating them with a general model improves the performance. The features
used to improve performance include the morphological properties of
connectives and their arguments, additional syntactic configuration and wider
context of preceding and following connectives. The system was developed on
PDTB. They used a Maximum Entropy ranker. Models were trained for arg1 and
arg2 selection separately. They achieved 77.8% accuracy for identifying both
arguments of a connective using a gold standard parser and 73.6% accuracy
using an automatic parser. Ramesh and Yu (2010) developed a system for
identification of discourse connectives in the bio-medical domain. They
developed the system on the BioDRB corpus using the CRFs algorithm. They
obtained an F-score of 84% for PDTB data and an F-score of 69% for BioDRB
data. For the PDTB based classifier on BioDRB data, they obtained an F-score
of 55%. In this work they did not focus on identification of arguments.
Versley (2010) presented his work on tagging German discourse connectives
using a German–English parallel corpus. AlSaif (2012) used machine learning
algorithms for automatically identifying explicit discourse connectives and
their relations in Arabic. Wang et al. (2012) used sub-trees as features and
identified explicit and implicit connectives and their arguments. Zhou et al.
(2012) presented the first effort towards cross lingual identification of the
ambiguities of discourse connectives. Faiz et al. (2013) identified explicit
discourse connectives in the PDTB and the Biomedical Discourse Relation Bank
(BDRB) by combining certain aspects of the surface level and syntactic feature
sets. In this study we tried to develop a discourse parser for all three
languages for identification of connectives and their arguments.

The following sections are organized as follows. Corpus collection and
annotation are described in Section 2, cross linguistic variations in
discourse relations among the three languages are given in Section 3, the
method used for the automatic identification of discourse relations and the
results are described in Section 4, and the various structural
interdependencies that occur in the three languages are described in
Section 5. The paper ends with the conclusion section.

2 Corpus collection and Annotation

Health related articles were chosen from the web and, after removing
inconsistencies like hyperlinks, a total corpus of 5000 sentences was
obtained. Then we annotated the corpus for connectives and their arguments.
The discourse relation annotation was purely syntactic. The arguments were
labeled as arg1 and arg2, and arg2 was chosen to be the one following arg1.
When connectives occur as free words, we tag them separately and the discourse
units between which the relation is inferred are marked as arg1 and arg2. When
the connectives exist as bound morphemes, we keep them along with the word to
which they are attached and include them under arg1. The annotated corpus
contains 1332 explicit connectives in Hindi, 1853 in Malayalam and 1341 in
Tamil. From the data statistics we can observe that Malayalam has a larger
number of connectives than Tamil and Hindi. The annotated corpus is used to
train the system, and models are built for the identification of connectives
and arguments.

3 Cross Linguistic variations in Discourse Relations

The discourse relation in Indian languages can be expressed in many ways. It
can be syntactic (a suffix) or lexical. It can be within a clause,
inter-clausal or inter-sentential. The various cross linguistic variations in
discourse relations among the three languages are analyzed and described
below.

3.1 Discourse Connectives

Discourse relations can be inferred using Explicit or Implicit connectives.
Explicit connectives connect two discourse units and trigger a discourse
relation. The explicit connectives can be realized in any of the following
ways.

  • Subordinators that connect the main clause with the subordinate or
    dependent clause. (For example: agar-to, jabkI in Hindi, appoL, -aal in
    Malayalam and -aal, ataal in Tamil).
  • Coordinators which connect two or more items of equal syntactic importance.
    They connect two independent clauses. (For example: “aur”, “lekin” in
    Hindi, “-um”, “ennaal” in Malayalam and “anaal”, “athanaal” in Tamil).
  • Conjunct adverbs that connect two independent clauses and modify the
    clauses or sentences in which they occur. (For example: “isliye”,
    “halaanki” in Hindi, “athinaal”, “aakayaal” in Malayalam and “enninum”,
    “aakaiyaal” in Tamil).
  • Correlative conjunctions, which are paired conjunctions. They link words
    or groups of words of equal weight in a sentence. (For example: “na keval
    balki” in Hindi, “maathramalla-pakshe” in Malayalam and
    “mattumalla-aanaal” in Tamil).

3.2 Position of Connectives

In our approach we have done syntax based tagging. In Hindi, Malayalam and
Tamil, discourse connectives can occur within a sentence or between sentences.
In all the three languages inter sentence connectives are said to occupy the
sentence initial position. Example 1 shows an inter sentence discourse
relation in Malayalam.

Example 1:
[chila  aaLukaL  mukhsoundaryam  koottaan
 Some   people   facial-beauty   increase
kreemukaL  upayogikkaaruNt.]/arg1
creams     use
ennaal  [athu  guNathekkaaLeRe     doshamaaN
But      that  goodness-more than  harm-is
cheyyuka.]/arg2
do
(Some people use creams to increase their facial beauty. But that will do more
harm than good.)

We found that there exists a difference in the position of the conjunct adverb
“although” among the three languages. As in Example 2, in Hindi this
connective occurs in the sentence initial position, whereas in Tamil and
Malayalam this connective occurs in the middle position and remains
agglutinated with the verb.

Example 2:
haalaaMki  [yoga  pakshaaGaath  kii samasyaa kaa
although    yoga  paralysis     problem's
sTaayii    samaaDhaan  karthaa  hai]/arg2,
permanent  solution    do       is
[yah   samay  lethaa  hai  evaM  shramsaaDya
 This  time   take    is   and   painstaking
hai]/arg1
is
(Although yoga gives a permanent solution for paralysis, this is time taking
and painstaking.)

In Tamil and Malayalam the connective “and” takes the form shown in Example 3.
In Hindi the single lexical item “aur” serves this purpose.

Example 3:
[muuttukaLiluLLa  kuRuththelumpu  vaLaraamal
 in knee          cartilage       without growing
theymaanam  atainthaalum]/arg1,
wear        if get-and
[angkuLLa  vazhuvazhuppaana  thiravam
 there     smooth            fluid
kuRainthupoonaalum]/arg2  muuttukaLil  uraayvu
get less-and              knee         friction
eRpatum.
will develop
(If cartilage in the knee gets wear without growing and if the smooth fluid
present there becomes less, friction will develop in the knee.)

3.3 Agglutinated and intra sentence

In Malayalam and Tamil connectives can occur as free words or bound morphemes.
But in Hindi only free word connectives exist, as in Example 2.

Example 4:
[vayiRRil    kutalpun  irunthaal]/arg1    [vayiRu
 In stomach  ulcer     is there-if         stomach
valikkum]/arg2.
will pain
(If there is ulcer in stomach, stomach will pain.)

3.4 Paired connectives

In Hindi some discourse connectives were seen as paired connectives. This type
of connective is not noticed in Malayalam and Tamil.

Example 5:
yadhii  [lagaathaar  buKaar  aa rahaa  hai]/arg1  tho
if       constantly  fever   coming    is         then
[uskii  jaaNca  avashaya  karaaye]/arg2.
 its    check   sure      do
(If fever is coming constantly, then check it for sure.)

In the above Example 5, “yadhii-tho” is the paired connective, whose parts
occur at the start of arg1 and arg2. In Tamil and Malayalam, by contrast, it
occurs as a single connective agglutinated with the verb, as in Example 4.

3.5 Arguments of Relations

In our approach the label assignment is syntactic. Sometimes the arguments can
be in the same sentence as the connective. Sometimes one of the preceding
sentences acts as an argument. The argument can also be a non-adjacent
sentence. But the text span follows the minimality principle. In
Example 1 the connective “ennaal” in Malayalam connects two discourse units
inter sententially. The discourse unit that follows the connective is arg2 and
the preceding unit is arg1. In Example 4 the arguments for the connective
“-aal” in Tamil occur in the same sentence.

4 Automatic identification of discourse relation

4.1 Method Used

We have used the method adopted by Menaka et al. (2011) for the identification
of discourse relations. We preprocessed the text with morphological analysis
(Ram et al., 2010), part-of-speech (PoS) tagging (Sobha et al., 2016),
chunking (Sobha and Ram, 2006) and clause tagging (Ram et al., 2012). The
implementation is based on the machine learning technique CRFs.

4.2 Conditional Random Fields

A CRF is an undirected graphical model in which the conditional probabilities
of the output are maximized for a given input sequence. We chose CRFs because
they allow linguistic rules or conditions to be incorporated into the machine
learning algorithm. Here, we have used CRF++ (Kudo, 2005), an open source
toolkit for linear chain CRFs.

4.3 Features Used

For the identification of connectives in Malayalam and Tamil, we used PoS
tagging information, morphological suffixes and clause information as
features. Morphological suffixes such as conditional markers, causal markers,
the relative participle (RP) marker followed by a postposition (PSP) and
coordination markers were used. For connective identification in Hindi, the
word, PoS tagging information and chunk information were used. For argument
identification we took PoS tagging information, chunk information,
morphological suffixes, clause information, the combination of PoS and chunk
information, and connectives as features.

4.4 Training and Testing

For identifying the discourse connectives, we trained the system using the
features for connectives. In the next stage we trained the system to identify
the arguments and their text spans. Here we built four models, one for each of
the four boundaries: Arg2-START, Arg1-END, Arg1-START and Arg2-END, motivated
by the work of Menaka et al. (2011). The system was trained in 4 phases to
develop the 4 models. We used 4000 sentences from the corpus for training and
1000 sentences for testing. For testing, the sentences are pre-processed in
the same way as the training data. The system identifies the discourse markers
in stage 1 and this output becomes the input to stage 2. In both stages we
used CRFs as the machine learning algorithm.

The performance of our system is measured in terms of Precision, Recall and
F-score. Precision is the number of discourse relations correctly identified
by the system divided by the total number of discourse relations the system
identified. Recall is the number of discourse relations correctly identified
by the system divided by the total number of discourse relations contained in
the input text. F-score is the harmonic mean of precision and recall.

The results for connective identification are tabulated in Table 1.

             Precision   Recall   F-score
Hindi        96.33       92.3     94.27
Malayalam    96.3        91.6     93.89
Tamil        95.35       94.18    94.76
Table 1: Results for Connective Identification

The argument identification results are given in Table 2, Table 3, Table 4 and
Table 5.

             Precision   Recall   F-score
Hindi        76          72.2     74.05
Malayalam    78.5        72       75.1
Tamil        81.53       73.6     77.36
Table 2: Results for ARG1 Start

             Precision   Recall   F-score
Hindi        75.9        72.2     74
Malayalam    78.8        72       75.23
Tamil        82          72.6     77
Table 3: Results for ARG1 End

             Precision   Recall   F-score
Hindi        77.4        73.2     75.24
Malayalam    79.2        73       75.97
Tamil        81.5        72.6     76.79
Table 4: Results for ARG2 Start

             Precision   Recall   F-score
Hindi        76.3        71.2     73.66
Malayalam    78.7        72.4     75.42
Tamil        82          72.7     77
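
The precision, recall and F-score definitions given in Section 4.4 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function names `f_score` and `precision_recall_f` are hypothetical and are not taken from the paper, and the paper does not report the raw counts behind its percentages. Only the definitions themselves and the Hindi precision and recall values from Table 1 come from the text.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(correct, identified, gold):
    """Return (precision, recall, F-score) as percentages.

    correct    -- relations the system identified correctly (hypothetical count)
    identified -- all relations the system identified
    gold       -- all relations contained in the input text
    """
    precision = 100.0 * correct / identified if identified else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Precision and recall reported for Hindi connective identification (Table 1).
hindi_f = f_score(96.33, 92.3)
print(round(hindi_f, 2))
```

Applied to the Hindi row of Table 1, the harmonic mean rounds to 94.27, which matches the F-score reported there. The same check can be run on the other rows of the tables.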