Source: aryamanarora.github.io


Conjunct verbs in Punjabi and across Indo-Aryan: a corpus study

Aryaman Arora
Georgetown University
aa2190@georgetown.edu

Abstract

I introduce a new Universal Dependencies corpus for Punjabi and investigate
the syntactic behaviour of conjunct verbs across the Indo-Aryan family. I find
evidence of conjunct component ‘stickiness’ from corpus data that supports the
treatment of conjunct verbs as a single constituent. The work is a step towards
better coverage of UD in Indo-Aryan and further investigation of comparative
and historical linguistic questions.

1 Introduction

Punjabi is the language spoken in the ‘land of five rivers’, a historical area
around the tributaries of the Indus river now partitioned into the Punjab
administrative regions in India and Pakistan respectively. It has over 100
million native speakers. The prestige dialect of Punjabi is Majhi (lit.
middle), associated with the cities of Lahore, Pakistan and Amritsar, India.

Punjabi is an Indo-Aryan (IA) language. Indo-Aryan is unique among language
families in having both immense diversity in the modern period and a
continuously attested history of more than 3,000 years since the attestation
of Vedic Sanskrit. This makes it very exciting for work on comparative and
historical linguistics, and computational methods are necessary given the vast
number of texts. Unfortunately, there are large gaps in availability of
labelled data for this depth and breadth.

The contribution of this paper is two-fold: I design and annotate a Punjabi
Universal Dependencies corpus, and using it and other existing UD corpora for
Indo-Aryan languages I investigate the properties of conjunct verbs, which are
NOUN-VERB and ADJ-VERB constructions that behave as one morphological
unit.[1] Namely, I ask: does corpus data affirm that the host is syntactically
different from other verbal arguments, and is it actually sensible to treat ADJ
and NOUN hosts as a single class, as many works do?

[1] A terminological note: The verb component of a conjunct verb is called the
    light verb, and the other component (regardless of part of speech) is
    called the host. Conjunct verbs are a kind of complex predicate, the other
    main subtype in IA being VERB-VERB constructions.

2 Designing a Punjabi corpus

For the purpose of having a broader selection of Indo-Aryan languages to
examine, I created a syntactically-annotated Universal Dependencies (Nivre et
al., 2016, 2020) corpus for Punjabi in the Gurmukhi script.[2] While the corpus
is relatively small, it covers several genres of text (news, editorial, and
blog) and is of much higher quality than existing large treebanks for
Indo-Aryan languages due to being hand-annotated.

[2] Released here.

Token    Translit.   Gloss         UPOS    Deprel
ਇਸ       is          this          DET     det
ਚੋਣ      coṇ         election      NOUN    obl
ਵਿੱਚ      vicc        in            ADP     case
ਜਠਾਣੀ    jaṭʰāṇī     eld. sis.     NOUN    nsubj
ਨੇ       ne          (ERG)         ADP     case
ਦਰਾਣੀ    darāṇī      young. sis.   NOUN    obj
ਨੂੰ       nūⁿ         (ACC)         ADP     case
ਹਰਾਇਆ   harāiā      defeated      VERB    root
।        .           .             PUNCT   punct

Figure 1: A Universal Dependencies-annotated sentence (id news_bbc_inlaw_25)
from my Punjabi corpus. An English translation is “In this election, the elder
sister-in-law defeated the younger sister-in-law.”

2.1 Text composition

Table 1 shows the breakdown of text in the corpus. Given the limited time for
the final project, I prioritised text diversity instead of having a large
corpus of a single kind of text (which would have been easier to annotate given
intra-genre language conventions). I found texts on my own and vetted them
manually for quality before annotation.

Genre        Doc.   Sent.   Tok.
misc           -      71    1664
news           3      71    1274
editorial      1      39     762
blog           1      33     806
Total          5     214    4506

Table 1: Data in the Punjabi UD corpus by genre. Columns are ‘documents’,
‘sentences’, and ‘tokens’.

Why not use existing corpora? There are already several Punjabi corpora for
NLP applications. The largest one is IndicCorp with 773 million tokens (Kakwani
et al., 2020). For unlabelled data, Punjabi is no low-resourced language.
However, after annotating a small portion of data from
IndicCorp, it became apparent that the text was low-quality, and an
uncomfortably large portion of the source data could be traced back to spam
websites advertising questionable products.[3] IndicCorp also tosses out
document-level structure, while coherent documents could be useful to have for
future multilayer annotation.

[3] For example: https://pa.eferrit.com/. I am unable to understand what the
    purpose of these types of websites is, but all the articles felt
    machine-translated and unsuitable for annotation.

However, I did find some more carefully collected corpora. The FLORES-101
low-resource machine-translation dataset (Goyal et al., 2021), PMIndia (Haddow
and Kirefu, 2020) and EMILLE (McEnery et al., 2000; Baker et al., 2002) will
eventually be incorporated. I wanted more direct control over text genres
though, so only small parts of FLORES have been incorporated so far.

2.2 Annotation

I annotated POS (part-of-speech) tags and dependency relations following the
Universal Dependencies schema. Morphological features have not been annotated
yet, but will be in a semi-automated fashion eventually. To annotate I used UD
Annotatrix, a locally-hosted tool for editing CoNLL-U dependency trees (Tyers
et al., 2017). Texts were segmented into sentences manually and tokenised by
whitespace, with further manual corrections. Each document is named by its
genre, source, and a unique one-word identifier, e.g. news_bbc_rajnikanth is a
news article from the BBC about South Indian actor Rajnikanth’s entry into
politics.

I relied on reference dictionaries (RCPLT, 2021; Singh, 1895) and grammars
(Bhatia, 1993; Gill and Gleason, 2013) to design the annotation guidelines, and
also referred to other treebanks (particularly HDTB[4] and Hindi PUD[5]). The
Universal Dependencies community also helped deal with some linguistic issues
in annotation.[6]

[4] Hindi Dependency Treebank
[5] Parallel Universal Dependencies
[6] The GitHub issues I created all dealt with copular constructions: Copula
    with clausal argument, What even is a copula, Copulas besides ਹੋਣਾ in
    Punjabi, ADJ+ਹੋਣਾ compounds in Punjabi.

As a heritage speaker of Punjabi and a native speaker of the closely-related
Hindi–Urdu, I also had sufficient experience with the language to be able to
analyse constructions that have not been described in grammars.

2.3 Other IA corpora

Out of the New Indo-Aryan (NIA) languages, only 9 have active UD corpora with
annotations, with this new Punjabi corpus being the tenth. Their sizes are
listed in table 2.

Lang.       Ref.                       Sent.     Tok.
Hindi       Tandon et al. (2016)       17.6k   375.5k
Urdu        Bhat and Sharma (2012)      5.1k   138.1k
Magahi      -                           0.6k     7.7k
Bhojpuri    Ojha and Zeman (2020)       0.4k     6.7k
Punjabi     this work                   0.2k     4.5k
Marathi     Ravishankar (2017)          0.5k     3.5k
Kangri      -                           0.3k     2.5k
Odia        -                          0.05k     0.4k
Bengali     -                          0.06k     0.3k

Table 2: New Indo-Aryan UD corpora. (Sindhi UD is excluded because it has no
dependency structures.)

3 Conjunct verbs

Conjunct verbs are an areal phenomenon of (but not exclusively of) the South
Asian region, being found in both the Indo-Aryan and Dravidian families
(Puttaswamy, 2018). For this corpus study,
two classes of conjunct verbs are under consideration, exemplified below in
Punjabi:

(1)  maiⁿ  ne   bataur  ekṭar  naukrī  kītī  .
                               host    lv
     I     ERG  as      actor  career  did.
     ‘I had a job as an actor.’

(2)  maiⁿ  ne   kamre  nūⁿ  sāf    kītā  .
                            host   lv
     I     ERG  room   ACC  clean  did.
     ‘I cleaned the room.’

The host is the element providing the semantics and much of the argument
structure of the conjunct verb construction, and the choice of light verb
merely indicates transitivity and provides tense-aspect-mood information. (1)
has a NOUN host and (2) has an ADJ host.

Extensive theoretical linguistic work on IA conjunct verbs (Burton-Page, 1957;
Hacker, 1961; Kachru, 1982; Mohanan, 1994, 1995; Vaidya, 2015; Montaut, 2016;
Fatma, 2018) has led to agreement on the following points:

1. The host does not take case marking or other modifiers (e.g. determiners in
   the case it is a noun).
2. The host is an argument to the verb, as evidenced by agreement, but at the
   same time forms a morphological unit with the verb (evidenced by
   limitations on movement).
3. Both the host and the light verb play a role in the argument structure of
   the clause, but the semantics are largely provided by the host.

This also provides an easy diagnostic for whether something is a conjunct verb
construction.

3.1 Adjectival conjunct verbs

However, much of the theoretical work focuses on noun hosts to the detriment of
adjectives; e.g. Mohanan (1994) assumes all discussion of noun conjuncts
applies to adjectives. The following examples illustrate issues in the
syntactic analysis of adjectival conjunct verbs:

(3)  a.  maiⁿ  ne   kamrā  sāf    kītā  .
                           host   lv
         I     ERG  room   clean  did.
         ‘I cleaned the room.’

     b.  kamrā  sāf    hoiā  .
                host   lv
         room   clean  became
         ‘The room was cleaned [by someone].’

(4)  kamrā  sāf    hai.
     room   clean  is
     ‘The room is clean.’

In (3), we can use different light verbs (karnā ‘to do’ and hoṇā ‘to be’) to
change the transitivity of the adjectival conjunct construction. Meanwhile, (4)
is just an attributive copular construction, but it uses the same verb hoṇā in
the predicate as the intransitive conjunct verb. Also note the existence of
other verbs which can behave as attributive copulae, such as baṇnā ‘to become’
and rahiṇā ‘to remain/continue to be’.

Why then do we analyse adjectival conjunct verbs as conjuncts in the first
place? Why not treat the whole class of verbs (including transitive karnā) as
taking a predicative complement, described under Universal Dependencies as
xcomp? I will investigate the available UD corpora to gain some more evidence
about the properties of conjunct verbs.

4 Analysis

In all NIA language UD corpora, conjunct verbs use the dependency relation
compound or its subtype compound:lvc (which I followed in Punjabi). To run all
analyses I used Python scripts, the conllu package for parsing UD corpora, and
plotnine for graphs.

4.1 Claim 1: Hosts in conjunct verbs stick

In New Indo-Aryan languages, constituent order is discourse-configurational,
i.e. it is ‘free’ but SOV[7] is unmarked and other orderings of constituents
are conditioned by pragmatic considerations and topicalisation.

[7] In Kashmiri and some other more northern IA languages, V2 word order is
    unmarked instead, but in our sample only SOV-unmarked languages are
    represented.

One common claim is that conjunct verb hosts are ‘sticky’; they cannot move in
the sentence with the same flexibility as actual semantic arguments. Mohanan
(1994) categorically claims that in Hindi the host can never detach from the
light verb. This is claimed to be evidence that they form a single
morphological unit, since, per usual syntactic tendencies in Hindi, verbal
arguments are free to move.

To check whether conjunct hosts are ‘stickier’ than direct objects (obj, dobj),
I first checked how far each direct object was from its expected position
immediately before the verb, ignoring conjunct hosts. Example measurements (the
direct object is kamrā in each):

(5)  maiⁿ  ne  kamrā  vekʰiā  .               (distance: 0)
                      v
(6)  maiⁿ  ne  kamrā  sāf   kītā  .           (0)
                      host  lv
(7)  kamrā  [maiⁿ ne]  sāf   kītā  .          (1)
                       host  lv
(8)  maiⁿ  ne  sāf   kītā  kamrā  .           (1)
               host  lv

Then I calculated the same distances for conjunct hosts. To see if there is a
statistically significant difference between objects and predicative
complements vs. conjunct hosts, I ran a permutation test (with 1,000
permutations) to compare mean distances.

Results are shown in table 3, with figures for only NOUN comparisons in the
appendix (appendix A). In almost all Indo-Aryan languages, conjunct hosts are
indeed significantly stickier than objects. In Bengali for NOUNs and Kangri for
ADJs the difference is not significant, likely due to small sample size. In
Odia for NOUNs the result is flipped, but again the sample size is small.
However, in Bhojpuri there is both a decent sample size and a non-significant
difference in distance for both types of conjuncts, indicating syntactic
differences from the rest of Indo-Aryan that are worth investigating. Generally
though, I find this claim upheld by the data.

Treebank         Obj       n   Host(NOUN)      m   Host(ADJ)      k
Bengali         0.09      22         0.00      3           -      -
Bhojpuri        0.47      55         0.77    347        1.18     38
Hindi (HDTB)    0.35   10378         0.10   8463        0.05   4813
Hindi (PUD)     0.21    1154         0.06    224        0.02    219
Magahi          0.37     385         0.08     36           -      -
Kangri          0.40      63         0.18     57        0.08     12
Marathi         0.27     181         0.04     27        0.00      5
Odia            0.63      43         1.17     18           -      -
Punjabi         0.28     151         0.01     69        0.05     57
Urdu            0.38    4061         0.09   4561        0.06   2486

Table 3: Mean distance of objects (ignoring hosts) and conjunct hosts from
their head verb across NIA languages. Red indicates a non-significant
difference. Bold indicates a statistically significant difference in the
opposite of the expected direction: objects are ‘stickier’. The rest are
significant for hosts being stickier at p < 0.05.

4.2 Claim 2: Predicative complements aren’t sticky

Unfortunately, in all the treebanks the number of adjectival predicative
complements (ADJ with deprel xcomp) was quite small. In the two largest
treebanks (Hindi-HDTB and Urdu) I was able to run sensible permutation tests
since there was enough data. With 320 xcomp to test against in Hindi and 195 in
Urdu, a statistically significant greater stickiness of adjectival hosts was
indeed found. The average distance of xcomp was close to 0, but the difference
was there; perhaps xcomp arguments can be moved freely but due to rarity stay
in the unmarked position.

This suggests there is actually something special about adjectival hosts with
respect to stickiness, and my line of argumentation in §3.1 (that adjectival
conjuncts might be better analysed as actual arguments) is not really supported
by data, since we would expect arguments to be more mobile. So, this claim is
not supported.

5 Limitations

A major limitation of this study is that I have not been able to test the other
major property of conjunct verbs: the contribution of hosts to argument
structure. I do think this is feasible with the corpus study, but I fear the
limited coverage of infrequent lexemes will make it harder to study with these
annotated UD corpora, and I am limited in space.

Also, I have poor coverage of languages here besides Hindi–Urdu in both
theoretical background and corpus data; of course, one contribution of mine is
the Punjabi UD corpus, which is one step towards improving breadth in UD.

6 Conclusion

Syntactically annotated corpora enable the study of many interesting questions
in Indo-Aryan comparative linguistics, and they have not been adequately
employed for that purpose or developed to cover the family well. This paper
both presents a new UD corpus for Punjabi, a low-resourced language by NLP
standards, and investigates the syntactic behaviour of conjunct verbs across
Indo-Aryan languages.

I plan on expanding the Punjabi UD corpus to cover more genres (especially
poetry and social media) and adding morphological feature annotations. I also
want to expand coverage of other Indo-Aryan languages; likely next candidates
are Sinhala and Sindhi.
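
The distance measure of Section 4.1 is only described through examples (5) to (8), so the following is a minimal sketch of one reading that reproduces those four values, not the scripts actually used for the paper. It assumes that every other direct dependent of the verb counts as one constituent regardless of how many tokens it spans, that hosts (compound, compound:lvc) and punctuation are skipped, and that an object placed after the verb costs one extra step. The `Token` class, the helper names and the head/deprel annotations of the example sentences are illustrative assumptions; real data would be read from CoNLL-U files instead.

```python
from dataclasses import dataclass

# Relations the paper reports for conjunct-verb hosts in NIA treebanks.
HOST_RELS = {"compound", "compound:lvc"}
OBJECT_RELS = {"obj", "dobj"}


@dataclass
class Token:
    id: int      # 1-based position in the sentence
    form: str
    head: int    # id of the syntactic head, 0 for the root
    deprel: str


def distance(dep, sent, skip=frozenset(HOST_RELS | {"punct"})):
    """Steps between `dep` and the slot immediately before its head.

    Assumption: each other direct dependent of the same head is one
    constituent; dependents whose relation is in `skip` are ignored.
    """
    others = [t.id for t in sent
              if t.head == dep.head and t.id != dep.id and t.deprel not in skip]
    if dep.id < dep.head:
        # Dependent precedes the verb: count constituents in between.
        return sum(dep.id < i < dep.head for i in others)
    # Dependent follows the verb: one step past the verb, plus intervening ones.
    return 1 + sum(dep.head < i < dep.id for i in others)


def sentence(*rows):
    """Build a sentence from (form, head, deprel) triples (hypothetical helper)."""
    return [Token(i, form, head, rel) for i, (form, head, rel) in enumerate(rows, 1)]


def direct_objects(sent):
    return [t for t in sent if t.deprel in OBJECT_RELS]


# Examples (5) to (8); heads and relations are my own assumed annotation.
EXAMPLES = {
    5: sentence(("maiⁿ", 4, "nsubj"), ("ne", 1, "case"),
                ("kamrā", 4, "obj"), ("vekʰiā", 0, "root")),
    6: sentence(("maiⁿ", 5, "nsubj"), ("ne", 1, "case"), ("kamrā", 5, "obj"),
                ("sāf", 5, "compound:lvc"), ("kītā", 0, "root")),
    7: sentence(("kamrā", 5, "obj"), ("maiⁿ", 5, "nsubj"), ("ne", 2, "case"),
                ("sāf", 5, "compound:lvc"), ("kītā", 0, "root")),
    8: sentence(("maiⁿ", 4, "nsubj"), ("ne", 1, "case"),
                ("sāf", 4, "compound:lvc"), ("kītā", 0, "root"),
                ("kamrā", 4, "obj")),
}

measured = {k: [distance(o, s) for o in direct_objects(s)]
            for k, s in EXAMPLES.items()}
print(measured)  # {5: [0], 6: [0], 7: [1], 8: [1]}
```

Counting constituents rather than tokens is what makes (7) come out as 1 even though two tokens (maiⁿ ne) separate the object from the verb complex. How auxiliaries or other post-verbal material should be counted is not stated in the text, so this sketch leaves that open.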
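
The permutation test mentioned in Section 4.1 (1,000 permutations comparing mean distances) can be sketched with the standard library alone. This is an illustrative version under stated assumptions: the test is taken to be two-sided, the p-value uses an add-one correction, and the two input lists are invented toy numbers, not values from any treebank. The function name and its parameters are my own.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=1000, seed=0):
    """Two-sided permutation test on the difference of means of `a` and `b`.

    Returns (observed difference, p-value). Group labels are shuffled
    `n_permutations` times; the p-value is the share of shuffles whose
    absolute mean difference is at least as large as the observed one,
    with an add-one correction so it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed) - 1e-12:  # tolerance for float noise
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)


# Toy distances, invented for illustration only.
object_distances = [0, 1, 0, 2, 1, 0, 0, 1, 3, 0, 1, 0]
host_distances = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

diff, p_value = permutation_test(object_distances, host_distances)
print(f"mean difference = {diff:.2f}, p = {p_value:.3f}")
```

A permutation test fits this kind of data because the distances are small non-negative counts with many zeros, so no distributional assumption is needed. With very small samples, such as the Bengali or Odia rows of Table 3, the number of distinct relabellings is limited and the test has little power, which is in line with the caution the text already expresses about those rows.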