Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Peng Qi*, Yuhao Zhang*, Yuhui Zhang, Jason Bolton, Christopher D. Manning
Stanford University
Stanford, CA 94305
{pengqi, yuhaozhang, yuhuiz}@stanford.edu
{jebolton, manning}@stanford.edu
* Equal contribution. Order decided by a tossed coin.

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 101–108, July 5 - July 10, 2020. © 2020 Association for Computational Linguistics

Abstract

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.

[Figure 1 panels: raw text in many languages (EN, FR, ZH, DE, AR, KO, ES, RU, JA, NL, VI, HI; "Multilingual: 66 Languages") passes through the processors TOKENIZE (tokenization and sentence split), MWT (multi-word token expansion), LEMMA (lemmatization), POS (POS and morphological tagging), DEPPARSE (dependency parsing), and NER (named entity recognition) ("Fully Neural: Language-agnostic"), yielding native Python objects: DOCUMENT, SENTENCE, TOKEN, and WORD, the last carrying LEMMA, POS, HEAD, DEPREL, etc.]

Figure 1: Overview of Stanza's neural NLP pipeline. Stanza takes multilingual text as input, and produces annotations accessible as native Python objects. Besides this neural pipeline, Stanza also features a Python client interface to the Java CoreNLP software.

1 Introduction

The growing availability of open-source natural language processing (NLP) toolkits has made it easier for users to build tools with sophisticated linguistic processing. While existing NLP toolkits such as CoreNLP (Manning et al., 2014), FLAIR (Akbik et al., 2019), spaCy¹, and UDPipe (Straka, 2018) have had wide usage, they also suffer from several limitations. First, existing toolkits often support only a few major languages. This has significantly limited the community's ability to process multilingual text. Second, widely used tools are sometimes under-optimized for accuracy either due to a focus on efficiency (e.g., spaCy) or use of less powerful models (e.g., CoreNLP), potentially misleading downstream applications and insights obtained from them. Third, some tools assume input text has been tokenized or annotated with other tools, lacking the ability to process raw text within a unified framework. This has limited their wide applicability to text from diverse sources.

¹ https://spacy.io/

We introduce Stanza², a Python natural language processing toolkit supporting many human languages. As shown in Table 1, compared to existing widely-used NLP toolkits, Stanza has the following advantages:

• From raw text to annotations. Stanza features a fully neural pipeline which takes raw text as input, and produces annotations including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
• Multilinguality. Stanza's architectural design is language-agnostic and data-driven, which allows us to release models supporting 66 languages, by training the pipeline on the Universal Dependencies (UD) treebanks and other multilingual corpora.
• State-of-the-art performance. We evaluate Stanza on a total of 112 datasets, and find its neural pipeline adapts well to text of different genres, achieving state-of-the-art or competitive performance at each step of the pipeline.

² The toolkit was called StanfordNLP prior to v1.0.0.

System     # Human      Programming   Raw Text     Fully     Pretrained   State-of-the-art
           Languages    Language      Processing   Neural    Models       Performance
CoreNLP    6            Java          ✦                      ✦
FLAIR      12           Python                     ✦         ✦            ✦
spaCy      10           Python        ✦                      ✦
UDPipe     61           C++           ✦                      ✦            ✦
Stanza     66           Python        ✦            ✦         ✦            ✦

Table 1: Feature comparisons of Stanza against other popular natural language processing toolkits.

Additionally, Stanza features a Python interface to the widely used Java CoreNLP package, allowing access to additional tools such as coreference resolution and relation extraction.

Stanza is fully open source and we make pretrained models for all supported languages and datasets available for public download. We hope Stanza can facilitate multilingual NLP research and applications, and drive future research that produces insights from human languages.

2 System Design and Architecture

At the top level, Stanza consists of two individual components: (1) a fully neural multilingual NLP pipeline; (2) a Python client interface to the Java Stanford CoreNLP software. In this section we introduce their designs.

2.1 Neural Multilingual NLP Pipeline

Stanza's neural pipeline consists of models that range from tokenizing raw text to performing syntactic analysis on entire sentences (see Figure 1). All components are designed with processing many human languages in mind, with high-level design choices capturing common phenomena in many languages and data-driven models that learn the difference between these languages from data. Moreover, the implementation of Stanza components is highly modular, and reuses basic model architectures when possible for compactness. We highlight the important design choices here, and refer the reader to Qi et al. (2018) for modeling details.

(fr) L'Association des Hôtels
(en) The Association of Hotels
(fr) Il y a des hôtels en bas de la rue
(en) There are hotels down the street

Figure 2: An example of multi-word tokens in French. The des in the first sentence corresponds to two syntactic words, de and les; the second des is a single word.

Tokenization and Sentence Splitting. When presented raw text, Stanza tokenizes it and groups tokens into sentences as the first step of processing. Unlike most existing toolkits, Stanza combines tokenization and sentence segmentation from raw text into a single module. This is modeled as a tagging problem over character sequences, where the model predicts whether a given character is the end of a token, end of a sentence, or end of a multi-word token (MWT, see Figure 2).³ We choose to predict MWTs jointly with tokenization because this task is context-sensitive in some languages.

³ Following Universal Dependencies (Nivre et al., 2020), we make a distinction between tokens (contiguous spans of characters in the input text) and syntactic words. These are interchangeable aside from the cases of MWTs, where one token can correspond to multiple words.

Multi-word Token Expansion. Once MWTs are identified by the tokenizer, they are expanded into the underlying syntactic words as the basis of downstream processing. This is achieved with an ensemble of a frequency lexicon and a neural sequence-to-sequence (seq2seq) model, to ensure that frequently observed expansions in the training set are always robustly expanded while maintaining flexibility to model unseen words statistically.

POS and Morphological Feature Tagging. For each word in a sentence, Stanza assigns it a part-of-speech (POS), and analyzes its universal morphological features (UFeats, e.g., singular/plural, 1st/2nd/3rd person, etc.). To predict POS and UFeats, we adopt a bidirectional long short-term memory network (Bi-LSTM) as the basic architecture. For consistency among universal POS (UPOS),
treebank-specific POS (XPOS), and UFeats, we adopt the biaffine scoring mechanism from Dozat and Manning (2017) to condition XPOS and UFeats prediction on that of UPOS.

Lemmatization. Stanza also lemmatizes each word in a sentence to recover its canonical form (e.g., did→do). Similar to the multi-word token expander, Stanza's lemmatizer is implemented as an ensemble of a dictionary-based lemmatizer and a neural seq2seq lemmatizer. An additional classifier is built on the encoder output of the seq2seq model, to predict shortcuts such as lowercasing and identity copy for robustness on long input sequences such as URLs.

Dependency Parsing. Stanza parses each sentence for its syntactic structure, where each word in the sentence is assigned a syntactic head that is either another word in the sentence, or in the case of the root word, an artificial root symbol. We implement a Bi-LSTM-based deep biaffine neural dependency parser (Dozat and Manning, 2017). We further augment this model with two linguistically motivated features: one that predicts the linearization order of two words in a given language, and the other that predicts the typical distance in linear order between them. We have previously shown that these features significantly improve parsing accuracy (Qi et al., 2018).

Named Entity Recognition. For each input sentence, Stanza also recognizes named entities in it (e.g., person names, organizations, etc.). For NER we adopt the contextualized string representation-based sequence tagger from Akbik et al. (2018). We first train a forward and a backward character-level LSTM language model, and at tagging time we concatenate the representations at the end of each word position from both language models with word embeddings, and feed the result into a standard one-layer Bi-LSTM sequence tagger with a conditional random field (CRF)-based decoder.

2.2 CoreNLP Client

Stanford's Java CoreNLP software provides a comprehensive set of NLP tools especially for the English language. However, these tools are not easily accessible with Python, the programming language of choice for many NLP practitioners, due to the lack of official support. To facilitate the use of CoreNLP from Python, we take advantage of the existing server interface in CoreNLP, and implement a robust client as its Python interface.

When the CoreNLP client is instantiated, Stanza will automatically start the CoreNLP server as a local process. The client then communicates with the server through its RESTful APIs, after which annotations are transmitted in Protocol Buffers, and converted back to native Python objects. Users can also specify JSON or XML as annotation format. To ensure robustness, while the client is being used, Stanza periodically checks the health of the server, and restarts it if necessary.

3 System Usage

Stanza's user interface is designed to allow quick out-of-the-box processing of multilingual text. To achieve this, Stanza supports automated model download via Python code and pipeline customization with processors of choice. Annotation results can be accessed as native Python objects to allow for flexible post-processing.

3.1 Neural Pipeline Interface

Stanza's neural NLP pipeline can be initialized with the Pipeline class, taking language name as an argument. By default, all processors will be loaded and run over the input text; however, users can also specify the processors to load and run with a list of processor names as an argument. Users can additionally specify other processor-level properties, such as batch sizes used by processors, at initialization time.

The following code snippet shows a minimal usage of Stanza for downloading the Chinese model, annotating a sentence with customized processors, and printing out all annotations:

    import stanza
    # download Chinese model
    stanza.download('zh')
    # initialize Chinese neural pipeline
    nlp = stanza.Pipeline('zh', processors='tokenize,pos,ner')
    # run annotation over a sentence
    doc = nlp('斯坦福是一所私立研究型大学。')
    print(doc)

After all processors are run, a Document instance will be returned, which stores all annotation results. Within a Document, annotations are further stored in Sentences, Tokens and Words in a top-down fashion (Figure 1). The following code snippet demonstrates how to access the text and POS tag of each word in a document and all named entities in the document:
    # print the text and POS of all words
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.pos)
    # print all entities in the document
    print(doc.entities)
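
To make the top-down layout concrete, the following is a minimal sketch of the Document, Sentence, Token, Word hierarchy using plain dataclasses. These classes are simplified stand-ins written for illustration, not Stanza's own implementation; the tokenization and POS tags given for the French example from Figure 2 are likewise illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    """A syntactic word: the unit that carries POS and other annotations."""
    text: str
    pos: Optional[str] = None


@dataclass
class Token:
    """A contiguous span of input characters; a multi-word token holds several words."""
    text: str
    words: List[Word] = field(default_factory=list)


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)

    @property
    def words(self) -> List[Word]:
        # Flatten tokens into syntactic words, preserving their order.
        return [word for token in self.tokens for word in token.words]


@dataclass
class Document:
    sentences: List[Sentence] = field(default_factory=list)


# "L'Association des Hôtels" (Figure 2); tags are illustrative, not model output.
sentence = Sentence(tokens=[
    Token("L'", [Word("L'", "DET")]),
    Token("Association", [Word("Association", "NOUN")]),
    Token("des", [Word("de", "ADP"), Word("les", "DET")]),  # multi-word token
    Token("Hôtels", [Word("Hôtels", "NOUN")]),
])
toy_doc = Document(sentences=[sentence])

# Same access pattern as the snippet above: iterate over words, not tokens.
for sent in toy_doc.sentences:
    for word in sent.words:
        print(word.text, word.pos)
```

In this sketch the sentence has four tokens but five words, because the token des contributes the two syntactic words de and les.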
Stanza is designed to be run on different hardware devices. By default, CUDA
devices will be used whenever they are visible to the pipeline; otherwise,
CPUs will be used. However, users can force all computation to be run on CPUs
by setting use_gpu=False at initialization time.
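The device fallback described above can be summarized as a small decision rule. The sketch below is only an illustration of that rule: resolve_device and its cuda_visible parameter are hypothetical names introduced here, not part of Stanza. The commented lines show how the use_gpu option would be passed when constructing a pipeline, assuming Stanza and the English models are installed.

```python
def resolve_device(use_gpu: bool = True, cuda_visible: bool = False) -> str:
    """Return "cuda" only when GPU use is allowed and a CUDA device is visible.

    Hypothetical helper mirroring the behavior described in the text;
    it is not Stanza's internal logic.
    """
    return "cuda" if (use_gpu and cuda_visible) else "cpu"


# Default behavior: use CUDA when a device is visible, otherwise the CPU.
print(resolve_device(use_gpu=True, cuda_visible=True))
print(resolve_device(use_gpu=True, cuda_visible=False))

# Forcing CPU computation even when a CUDA device is visible.
print(resolve_device(use_gpu=False, cuda_visible=True))

# With Stanza installed and English models downloaded, the same choice
# is expressed at initialization time:
#   import stanza
#   nlp = stanza.Pipeline('en', use_gpu=False)
```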
3.2     CoreNLP Client Interface

The CoreNLP client interface is designed in a way that the actual
communication with the backend CoreNLP server is transparent to the user. To
annotate an input text with the CoreNLP client, a CoreNLPClient instance needs
to be initialized, with an optional list of CoreNLP annotators. After the
annotation is complete, results will be accessible as native Python objects.

This code snippet shows how to establish a CoreNLP client and obtain the NER
and coreference annotations of an English sentence:

    from stanza.server import CoreNLPClient
    # start a CoreNLP client
    with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma',
                                   'ner', 'parse', 'coref']) as client:
        # run annotation over input
        ann = client.annotate('Emily said that she liked the movie.')
        # access all entities
        for sent in ann.sentence:
            print(sent.mentions)
        # access coreference annotations
        print(ann.corefChain)

With the client interface, users can annotate text in 6 languages as supported
by CoreNLP.

3.3     Interactive Web-based Demo

To help visualize documents and their annotations generated by Stanza, we
build an interactive web demo that runs the pipeline interactively. For all
languages and all annotations Stanza provides in those languages, we generate
predictions from the models trained on the largest treebank/NER dataset, and
visualize the result with the Brat rapid annotation tool.4 This demo runs in a
client/server architecture, and annotation is performed on the server side. We
make one instance of this demo publicly available at http://stanza.run/. It
can also be run locally with proper Python libraries installed. An example of
running Stanza on a German sentence can be found in Figure 3.

    4 https://brat.nlplab.org/

Figure 3: Stanza annotates a German sentence, as visualized by our interactive
demo. Note am is expanded into syntactic words an and dem before downstream
analyses are performed.

3.4     Training Pipeline Models

For all neural processors, Stanza provides command-line interfaces for users
to train their own customized models. To do this, users need to prepare the
training and development data in compatible formats (i.e., CoNLL-U format for
the Universal Dependencies pipeline and BIO format column files for the NER
model). The following command trains a neural dependency parser with
user-specified training and development data:

    $ python -m stanza.models.parser \
        --train_file train.conllu \
        --eval_file dev.conllu \
        --gold_file dev.conllu \
        --output_file output.conllu

4     Performance Evaluation

To establish benchmark results and compare with other popular toolkits, we
trained and evaluated Stanza on a total of 112 datasets. All pretrained models
are publicly downloadable.

Datasets. We train and evaluate Stanza's tokenizer/sentence splitter, MWT
expander, POS/UFeats tagger, lemmatizer, and dependency parser with the
Universal Dependencies v2.5 treebanks (Zeman et al., 2019). For training we
use 100 treebanks from this release that have non-copyrighted training data,
and for treebanks that do not include development data, we randomly split out
20% of