Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Peng Qi*  Yuhao Zhang*  Yuhui Zhang  Jason Bolton  Christopher D. Manning
Stanford University
Stanford, CA 94305
{pengqi, yuhaozhang, yuhuiz}@stanford.edu
{jebolton, manning}@stanford.edu

* Equal contribution. Order decided by a tossed coin.

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 101–108, July 5 - July 10, 2020. © 2020 Association for Computational Linguistics

Abstract

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.

Figure 1: Overview of Stanza's neural NLP pipeline. Stanza takes multilingual text as input, and produces annotations accessible as native Python objects. Besides this neural pipeline, Stanza also features a Python client interface to the Java CoreNLP software.
[Diagram: raw text in many languages (EN, FR, ZH, DE, AR, KO, ES, RU, JA, NL, VI, HI) passes through the processors TOKENIZE (tokenization and sentence split), MWT (multi-word token expansion), LEMMA (lemmatization), POS (POS and morphological tagging), DEPPARSE (dependency parsing), and NER (named entity recognition), yielding Document, Sentence, Token, and Word objects.]

1 Introduction

The growing availability of open-source natural language processing (NLP) toolkits has made it easier for users to build tools with sophisticated linguistic processing. While existing NLP toolkits such as CoreNLP (Manning et al., 2014), FLAIR (Akbik et al., 2019), spaCy¹, and UDPipe (Straka, 2018) have had wide usage, they also suffer from several limitations. First, existing toolkits often support only a few major languages. This has significantly limited the community's ability to process multilingual text. Second, widely used tools are sometimes under-optimized for accuracy either due to a focus on efficiency (e.g., spaCy) or use of less powerful models (e.g., CoreNLP), potentially misleading downstream applications and insights obtained from them. Third, some tools assume input text has been tokenized or annotated with other tools, lacking the ability to process raw text within a unified framework. This has limited their wide applicability to text from diverse sources.

We introduce Stanza², a Python natural language processing toolkit supporting many human languages. As shown in Table 1, compared to existing widely-used NLP toolkits, Stanza has the following advantages:

• From raw text to annotations. Stanza features a fully neural pipeline which takes raw text as input, and produces annotations including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
• Multilinguality. Stanza's architectural design is language-agnostic and data-driven, which allows us to release models supporting 66 languages, by training the pipeline on the Universal Dependencies (UD) treebanks and other multilingual corpora.
• State-of-the-art performance. We evaluate Stanza on a total of 112 datasets, and find its neural pipeline adapts well to text of different genres, achieving state-of-the-art or competitive performance at each step of the pipeline.

System    # Human     Programming   Raw Text     Fully    Pretrained   State-of-the-art
          Languages   Language      Processing   Neural   Models       Performance
CoreNLP   6           Java          ✦            -        ✦            -
FLAIR     12          Python        -            ✦        ✦            ✦
spaCy     10          Python        ✦            -        ✦            -
UDPipe    61          C++           ✦            -        ✦            ✦
Stanza    66          Python        ✦            ✦        ✦            ✦

Table 1: Feature comparisons of Stanza against other popular natural language processing toolkits.

Additionally, Stanza features a Python interface to the widely used Java CoreNLP package, allowing access to additional tools such as coreference resolution and relation extraction.

Stanza is fully open source and we make pretrained models for all supported languages and datasets available for public download. We hope Stanza can facilitate multilingual NLP research and applications, and drive future research that produces insights from human languages.

¹ https://spacy.io/
² The toolkit was called StanfordNLP prior to v1.0.0.

2 System Design and Architecture

At the top level, Stanza consists of two individual components: (1) a fully neural multilingual NLP pipeline; (2) a Python client interface to the Java Stanford CoreNLP software. In this section we introduce their designs.

2.1 Neural Multilingual NLP Pipeline

Stanza's neural pipeline consists of models that range from tokenizing raw text to performing syntactic analysis on entire sentences (see Figure 1). All components are designed with processing many human languages in mind, with high-level design choices capturing common phenomena in many languages and data-driven models that learn the difference between these languages from data. Moreover, the implementation of Stanza components is highly modular, and reuses basic model architectures when possible for compactness. We highlight the important design choices here, and refer the reader to Qi et al. (2018) for modeling details.

Tokenization and Sentence Splitting. When presented raw text, Stanza tokenizes it and groups tokens into sentences as the first step of processing. Unlike most existing toolkits, Stanza combines tokenization and sentence segmentation from raw text into a single module. This is modeled as a tagging problem over character sequences, where the model predicts whether a given character is the end of a token, end of a sentence, or end of a multi-word token (MWT, see Figure 2).³ We choose to predict MWTs jointly with tokenization because this task is context-sensitive in some languages.

    (fr) L'Association des Hôtels
    (en) The Association of Hotels
    (fr) Il y a des hôtels en bas de la rue
    (en) There are hotels down the street

Figure 2: An example of multi-word tokens in French. The des in the first sentence corresponds to two syntactic words, de and les; the second des is a single word.

Multi-word Token Expansion. Once MWTs are identified by the tokenizer, they are expanded into the underlying syntactic words as the basis of downstream processing. This is achieved with an ensemble of a frequency lexicon and a neural sequence-to-sequence (seq2seq) model, to ensure that frequently observed expansions in the training set are always robustly expanded while maintaining flexibility to model unseen words statistically.

POS and Morphological Feature Tagging. For each word in a sentence, Stanza assigns it a part-of-speech (POS), and analyzes its universal morphological features (UFeats, e.g., singular/plural, 1st/2nd/3rd person, etc.). To predict POS and UFeats, we adopt a bidirectional long short-term memory network (Bi-LSTM) as the basic architecture. For consistency among universal POS (UPOS), treebank-specific POS (XPOS), and UFeats, we adopt the biaffine scoring mechanism from Dozat and Manning (2017) to condition XPOS and UFeats prediction on that of UPOS.

Lemmatization. Stanza also lemmatizes each word in a sentence to recover its canonical form (e.g., did→do). Similar to the multi-word token expander, Stanza's lemmatizer is implemented as an ensemble of a dictionary-based lemmatizer and a neural seq2seq lemmatizer. An additional classifier is built on the encoder output of the seq2seq model, to predict shortcuts such as lowercasing and identity copy for robustness on long input sequences such as URLs.

Dependency Parsing. Stanza parses each sentence for its syntactic structure, where each word in the sentence is assigned a syntactic head that is either another word in the sentence, or in the case of the root word, an artificial root symbol. We implement a Bi-LSTM-based deep biaffine neural dependency parser (Dozat and Manning, 2017). We further augment this model with two linguistically motivated features: one that predicts the linearization order of two words in a given language, and the other that predicts the typical distance in linear order between them. We have previously shown that these features significantly improve parsing accuracy (Qi et al., 2018).

Named Entity Recognition. For each input sentence, Stanza also recognizes named entities in it (e.g., person names, organizations, etc.). For NER we adopt the contextualized string representation-based sequence tagger from Akbik et al. (2018). We first train a forward and a backward character-level LSTM language model, and at tagging time we concatenate the representations at the end of each word position from both language models with word embeddings, and feed the result into a standard one-layer Bi-LSTM sequence tagger with a conditional random field (CRF)-based decoder.

³ Following Universal Dependencies (Nivre et al., 2020), we make a distinction between tokens (contiguous spans of characters in the input text) and syntactic words. These are interchangeable aside from the cases of MWTs, where one token can correspond to multiple words.

2.2 CoreNLP Client

Stanford's Java CoreNLP software provides a comprehensive set of NLP tools especially for the English language. However, these tools are not easily accessible with Python, the programming language of choice for many NLP practitioners, due to the lack of official support. To facilitate the use of CoreNLP from Python, we take advantage of the existing server interface in CoreNLP, and implement a robust client as its Python interface.

When the CoreNLP client is instantiated, Stanza will automatically start the CoreNLP server as a local process. The client then communicates with the server through its RESTful APIs, after which annotations are transmitted in Protocol Buffers, and converted back to native Python objects. Users can also specify JSON or XML as annotation format. To ensure robustness, while the client is being used, Stanza periodically checks the health of the server, and restarts it if necessary.

3 System Usage

Stanza's user interface is designed to allow quick out-of-the-box processing of multilingual text. To achieve this, Stanza supports automated model download via Python code and pipeline customization with processors of choice. Annotation results can be accessed as native Python objects to allow for flexible post-processing.

3.1 Neural Pipeline Interface

Stanza's neural NLP pipeline can be initialized with the Pipeline class, taking language name as an argument. By default, all processors will be loaded and run over the input text; however, users can also specify the processors to load and run with a list of processor names as an argument. Users can additionally specify other processor-level properties, such as batch sizes used by processors, at initialization time.

The following code snippet shows a minimal usage of Stanza for downloading the Chinese model, annotating a sentence with customized processors, and printing out all annotations:

    import stanza
    # download Chinese model
    stanza.download('zh')
    # initialize Chinese neural pipeline
    nlp = stanza.Pipeline('zh', processors='tokenize,pos,ner')
    # run annotation over a sentence
    doc = nlp('斯坦福是一所私立研究型大学。')
    print(doc)

After all processors are run, a Document instance will be returned, which stores all annotation results. Within a Document, annotations are further stored in Sentences, Tokens and Words in a top-down fashion (Figure 1). The following code snippet demonstrates how to access the text and POS tag of each word in a document and all named entities in the document:

    # print the text and POS of all words
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.pos)
    # print all entities in the document
    print(doc.entities)

Stanza is designed to be run on different hardware devices. By default, CUDA devices will be used whenever they are visible by the pipeline, or otherwise CPUs will be used. However, users can force all computation to be run on CPUs by setting use_gpu=False at initialization time.
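The top-down storage of annotations described in Section 3.1 (a Document holding Sentences, which hold Tokens, which hold Words) can be pictured with a few plain dataclasses. The following is a minimal sketch under stated assumptions, not Stanza's implementation: the class definitions, the build_example helper, and the POS tags are hypothetical and hand-written for illustration. It uses the French phrase from Figure 2, where the token des maps to the two syntactic words de and les.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins for illustration only; these are NOT Stanza's classes.


@dataclass
class Word:
    text: str  # a syntactic word
    pos: str   # part-of-speech tag (hand-assigned in this sketch)


@dataclass
class Token:
    text: str  # contiguous span of characters in the input text
    words: List[Word] = field(default_factory=list)  # several words for an MWT


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)

    @property
    def words(self) -> List[Word]:
        # Flatten the tokens into their syntactic words, preserving order.
        return [word for token in self.tokens for word in token.words]


@dataclass
class Document:
    sentences: List[Sentence] = field(default_factory=list)


def build_example() -> Document:
    # "L'Association des Hôtels": "des" is one token but two syntactic words.
    # The tags below are illustrative, not model output.
    tokens = [
        Token("L'", [Word("L'", "DET")]),
        Token("Association", [Word("Association", "NOUN")]),
        Token("des", [Word("de", "ADP"), Word("les", "DET")]),
        Token("Hôtels", [Word("Hôtels", "NOUN")]),
    ]
    return Document([Sentence(tokens)])


doc = build_example()
# Same access pattern as the snippet in Section 3.1.
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.pos)
```

In this sketch the sentence has four tokens but five words, which is the distinction between tokens and syntactic words that the pipeline preserves for downstream processors.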
3.2 CoreNLP Client Interface

The CoreNLP client interface is designed in a way that the actual communication with the backend CoreNLP server is transparent to the user. To annotate an input text with the CoreNLP client, a CoreNLPClient instance needs to be initialized, with an optional list of CoreNLP annotators. After the annotation is complete, results will be accessible as native Python objects.

This code snippet shows how to establish a CoreNLP client and obtain the NER and coreference annotations of an English sentence:

    from stanza.server import CoreNLPClient
    # start a CoreNLP client
    with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner','parse','coref']) as client:
        # run annotation over input
        ann = client.annotate('Emily said that she liked the movie.')
        # access all entities
        for sent in ann.sentence:
            print(sent.mentions)
        # access coreference annotations
        print(ann.corefChain)

With the client interface, users can annotate text in 6 languages as supported by CoreNLP.

3.3 Interactive Web-based Demo

To help visualize documents and their annotations generated by Stanza, we build an interactive web demo that runs the pipeline interactively. For all languages and all annotations Stanza provides in those languages, we generate predictions from the models trained on the largest treebank/NER dataset, and visualize the result with the Brat rapid annotation tool.⁴ This demo runs in a client/server architecture, and annotation is performed on the server side. We make one instance of this demo publicly available at http://stanza.run/. It can also be run locally with proper Python libraries installed.

An example of running Stanza on a German sentence can be found in Figure 3.

Figure 3: Stanza annotates a German sentence, as visualized by our interactive demo. Note am is expanded into syntactic words an and dem before downstream analyses are performed.

⁴ https://brat.nlplab.org/

3.4 Training Pipeline Models

For all neural processors, Stanza provides command-line interfaces for users to train their own customized models. To do this, users need to prepare the training and development data in compatible formats (i.e., CoNLL-U format for the Universal Dependencies pipeline and BIO format column files for the NER model). The following command trains a neural dependency parser with user-specified training and development data:

    $ python -m stanza.models.parser \
        --train_file train.conllu \
        --eval_file dev.conllu \
        --gold_file dev.conllu \
        --output_file output.conllu

4 Performance Evaluation

To establish benchmark results and compare with other popular toolkits, we trained and evaluated Stanza on a total of 112 datasets. All pretrained models are publicly downloadable.

Datasets. We train and evaluate Stanza's tokenizer/sentence splitter, MWT expander, POS/UFeats tagger, lemmatizer, and dependency parser with the Universal Dependencies v2.5 treebanks (Zeman et al., 2019). For training we use 100 treebanks from this release that have non-copyrighted training data, and for treebanks that do not include development data, we randomly split out 20% of