Improving Math Word Problems with Pre-trained Knowledge and Hierarchical Reasoning

Weijiang Yu, Yingpeng Wen, Fudan Zheng, Nong Xiao*
School of Computer Science and Engineering, Sun Yat-sen University
weijiangyu8@gmail.com, {wenyp6,zhengfd3}@mail2.sysu.edu.cn, xiaon6@mail.sysu.edu.cn
*Corresponding author.

Abstract

Recent algorithms for math word problems (MWP) neglect outside knowledge that is not present in the problems. Most of them capture only word-level relationships and fail to build hierarchical reasoning, as humans do, for mining the contextual structure between words and sentences. In this paper, we propose a Reasoning with Pre-trained Knowledge and Hierarchical Structure (RPKHS) network, which contains a pre-trained knowledge encoder and a hierarchical reasoning encoder. Firstly, our pre-trained knowledge encoder reasons about the MWP using outside knowledge from pre-trained transformer-based models. Secondly, the hierarchical reasoning encoder seamlessly integrates word-level and sentence-level reasoning to bridge the entity and context domains on MWP. Extensive experiments show that our RPKHS significantly outperforms state-of-the-art approaches on two large-scale commonly-used datasets, boosting accuracy from 77.4% to 83.9% on Math23K, from 75.5% to 82.2% on Math23K with 5-fold cross-validation, and from 83.7% to 89.8% on MAWPS. More extensive ablations demonstrate the effectiveness and interpretability of our proposed method.

Problem: Conner has 25000 dollars in his bank account. Every month he spends 1500 dollars. He does not add money to the account. How much money will Conner have in his account after 8 months?
Equation: x = 25000.0 − (1500.0 ∗ 8.0)
Solution: 13000.0

Table 1: An example of the math word problem task. Given a natural language description of a mathematical problem, the model is required to infer a formal math equation and the final quantity solution.

1 Introduction

Math Word Problem (MWP) is a reasoning task for answering a mathematical query based on a problem description, and is an interdisciplinary research topic bridging mathematics and natural language processing. As shown in Table 1, a short narrative describes a problem and poses a question about an unknown quantity.

In recent years, research on MWP using deep learning methods has been gaining increasing attention. Early research mainly focuses on Seq2Seq-based models (Sutskever et al., 2014; Ling et al., 2017b; Wang et al., 2017; Huang et al., 2018). These Seq2Seq-based methods aim to train an end-to-end model from scratch on the training dataset. Other research focuses on developing structure-based approaches (Xie and Sun, 2019a; Wang et al., 2018a, 2019b; Liu et al., 2019a; Zhang et al., 2020b; Li et al., 2020b; Hong et al., 2021; Li et al., 2020a) that incorporate parsing trees into neural models, producing promising results in generating solution expressions for the MWP.

To answer such a question, humans not only need to parse the question and understand the context but also use external knowledge. However, previous methods learn the textual description purely from the short and limited narrative, without any background knowledge that is not present in the description, which restrains the ability of the models to infer the MWP from a global perspective. Moreover, current methods mainly focus on designing diverse entity-level structures for word-level reasoning rather than bridging the hierarchical reasoning between the entity (word-level) and context (sentence-level) domains. Obviously, single-level reasoning alone is not enough to solve the MWP. In this paper, we propose Reasoning with Pre-trained Knowledge and Hierarchical Structure (RPKHS) to jointly address these two limitations.
Figure 1: (a) Word-level reasoning builds the relationship between the words of the textual description and can also be considered entity-level reasoning. (b) Sentence-level reasoning mines the intra-relationships of the sentences in the paragraph. (c) Hierarchical reasoning jointly excavates the intra-relationships and inter-relationships between words and sentences of the same paragraph.

Our RPKHS, as shown in Figure 2, consists of two encoders, namely a pre-trained knowledge encoder and a hierarchical reasoning encoder, and a tree-structured decoder. It incorporates implicit linguistic knowledge into the model via the pre-trained knowledge encoder and generates a structural representation via our hierarchical reasoning encoder. The outputs of the two encoders are fed into a tree-structured decoder (Xie and Sun, 2019b) for final prediction.

To the best of our knowledge, we are the first to study the application of pre-trained knowledge to the MWP task. Implicit knowledge is embedded in non-symbolic form, such as the weights of a neural network derived from annotated data or large-scale unsupervised language training. Recently, Transformer-based (Vaswani et al., 2017) and specifically BERT-based (Devlin et al., 2019b; Liu et al., 2019c) models have been proposed that incorporate large-scale linguistic pre-training, implicitly capturing language-based knowledge. This type of knowledge can be quite useful for parsing the textual description. For example, consider two sentences: 'He has 25000 dollars in his bank account.' and 'Paul appeared before the faculty to account for his various misdemeanors.' The word 'account' has entirely different meanings in the two sentences because of the different scene-aware descriptions. Hence, the diverse semantics of each word are richly represented in the implicit pre-trained knowledge. Such knowledge can also be regarded as a huge implicit vocabulary that endows each word with a rich representation, helping the model parse the correct semantics of words from complex text. In this paper, we take advantage of the implicit knowledge in pre-trained RoBERTa (Liu et al., 2019c) and analyze the effect of various kinds of pre-trained knowledge on the MWP task.

Current methods mainly learn the MWP by building word-level reasoning (as shown in Figure 1 (a)) with GNNs (Zhang et al., 2020b; Li et al., 2020b) or Seq2Seq models (Wang et al., 2017). They seldom consider modeling hierarchical structure. Since the descriptions of MWP have a hierarchical structure (words form sentences, sentences form a narrative), we likewise construct hierarchical reasoning (as shown in Figure 1 (c)) by first building representations of sentences from words, and then aggregating those into a whole narrative representation.

It is observed that different words and sentences in a mathematical narrative are differentially informative. The importance of words and sentences is highly context-dependent, i.e., the same word or sentence may be differentially important in different contexts (e.g., in '5 dollars' and '5 pencils', the word '5' has different meanings). To include sensitivity to this fact, our model includes two levels of reasoning mechanisms, one at the word level and one at the sentence level. They lead the model to pay more or less attention to individual words and sentences when constructing the representation of the narrative. Taking the example shown in Table 1, intuitively the first, second and fourth sentences carry stronger information for predicting the solution, and within these sentences the words '25000 dollars' and 'every month' contribute more to inferring the math-aware result. In this paper, we propose a hierarchical reasoning encoder to achieve this functionality.
Contributions. (1) As far as we know, we are the first to explore pre-trained knowledge on the MWP task, via our pre-trained knowledge encoder. (2) We propose a hierarchical reasoning encoder to seamlessly integrate word-level and sentence-level reasoning, bridging the entity and context domains on MWP. It can provide insight into which words and sentences contribute to the prediction, which can be of value in applications and analysis. (3) Our RPKHS outperforms previous approaches by a significant margin.

2 Related Work

The MWP is the task of translating a short paragraph consisting of multiple short sentences into target mathematical equations. Previous approaches usually solve the MWP using rule-based methods (Yuhui et al., 2010; Bakman, 2007), statistical machine learning methods (Kushman et al., 2014; Mitra and Baral, 2016; Roy and Roth, 2018; Zou and Lu, 2019), semantic parsing methods (Shi et al., 2015; Roy and Roth, 2015; Huang et al., 2017) and deep learning methods (Ling et al., 2017a; Wang et al., 2018b; Liu et al., 2019b; Wang et al., 2017; Zhang et al., 2020a). Recently, deep learning based methods have received more attention for their significant improvements. Wang et al. (2017) proposed a Seq2Seq-based model to directly map the linguistic text to a solution. Wang et al. (2018b) and Chiang and Chen (2019) implicitly modeled tree-based structure for decoding the MWP expressions, while Wang et al. (2019a), Liu et al. (2019b) and Xie and Sun (2019b) optimized the decoder via an explicit tree structure. Some research focused on graph structures for word-level reasoning. For example, Zhang et al. (2020a) built two customized graphs for enriching the quantity representations in the problem, and Li et al. (2020b) presented a graph-to-tree encoder-decoder framework for grammar parsing.

However, these methods ignore the sentence-level relationships and the correlation between words and sentences. Different from previous methods, we propose to use hierarchical reasoning containing both word-level and sentence-level reasoning. Besides, we are the first to explore the effect of implicit knowledge from pre-trained neural network weights on the task of math word problems.

3 Methodology

3.1 Overview

In this section, we explain the architecture and design of our proposed RPKHS network (i.e., Reasoning with Pre-trained Knowledge and Hierarchical Structure), composed of a pre-trained knowledge encoder, a hierarchical reasoning encoder and a tree-structured decoder, which together incorporate outside knowledge into the model and bridge the hierarchical reasoning between the entity (word-level) and context (sentence-level) domains. The overview of our RPKHS is illustrated in Figure 2. Our contributions mainly concern the design of a joint-learning framework and two innovative encoders for the MWP task, which are presented in detail in the following sections.

3.2 Problem Formulation

A math word problem (MWP) can be formulated as a pair (P, E), where P is the problem text and E is a solution expression. Assume a problem description has L sentences s_i, and each sentence contains T words; w_it with t ∈ [1, T] denotes the t-th word of the i-th sentence. Our proposed encoders project the raw problem description into a vector representation, on which we build a tree-structured decoder to predict the mathematical expression.
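To make this formulation concrete, here is a minimal, illustrative Python sketch (not part of the paper) of one (P, E) instance and its segmentation into sentences s_i and words w_it; the class name, its fields, and the naive sentence splitter are assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class MWPInstance:
    """One (P, E) pair: the problem text P and the target solution expression E."""
    problem_text: str   # P: the raw narrative
    expression: str     # E, e.g. "x = 25000.0 - (1500.0 * 8.0)"

    def sentences(self) -> List[str]:
        """The L sentences s_1, ..., s_L of P (naive split on sentence-final punctuation)."""
        return [s.strip() for s in re.split(r"(?<=[.?!])\s+", self.problem_text) if s.strip()]

    def words(self) -> List[List[str]]:
        """w_it: the t-th whitespace-separated word of the i-th sentence."""
        return [s.split() for s in self.sentences()]

example = MWPInstance(
    problem_text=("Conner has 25000 dollars in his bank account. Every month he spends "
                  "1500 dollars. He does not add money to the account. How much money "
                  "will Conner have in his account after 8 months?"),
    expression="x = 25000.0 - (1500.0 * 8.0)",
)
print(len(example.sentences()))  # L = 4
print(example.words()[0][:3])    # ['Conner', 'has', '25000']
```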
Figure 2: Overview of our Reasoning with Pre-trained Knowledge and Hierarchical Structure (RPKHS). The hierarchical reasoning encoder receives the textual embedding and constructs inter-relationships between sentences and words to aggregate semantics across entity and context. The pre-trained knowledge encoder captures a large amount of knowledge about the linguistic world from the pre-trained network weights, and incorporates this implicit knowledge into the input embedding to enrich the input representation. The results of the two encoders are then concatenated as the input of a tree-structured decoder for parsing the target mathematical equation and solution. (Worked example shown in the figure: "If 6 times a number is decreased by 5, the result is 7 more than 3 times the sum of the number and 13. What is the number?"; Equation: (3.0*13.0+7.0+5.0)/(6.0-3.0); Solution: 17.0.)

3.3 Pre-trained Knowledge Encoder

We want to incorporate into our model implicit external knowledge, as well as math-aware knowledge that can be learned from the training set. Language models, and especially transformer-based language models, have been shown to contain commonsense and factual knowledge (Petroni et al., 2019; Jiang et al., 2019). We adopt this direction in our model and build an encoder pre-trained with RoBERTa (Liu et al., 2019c), which has been pre-trained on huge language corpora (e.g., BooksCorpus (Zhu et al., 2015), Wikipedia (Remy, 2002)) to capture implicit knowledge. We tokenize a description Q using WordPiece (Wu et al., 2016) as in BERT (Devlin et al., 2019a), giving a sequence of |Q| tokens, embed them with the pre-trained RoBERTa embeddings, and add RoBERTa's positional encoding, giving a sequence of d-dimensional token representations x_1^Q, ..., x_|Q|^Q. We feed these into the transformer-based pre-trained knowledge encoder, fine-tuning the representation during training. We mean-pool the output of all transformer steps to get our combined implicit knowledge representation Y_p.
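As an illustration of this encoding step, the following sketch uses the Hugging Face transformers library with a roberta-base checkpoint; the checkpoint choice and the reading of "mean-pool the output of all transformer steps" as mean-pooling over the token positions of the last layer are our assumptions, not details confirmed by the text above.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")   # fine-tuned jointly during training

problem = ("Conner has 25000 dollars in his bank account. Every month he spends 1500 dollars. "
           "He does not add money to the account. "
           "How much money will Conner have in his account after 8 months?")

# Tokenize Q into |Q| sub-word tokens; positional information is handled inside the model.
inputs = tokenizer(problem, return_tensors="pt")

# Contextualized token representations x_1, ..., x_|Q| of dimension d (768 for roberta-base).
hidden = encoder(**inputs).last_hidden_state              # shape: (1, |Q|, d)

# Mask-aware mean pooling over the token axis to obtain the implicit knowledge vector Y_p.
mask = inputs["attention_mask"].unsqueeze(-1).float()     # (1, |Q|, 1)
Y_p = (hidden * mask).sum(dim=1) / mask.sum(dim=1)        # shape: (1, d)
print(Y_p.shape)                                          # torch.Size([1, 768])
```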
3.4 Hierarchical Reasoning Encoder

The proposed hierarchical reasoning encoder takes into account that different parts of a math description do not carry equally relevant information. Moreover, determining the relevant sections involves modeling the interactions among the words, not just their isolated presence in the text. To address this, the model includes two levels of reasoning mechanisms, one at the word level and the other at the sentence level, which let the model pay more or less attention to individual words and sentences when constructing the whole description representation. The hierarchical reasoning encoder is composed of two layers: the first is our word-level reasoning layer and the second is the sentence-level reasoning layer. In the following, we first introduce the GRU-based operation commonly used in both layers, and then present the details of the two reasoning layers.

GRU-based Sequence Encoding. The GRU (Bahdanau et al., 2015) uses a gating mechanism to track the state of a sequence without using separate memory cells. There are two types of gates: the reset gate $r_t$ and the update gate $z_t$. They jointly control how information is updated in the state. At time $t$, the GRU computes the new state as

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t. \tag{1}$$

This is a linear interpolation between the previous state $h_{t-1}$ and the current candidate state $\hat{h}_t$ computed from new sequence information. The gate $z_t$ decides how much past information is kept and how much new information is added, and is updated as

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \tag{2}$$

where $x_t$ is the sequence vector at time $t$. The candidate state $\hat{h}_t$ is computed by

$$\hat{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h), \tag{3}$$

where $r_t$ is the reset gate, which controls how much the previous state contributes to the candidate state. If $r_t$ is zero, the past state is forgotten. The reset gate is updated by

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r). \tag{4}$$
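For reference, below is a minimal PyTorch sketch of a GRU cell that follows Equations (1)-(4) literally; it is illustrative rather than the authors' implementation, and in practice torch.nn.GRU could be used instead (note that its update gate weights the candidate state rather than the previous state, i.e., the roles of z_t and 1 - z_t are swapped). In the hierarchical encoder, one such recurrence would run over the words of each sentence to produce sentence vectors, and a second over those sentence vectors to produce the narrative representation.

```python
import torch
import torch.nn as nn

class GRUCellEq1to4(nn.Module):
    """GRU recurrence exactly as in Eqs. (1)-(4): update gate z_t, reset gate r_t,
    candidate state h_hat_t, and new state h_t."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.W_z = nn.Linear(input_dim, hidden_dim)               # W_z x_t + b_z
        self.U_z = nn.Linear(hidden_dim, hidden_dim, bias=False)  # U_z h_{t-1}
        self.W_r = nn.Linear(input_dim, hidden_dim)               # W_r x_t + b_r
        self.U_r = nn.Linear(hidden_dim, hidden_dim, bias=False)  # U_r h_{t-1}
        self.W_h = nn.Linear(input_dim, hidden_dim)               # W_h x_t + b_h
        self.U_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # U_h h_{t-1}

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        z_t = torch.sigmoid(self.W_z(x_t) + self.U_z(h_prev))       # Eq. (2)
        r_t = torch.sigmoid(self.W_r(x_t) + self.U_r(h_prev))       # Eq. (4)
        h_hat = torch.tanh(self.W_h(x_t) + r_t * self.U_h(h_prev))  # Eq. (3)
        return (1.0 - z_t) * h_prev + z_t * h_hat                   # Eq. (1)

# Running the cell over the word embeddings of one sentence (batch of 2, 10 words, 300-d inputs);
# the final hidden state could serve as a simple sentence representation in the word-level layer.
cell = GRUCellEq1to4(input_dim=300, hidden_dim=256)
x = torch.randn(2, 10, 300)
h = torch.zeros(2, 256)
for t in range(x.size(1)):
    h = cell(x[:, t, :], h)
print(h.shape)  # torch.Size([2, 256])
```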