Improving Math Word Problems with Pre-trained Knowledge and Hierarchical Reasoning

Weijiang Yu, Yingpeng Wen, Fudan Zheng, Nong Xiao*
School of Computer Science and Engineering, Sun Yat-sen University
weijiangyu8@gmail.com, {wenyp6,zhengfd3}@mail2.sysu.edu.cn, xiaon6@mail.sysu.edu.cn
*Corresponding author.

Abstract

Recent algorithms for math word problems (MWP) neglect outside knowledge that is not present in the problems. Most of them capture only word-level relationships and fail to build hierarchical reasoning, as humans do, for mining the contextual structure between words and sentences. In this paper, we propose a Reasoning with Pre-trained Knowledge and Hierarchical Structure (RPKHS) network, which contains a pre-trained knowledge encoder and a hierarchical reasoning encoder. Firstly, our pre-trained knowledge encoder reasons about the MWP using outside knowledge from pre-trained transformer-based models. Secondly, the hierarchical reasoning encoder seamlessly integrates word-level and sentence-level reasoning to bridge the entity and context domains on MWP. Extensive experiments show that our RPKHS significantly outperforms state-of-the-art approaches on two large-scale commonly-used datasets, boosting accuracy from 77.4% to 83.9% on Math23K, from 75.5% to 82.2% on Math23K with 5-fold cross-validation, and from 83.7% to 89.8% on MAWPS. More extensive ablations demonstrate the effectiveness and interpretability of our proposed method.

Problem: Conner has 25000 dollars in his bank account. Every month he spends 1500 dollars. He does not add money to the account. How much money will Conner have in his account after 8 months?
Equation: x = 25000.0 − (1500.0 ∗ 8.0)
Solution: 13000.0

Table 1: An example of the math word problem task. Given a natural language description of a mathematical problem, the model is required to infer a formal math equation and the final quantity solution.

1 Introduction

Math Word Problem (MWP) is a reasoning task for answering a mathematical query based on a problem description, and is an interdisciplinary research topic bridging mathematics and natural language processing. As shown in Table 1, a short narrative describes a problem and poses a question about an unknown quantity.

In recent years, research on MWP using deep learning methods has been gaining increasing attention. Early research mainly focuses on Seq2Seq-based models (Sutskever et al., 2014; Ling et al., 2017b; Wang et al., 2017; Huang et al., 2018). These Seq2Seq-based methods aim to train an end-to-end model from scratch on the training dataset. Other research focuses on developing structure-based approaches (Xie and Sun, 2019a; Wang et al., 2018a, 2019b; Liu et al., 2019a; Zhang et al., 2020b; Li et al., 2020b; Hong et al., 2021; Li et al., 2020a) that incorporate parsing trees into neural models, producing promising results in generating solution expressions for the MWP.

To answer such a question, humans not only need to parse the question and understand the context but also use external knowledge. However, previous methods learn the textual description purely from the short and limited narrative, without any background knowledge that is not present in the description, which restrains the ability of the models to infer the MWP from a global perspective. Moreover, current methods mainly focus on designing diverse entity-level structures for word-level reasoning rather than bridging the hierarchical reasoning between the entity (word-level) and context (sentence-level) domains. Obviously, single-level reasoning alone is not enough to solve the MWP. In this paper, we propose Reasoning with Pre-trained Knowledge and Hierarchical Structure (RPKHS) to jointly address these two limitations.
Figure 1: (a) Word-level reasoning builds the relationship between the words of the textual description and can also be considered entity-level reasoning. (b) Sentence-level reasoning mines the intra-relationships of the sentences in the paragraph. (c) Hierarchical reasoning jointly excavates the intra-relationships and inter-relationships between words and sentences of the same paragraph.

Our RPKHS, as shown in Figure 2, consists of two encoders, namely a pre-trained knowledge encoder and a hierarchical reasoning encoder, and a tree-structured decoder. It incorporates implicit linguistic knowledge into the model via the pre-trained knowledge encoder and generates a structural representation via our hierarchical reasoning encoder. The outputs of the two encoders are fed into a tree-structured decoder (Xie and Sun, 2019b) for final prediction.

To the best of our knowledge, we are the first to study the application of pre-trained knowledge to the MWP task. Implicit knowledge is embedded in non-symbolic form, such as the weights of a neural network derived from annotated data or large-scale unsupervised language training. Recently, Transformer-based (Vaswani et al., 2017) and specifically BERT-based (Devlin et al., 2019b; Liu et al., 2019c) models have been proposed that incorporate large-scale linguistic pre-training, implicitly capturing language-based knowledge. This type of knowledge can be quite useful for parsing the textual description. For example, consider two sentences: 'He has 25000 dollars in his bank account.' and 'Paul appeared before the faculty to account for his various misdemeanors.' The word 'account' has entirely different meanings in the two sentences because of the different scene-aware descriptions. Hence, the diverse semantics of each word are richly represented in the implicit pre-trained knowledge. Such knowledge can also be regarded as a huge implicit vocabulary that endows each word with a rich representation, helping the model parse the correct semantics of words from complex text. In this paper, we take advantage of the implicit knowledge in pre-trained RoBERTa (Liu et al., 2019c) and analyze the effect of various kinds of pre-trained knowledge on the MWP task.

Current methods mainly learn the MWP by building word-level reasoning (as shown in Figure 1 (a)) with GNNs (Zhang et al., 2020b; Li et al., 2020b) or Seq2Seq models (Wang et al., 2017). They seldom consider modeling hierarchical structure. Since the descriptions of MWP have a hierarchical structure (words form sentences, sentences form a narrative), we likewise construct hierarchical reasoning (as shown in Figure 1 (c)) by first building representations of sentences from words, and then aggregating those into a whole narrative representation.

It is observed that different words and sentences in a mathematical narrative are differentially informative. The importance of words and sentences is highly context-dependent, i.e., the same word or sentence may be differentially important in different contexts (e.g., in '5 dollars' and '5 pencils', the word '5' has different meanings). To include sensitivity to this fact, our model includes two levels of reasoning mechanisms, one at the word level and one at the sentence level. They lead the model to pay more or less attention to individual words and sentences when constructing the representation of the narrative. Taking the example shown in Table 1, intuitively the first, second and fourth sentences carry stronger information for predicting the solution, and within these sentences the words '25000 dollars' and 'every month' contribute more to inferring the math-aware result. In this paper, we propose a hierarchical reasoning encoder to achieve this functionality.
Contributions. (1) As far as we know, we are the first to explore pre-trained knowledge on the MWP task, via our pre-trained knowledge encoder. (2) We propose a hierarchical reasoning encoder to seamlessly integrate word-level and sentence-level reasoning, bridging the entity and context domains on MWP. It can provide insight into which words and sentences contribute to the prediction, which can be of value in applications and analysis. (3) Our RPKHS outperforms previous approaches by a significant margin.

2 Related Work

The MWP is the task of translating a short paragraph consisting of multiple short sentences into target mathematical equations. Previous approaches usually solve the MWP using rule-based methods (Yuhui et al., 2010; Bakman, 2007), statistical machine learning methods (Kushman et al., 2014; Mitra and Baral, 2016; Roy and Roth, 2018; Zou and Lu, 2019), semantic parsing methods (Shi et al., 2015; Roy and Roth, 2015; Huang et al., 2017) and deep learning methods (Ling et al., 2017a; Wang et al., 2018b; Liu et al., 2019b; Wang et al., 2017; Zhang et al., 2020a). Recently, deep learning based methods have received more attention for their significant improvements. Wang et al. (2017) proposed a Seq2Seq-based model to directly map the linguistic text to a solution. Wang et al. (2018b) and Chiang and Chen (2019) implicitly modeled tree-based structure for decoding the MWP expressions, while Wang et al. (2019a), Liu et al. (2019b) and Xie and Sun (2019b) optimized the decoder via an explicit tree structure. Some research focused on graph structures for word-level reasoning. For example, Zhang et al. (2020a) built two customized graphs for enriching the quantity representations in the problem, and Li et al. (2020b) presented a graph-to-tree encoder-decoder framework for grammar parsing.

However, these methods ignore the sentence-level relationships and the correlation between words and sentences. Different from previous methods, we propose to use hierarchical reasoning containing both word-level and sentence-level reasoning. Besides, we are the first to explore the effect of implicit knowledge from pre-trained neural network weights on the task of math word problems.

3 Methodology

3.1 Overview

In this section, we explain the architecture and design of our proposed RPKHS network (i.e., Reasoning with Pre-trained Knowledge and Hierarchical Structure), composed of a pre-trained knowledge encoder, a hierarchical reasoning encoder and a tree-structured decoder, which together incorporate outside knowledge into the model and bridge the hierarchical reasoning between the entity (word-level) and context (sentence-level) domains. The overview of our RPKHS is illustrated in Figure 2. Our contributions mainly concern the design of a joint-learning framework and two innovative encoders for the MWP task, which are presented in detail in the following sections.

3.2 Problem Formulation

A math word problem (MWP) can be formulated as a pair (P, E), where P is the problem text and E is a solution expression. Assume a problem description has L sentences s_i, and each sentence contains T words; w_it with t ∈ [1, T] denotes the t-th word of the i-th sentence. Our proposed encoders project the raw problem description into a vector representation, on which we build a tree-structured decoder to predict the mathematical expression.
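To make this formulation concrete, here is a minimal, illustrative Python sketch (not part of the paper) of one (P, E) instance and its segmentation into sentences s_i and words w_it; the class name, its fields, and the naive sentence splitter are assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class MWPInstance:
    """One (P, E) pair: the problem text P and the target solution expression E."""
    problem_text: str   # P: the raw narrative
    expression: str     # E, e.g. "x = 25000.0 - (1500.0 * 8.0)"

    def sentences(self) -> List[str]:
        """The L sentences s_1, ..., s_L of P (naive split on sentence-final punctuation)."""
        return [s.strip() for s in re.split(r"(?<=[.?!])\s+", self.problem_text) if s.strip()]

    def words(self) -> List[List[str]]:
        """w_it: the t-th whitespace-separated word of the i-th sentence."""
        return [s.split() for s in self.sentences()]

example = MWPInstance(
    problem_text=("Conner has 25000 dollars in his bank account. Every month he spends "
                  "1500 dollars. He does not add money to the account. How much money "
                  "will Conner have in his account after 8 months?"),
    expression="x = 25000.0 - (1500.0 * 8.0)",
)
print(len(example.sentences()))  # L = 4
print(example.words()[0][:3])    # ['Conner', 'has', '25000']
```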
Figure 2: Overview of our Reasoning with Pre-trained Knowledge and Hierarchical Structure (RPKHS). The hierarchical reasoning encoder receives the textual embedding and constructs inter-relationships between sentences and words to aggregate semantics across entity and context. The pre-trained knowledge encoder captures a large amount of knowledge about the linguistic world from the pre-trained network weights, and incorporates this implicit knowledge into the input embedding to enrich the input representation. The results of the two encoders are then concatenated as the input of a tree-structured decoder for parsing the target mathematical equation and solution. (Worked example shown in the figure: "If 6 times a number is decreased by 5, the result is 7 more than 3 times the sum of the number and 13. What is the number?"; Equation: (3.0*13.0+7.0+5.0)/(6.0-3.0); Solution: 17.0.)

3.3 Pre-trained Knowledge Encoder

We want to incorporate into our model implicit external knowledge, as well as math-aware knowledge that can be learned from the training set. Language models, and especially transformer-based language models, have been shown to contain commonsense and factual knowledge (Petroni et al., 2019; Jiang et al., 2019). We adopt this direction in our model and build an encoder pre-trained with RoBERTa (Liu et al., 2019c), which has been pre-trained on huge language corpora (e.g., BooksCorpus (Zhu et al., 2015), Wikipedia (Remy, 2002)) to capture implicit knowledge. We tokenize a description Q using WordPiece (Wu et al., 2016) as in BERT (Devlin et al., 2019a), giving a sequence of |Q| tokens, embed them with the pre-trained RoBERTa embeddings, and add RoBERTa's positional encoding, giving a sequence of d-dimensional token representations x_1^Q, ..., x_|Q|^Q. We feed these into the transformer-based pre-trained knowledge encoder, fine-tuning the representation during training. We mean-pool the output of all transformer steps to get our combined implicit knowledge representation Y_p.
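As an illustration of this encoding step, the following sketch uses the Hugging Face transformers library with a roberta-base checkpoint; the checkpoint choice and the reading of "mean-pool the output of all transformer steps" as mean-pooling over the token positions of the last layer are our assumptions, not details confirmed by the text above.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")   # fine-tuned jointly during training

problem = ("Conner has 25000 dollars in his bank account. Every month he spends 1500 dollars. "
           "He does not add money to the account. "
           "How much money will Conner have in his account after 8 months?")

# Tokenize Q into |Q| sub-word tokens; positional information is handled inside the model.
inputs = tokenizer(problem, return_tensors="pt")

# Contextualized token representations x_1, ..., x_|Q| of dimension d (768 for roberta-base).
hidden = encoder(**inputs).last_hidden_state              # shape: (1, |Q|, d)

# Mask-aware mean pooling over the token axis to obtain the implicit knowledge vector Y_p.
mask = inputs["attention_mask"].unsqueeze(-1).float()     # (1, |Q|, 1)
Y_p = (hidden * mask).sum(dim=1) / mask.sum(dim=1)        # shape: (1, d)
print(Y_p.shape)                                          # torch.Size([1, 768])
```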
3.4 Hierarchical Reasoning Encoder

The proposed hierarchical reasoning encoder takes into account that different parts of a math description do not carry equally relevant information. Moreover, determining the relevant sections involves modeling the interactions among the words, not just their isolated presence in the text. To address this, the model includes two levels of reasoning mechanisms, one at the word level and the other at the sentence level, which let the model pay more or less attention to individual words and sentences when constructing the whole description representation. The hierarchical reasoning encoder is composed of two layers: the first is our word-level reasoning layer and the second is the sentence-level reasoning layer. In the following, we first introduce the GRU-based operation commonly used in both layers, and then present the details of the two reasoning layers.

GRU-based Sequence Encoding. The GRU (Bahdanau et al., 2015) uses a gating mechanism to track the state of a sequence without using separate memory cells. There are two types of gates: the reset gate $r_t$ and the update gate $z_t$. They jointly control how information is updated in the state. At time $t$, the GRU computes the new state as

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t. \tag{1}$$

This is a linear interpolation between the previous state $h_{t-1}$ and the current candidate state $\hat{h}_t$ computed from new sequence information. The gate $z_t$ decides how much past information is kept and how much new information is added, and is updated as

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \tag{2}$$

where $x_t$ is the sequence vector at time $t$. The candidate state $\hat{h}_t$ is computed by

$$\hat{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h), \tag{3}$$

where $r_t$ is the reset gate, which controls how much the previous state contributes to the candidate state. If $r_t$ is zero, the past state is forgotten. The reset gate is updated by

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r). \tag{4}$$
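For reference, below is a minimal PyTorch sketch of a GRU cell that follows Equations (1)-(4) literally; it is illustrative rather than the authors' implementation, and in practice torch.nn.GRU could be used instead (note that its update gate weights the candidate state rather than the previous state, i.e., the roles of z_t and 1 - z_t are swapped). In the hierarchical encoder, one such recurrence would run over the words of each sentence to produce sentence vectors, and a second over those sentence vectors to produce the narrative representation.

```python
import torch
import torch.nn as nn

class GRUCellEq1to4(nn.Module):
    """GRU recurrence exactly as in Eqs. (1)-(4): update gate z_t, reset gate r_t,
    candidate state h_hat_t, and new state h_t."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.W_z = nn.Linear(input_dim, hidden_dim)               # W_z x_t + b_z
        self.U_z = nn.Linear(hidden_dim, hidden_dim, bias=False)  # U_z h_{t-1}
        self.W_r = nn.Linear(input_dim, hidden_dim)               # W_r x_t + b_r
        self.U_r = nn.Linear(hidden_dim, hidden_dim, bias=False)  # U_r h_{t-1}
        self.W_h = nn.Linear(input_dim, hidden_dim)               # W_h x_t + b_h
        self.U_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # U_h h_{t-1}

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        z_t = torch.sigmoid(self.W_z(x_t) + self.U_z(h_prev))       # Eq. (2)
        r_t = torch.sigmoid(self.W_r(x_t) + self.U_r(h_prev))       # Eq. (4)
        h_hat = torch.tanh(self.W_h(x_t) + r_t * self.U_h(h_prev))  # Eq. (3)
        return (1.0 - z_t) * h_prev + z_t * h_hat                   # Eq. (1)

# Running the cell over the word embeddings of one sentence (batch of 2, 10 words, 300-d inputs);
# the final hidden state could serve as a simple sentence representation in the word-level layer.
cell = GRUCellEq1to4(input_dim=300, hidden_dim=256)
x = torch.randn(2, 10, 300)
h = torch.zeros(2, 256)
for t in range(x.size(1)):
    h = cell(x[:, t, :], h)
print(h.shape)  # torch.Size([2, 256])
```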