136x Filetype PDF File size 0.70 MB Source: aclanthology.lst.uni-saarland.de
Point to the Expression: Solving Algebraic Word Problems using the Expression-Pointer Transformer Model BugeunKim KyungSeoKi DonggeonLee GahgeneGweon Department of Transdisciplinary Studies, Seoul National University, Seoul, Republic of Korea {cd4209,kskee88,lsw5835,ggweon}@snu.ac.kr Abstract Problem One number is eight more than Solving algebraic word problems has recently twice another and their sum is 20. emerged as an important natural language pro- Whataretheir numbers? cessing task. To solve algebraic word prob- Numbers 1(‘one’), 8(‘eight’), 2(‘twice’), 20. lems, recent studies suggested neural mod- Equations x −2x =8,x +x =20 0 1 0 1 els that generate solution equations by using Answers (16,4) ‘Op (operator/operand)’ tokens as a unit of input/output. However, such a neural model Table 1: A sample algebraic word problem suffered two issues: expression fragmentation and operand-context separation. To address each of these two issues, we propose a pure et al., 2018; Amini et al., 2019; Chiang and Chen, neuralmodel,Expression-PointerTransformer 2019; Wang et al., 2019). However, suggested (EPT), which uses (1) ‘Expression’ token and neural models showed a fairly large performance (2) operand-context pointers when generat- ing solution equations. The performance of gap compared to existing state-of-the-art models the EPT model is tested on three datasets: based on hand-crafted features in popular algebraic ALG514, DRAW-1K, and MAWPS. Com- wordproblemdatasets,suchasALG514(44.5%for pared to the state-of-the-art (SoTA) models, pureneuralmodelvs. 83.0%forusinghand-crafted the EPT model achieved a comparable perfor- features) (Huangetal.,2018;UpadhyayandChang, mance accuracy in each of the three datasets; 2016). To address the large performance gap in 81.3% on ALG514, 59.5% on DRAW-1K, this study, we propose a larger unit of input/output and 84.5% on MAWPS. The contribution of this paper is two-fold; (1) We propose a (I/O) token called “Expressions” for a pure neural pure neural model, EPT, which can address model. Figure1illustratesconventionallyused“Op the expression fragmentation and the operand- (operator/operands)” versus our newly proposed context separation. (2) The fully automatic “Expression” token. EPT model, which does not use hand-crafted To improve the performance of pure neural features, yields comparable performance to models that can solve algebraic word problems, we existing models using hand-crafted features, and achieves better performance than existing identified two issues that can be addressed using pure neural models by at most 40%. Expression tokens, which are shown in Figure 1 Introduction 1: (1) expression fragmentation and (2) operand- context separation. First, the expression fragmen- Solving algebraic word problems has recently tation issue is a segmentation of an expression become an important research task in that auto- tree, which represents a computational structure matically generating solution equations requires of equations that are used to generate a solution. understanding natural language. Table 1 shows This issue arises when Op, rather than the whole a sample algebraic word problem, along with expression tree, is used as an input/output unit corresponding solution equations that are used of a problem-solving model. For example, as to generate answers for the problem. To solve shown in Figure 1 (a), using Op tokens as an such problems with deep learning technology, input to a problem-solving model disassembles a researchers recently suggested neural models that tree structure into operators (“×”) and operands generate solution equations automatically (Huang (“x1” and “2”). Meanwhile, we propose using the 3768 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3768–3779, c November16–20,2020. 2020Association for Computational Linguistics Figure 1: Illustration using the word problem in Table 1 for the (a) expression fragmentation issue, (b) operand- context separation issue, and (c) our solution for these two issues. “Expression”(×(x1,2))token,whichcanexplicitly two studies are presented. Section 5.1 presents a capture a tree structure as a whole, as shown in performancecomparisonbetweenEPTandexisting Figure 1 (c). SoTA models. Section 5.2 presents an ablation Thesecondissue of operand-context separation study examining the effects of Expression tokens is the disconnection between an operand and and applying operand-context pointers. Finally, in a number that is associated with the operand. Section 6, a conclusion is presented with possible This issue arises when a problem-solving model future directions for our work. substitutes a number stated in an algebraic word problem into an abstract symbol for generalization. 2 Related work AsshowninFigure1(b),whenusinganOptoken, the number 8 is changed into an abstract symbol Our goal is to design a pure neural model that ‘N1’. Meanwhile, when using an Expression token, generates equations using ‘Expression’ tokens to the number 8 is not transformed into a symbol. solve algebraic word problems. Early attempts Rather a pointer is made to the location where the for solving algebraic word problems noted the number8occurredinanalgebraic word problem. importance of Expressions in building models with Therefore, using such an “operand-context pointer” hand-crafted features (Kushman et al., 2014; Roy enables a model to access contextual information et al., 2015; Roy and Roth, 2015; Zhou et al., 2015; about the number directly, as shown in Figure 1 (c); Upadhyay et al., 2016). However, recent neural thus, the operand-context separation issue can be models have only utilized ‘Op (operator/operand)’ addressed. tokens (Wang et al., 2017; Amini et al., 2019; In this paper, we propose a pure neural model Chiang and Chen, 2019; Huang et al., 2018; Wang called Expression-Pointer Transformer (EPT) to et al., 2019), resulting in two issues: (1) the address the two issues above. The contribution of expression fragmentation issue and (2) the operand- this paper is two-fold; context separation issue. In the remaining section, wepresent existing methods for tackling each of 1. Wepropose a pure neural model, Expression- these two issues. PointerTransformer(EPT),whichcanaddress To address the expression fragmentation issue, the expression fragmentation and operand- researchers tried to reflect relational information context separation issues. between operators and operands either by using a 2. The EPT model is the first pure neural model two-step procedure or a single step with sequence- that showed comparable accuracy to the exist- to-sequence models. Earlier attempts predicted ing state-of-the-art models, which used hand- operators and their operands by using a two-step crafted features. Compared to the state-of- procedure. Such early models selected operators the-art pure neural models, the EPT achieves first by classifying a predefined template (Kushman better performance by about 40%. et al., 2014; Zhou et al., 2015; Upadhyay et al., 2016), then in the second step, operands were ap- In the rest of the paper, we introduce existing plied to the template selected in the first step. Other approaches to solve algebraic word problems models selected operands first before constructing in Section 2. Next, Section 3 introduces our expression trees with operators in the second step proposed model, EPT, and Section 4 reports the (Royet al., 2015; Roy and Roth, 2015). However, experimental settings. Then in Section 5, results of such two-step procedures in these early attempts 3769 Input Output Expression token Meaning position index Operator (f ) Operand 0 (a ) Operand 1 (a ) i i0 i1 0 - BEGIN (Start an equation) 1 R VAR (Generate variable x ) 0 0 2 R VAR (Generate variable x ) 1 1 3 R × 2 R 2x 2 1 1 4 R − R R x −2x 3 0 2 0 1 5 R = R 8 x −2x =8 4 3 0 1 6 R + R R x +x 5 0 1 0 1 7 R = R 20 x +x =20 6 5 0 1 - R7 END (Gather all equations) Table 2: The Expression token sequence for x − 2x = 8 and x +x = 20 0 1 0 1 can be performed via a single-step procedure with Amini et al., 2019). For example, Huang et al. neural models. Specifically, recent attempts have (2018) used a pointer-generator network that can utilized sequence-to-sequence (seq2seq) models as point to the context of a number in a given math a single-step procedure to learn the implicit rela- problem. Although Huang’s model can address the tionship between operators and operands (Amini operand-context separation issue using pointers, et al., 2019; Chiang and Chen, 2019; Wang et al., their pure neural model did not yield a comparable 2019). For example, to capture the operator- performance to the state-of-the-art model using operand relationship, Chiang and Chen (2019) hand-crafted features (44.5% vs. 83.0%). In this constructed a seq2seq model that used push/pop paper, we propose that by including additional actions on a stack for generating operator/operand pointers that utilize the contextual information tokens. Similarly, Amini et al. (2019) built a of operands and neighboring Expression tokens, seq2seq model to generate an operator token right performance of pure neural models can improve. after producing required operand tokens. However, 3 EPT:Expression-Pointer Transformer although these seq2seq approaches consider rela- tional information of operands when generating Figure 2 shows the proposed Expression-Pointer operators, the approach still does not address Transformer (EPT)1 model, which adopts the the problem of lacking relation information of encoder-decoder architecture of a Transformer operators when generating operands. On the other model (Vaswani et al., 2017). The EPT utilizes hand, by using Expression token, our model can the ALBERTmodel(Lanetal.,2019),apretrained consider relational information when generating language model, as the encoder. The encoder input both operator and operands. is tokenized words of the given word problem, and Secondly, there were efforts to address the encoderoutputistheencoder’shidden-statevectors operand-context separation issue. To utilize contex- that denote numeric contexts of the given problem. tual information of an operand token, researchers After obtaining the encoder’s hidden-state vec- built hand-craftedfeaturesthatcapturethesemantic tors from the ALBERT encoder, the transformer content of a word, such as the unit of a given num- decoder generates ‘Expression’ tokens. The two ber (Roy and Roth, 2015; Koncel-Kedziorski et al., decoder inputs are Expression tokens and the 2015; Zhou et al., 2015; Upadhyay et al., 2016; ALBERT encoder’s hidden-state vectors, which Roy and Roth, 2017) or dependency relationship are used as memories. For the given example between numbers (Kushman et al., 2014; Zhou problem, the input is a list of 8 Expression tokens et al., 2015; Upadhyay et al., 2016). However, shown in Table 2. We included three special devising hand-crafted input features was time- commands in the list: VAR (generate a variable), consuming and required domain expertise. There- BEGIN (start an equation), and END (gather all fore, recent approaches have employed distributed equations). Following the order specified in the list representations and neural models to learn numeric of Table 2, the EPT receives one input Expression context of operands automatically (Wang et al., 1The code is available on https://github.com/ 2017; Huang et al., 2018; Chiang and Chen, 2019; snucclab/ept. 3770 Figure 2: The architecture of Expression-Pointer Transformer (EPT) where two ideas applied: (1) Expression token and (2) operand-context pointer. at a time. For the ith Expression input, the model The embedding vector a , which represents ij computes an input vector vi. The EPT’s decoder the jth operand of ith Expression, is calculated then transforms this input vector to a decoder’s differently according to the operand aij’s source. hidden-state vector d . Finally, the EPT predicts Toreflectcontextualinformationofoperands, three i the next Expression token by generating the next possible sources are utilized: problem-dependent operator and operands simultaneously. numbers, problem-independent constants, and the To produce ‘Expression’ tokens, two compo- result of prior Expression tokens. First, problem- nents are modified from the vanilla Transformer: dependent numbers are numbers provided in an input vector and output layer. In the following algebraic problem (e.g., ‘20’ in Table 1). To subsections, we explain the two components. compute a of a number, we reuse the encoder’s ij hidden-state vectors corresponding to such number 3.1 Input vector of EPT’s decoder tokens as follows: The input vector vi of ith Expression token is a =LN c u +e , (3) obtained by combining operator embedding f and ij a a num aij i operand embedding a as follows: where u denotes a vector representing the source, ij and e ∗is the encoder’s hidden-state vector cor- aij v =FF (Concat(f ,a ,a ,···,a )), (1) 2 i in i i1 i2 ip responding to the number a . Second, problem- ij independent constants are predefined numbers that where FF indicates a feed-forward linear layer, are not stated in the problem (e.g., 100 is often used ∗ and Concat(·) means concatenation of all vectors for percentiles). To compute aij of a constant, we inside the parentheses. All the vectors, including use a look-up table Ec as follows: v , f , and a , have the same dimension D. For- i i ij a =LN (c u +E(a )). (4) mulae for computing the two types of embedding ij a a const c ij vectors, f and a are stated in the next paragraph. i ij Note that LN , c are shared across different For the operator token f of ith Expression, the a a i sources. Third, the result of the prior Expression EPTcomputestheoperator embedding vector fi as token is an Expression generated before the ith in Vaswani et al. (2017)’s setting: Expression (e.g., R ). To compute a of a result, 0 ij 3 weutilize the positional encoding as follows : f =LN (c E (f )+PE(i)), (2) i f f f i a =LN (c u +PE(k)), (5) ij a a expr where E∗(·) indicates a look-up table for embed- 2Whentwoormoretokensformanumberintheproblem, ding vectors, c denotes a scalar parameter, and weaveraged all related hidden-state vectors. ∗ 3Since we want to sustain simultaneous decoding, which LN (·) and PE(·) represent layer normalization ∗ is one of the strengths in the Transformer, we use PE(k) for (Ba et al., 2016) and positional encoding (Vaswani the kth prior Expression, although it is possible to use decoder et al., 2017), respectively. hidden state dk. 3771
no reviews yet
Please Login to review.