Neural Feature Search: A Neural Architecture for Automated Feature Engineering

Xiangning Chen^1*, Qingwei Lin^2*,**, Chuan Luo^2*, Xudong Li^3, Hongyu Zhang^4, Yong Xu^2, Yingnong Dang^5, Kaixin Sui^2, Xu Zhang^6, Bo Qiao^2, Weiyi Zhang^2, Wei Wu^7, Murali Chintalapati^5 and Dongmei Zhang^2

1 Tsinghua University, China
2 Microsoft Research, China
3 University of California, Los Angeles, United States
4 The University of Newcastle, Australia
5 Microsoft Azure, United States
6 Nanjing University, China
7 University of Technology Sydney, Australia

cxn15@mails.tsinghua.edu.cn, {qlin, chuan.luo, yox, yidang, kasui, boqiao, v-weiyzh, muralic, dongmeiz}@microsoft.com, lixudong@ucla.edu, hongyu.zhang@newcastle.edu.au, zhangxu037@smail.nju.edu.cn, william.third.wu@gmail.com
Abstract—Feature engineering is a crucial step for developing effective machine learning models. Traditionally, feature engineering is performed manually, which requires much domain knowledge and is time-consuming. In recent years, many automated feature engineering methods have been proposed. These methods improve the accuracy of a machine learning model by automatically transforming the original features into a set of new features. However, existing methods either lack the ability to perform high-order transformations or suffer from the feature space explosion problem. In this paper, we present Neural Feature Search (NFS), a novel neural architecture for automated feature engineering. We utilize a recurrent neural network based controller to transform each raw feature through a series of transformation functions. The controller is trained through reinforcement learning to maximize the expected performance of the machine learning algorithm. Extensive experiments on public datasets illustrate that our neural architecture is effective and outperforms the existing state-of-the-art automated feature engineering methods. Our architecture can efficiently capture potentially valuable high-order transformations and mitigate the feature explosion problem.

Keywords—Feature Engineering, Automated Feature Engineering, Neural Architecture

I. INTRODUCTION

Feature engineering is a crucial step in building an effective machine learning model. However, traditional feature engineering is largely a manual process, which is time-consuming and requires much domain knowledge. The traditional practice is also prone to bias and error, as there is a lack of standard procedures for performing feature engineering.

In recent years, many automated feature engineering methods have been proposed. These methods apply transformation functions such as arithmetic operators to the raw features and generate new, engineered features. Generally, automated feature engineering can be realized by expansion-reduction, which first expands the feature set and then performs feature selection. In the expansion phase, all possible or randomly sampled transformation functions are applied to the raw features [1]-[3]. In the reduction phase, features are selected based on the improvement in model performance. However, the possible number of feature transformations is unbounded, since transformation functions can be composed and applied recursively to features generated by previous transformations. The space of constructed features can therefore grow exponentially: for example, the number of features constructed by division over n raw features is O(n^2), and it becomes O(n^4) after another transformation function is applied to the newly generated features. As a result, expansion-reduction suffers from the feature explosion problem. Recently, [4] builds a transformation graph and employs Q-learning [5] to search for feature transformations; the machine learning model built on the transformed features can achieve higher performance. However, the bottom layer of the graph still contains an exponential number of features, despite the introduction of a feature selection operator. LFE [6] avoids the problem by recommending a few promising transformation functions, but it has a performance bottleneck because it does not consider compositions of transformations, i.e., high-order transformations, which our experiments demonstrate to be valuable.

This paper presents Neural Feature Search (NFS), a neural architecture for automated feature engineering. In our architecture, we address the feature explosion problem by employing an effective search algorithm to explore the most promising transformation rule (a series of transformation functions) for each raw feature. That is, we perform only the most valuable transformations rather than expansion and reduction. We support a number of transformation functions such as logarithm, square root, and multiply. We also support two special functions, delete and terminate, which perform feature selection and terminate the transformation process, respectively. In this way, the process of finding optimal features for the whole dataset can be divided into finding a series of transformation functions for each raw feature. We design a Recurrent Neural Network (RNN) based controller to generate such a series. The controller is trained through reinforcement learning to maximize the expected performance of the machine learning algorithm.

* Xiangning Chen, Qingwei Lin and Chuan Luo contributed equally to this work and share the first authorship of this work.
** Qingwei Lin is the corresponding author of this work.
As an example, consider the problem of predicting heart disease based on the features weight and height. Although the raw features could be useful indicators for this problem, a more effective feature is Body Mass Index (BMI) [7], which is defined as BMI = weight / height^2. Such a new feature is actually the result of a series of transformation functions [square, reciprocal, ×weight] applied to the raw feature height: reciprocal(square(height)) × weight. NFS is able to generate this third-order new feature by composing the raw features through multiple functions.
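To make the notion of a transformation rule concrete, the following is a minimal sketch in Python (our own illustration; the function names are hypothetical and not part of the paper) showing how this rule reconstructs BMI from the raw features:

    import numpy as np

    def square(x):
        return x ** 2

    def reciprocal(x):
        return 1.0 / x              # assumes non-zero input, as with height

    def make_times(other):          # a binary transformation (multiply) unified
        return lambda x: x * other  # into a unary one by fixing its second operand

    height = np.array([1.70, 1.80, 1.65])   # meters (toy data)
    weight = np.array([65.0, 90.0, 55.0])   # kilograms (toy data)

    rule = [square, reciprocal, make_times(weight)]   # [square, reciprocal, x weight]
    feature = height
    for a in rule:                  # compose the functions in order
        feature = a(feature)
    # feature now equals weight / height**2, i.e., BMI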
To evaluate the proposed approach, we have conducted extensive experiments on 41 public datasets, which have different numbers of features and instances and cover both classification and regression problems. The results show that NFS is effective and outperforms the existing state-of-the-art automated feature engineering methods: NFS achieves the best performance on 37 out of 41 datasets. The experimental results also demonstrate that the performance of our approach continues to increase as the order of transformations increases.
Our main contributions can be summarized as follows:
• We propose a novel neural architecture for automated feature engineering. To the best of our knowledge, it is the first work that addresses the feature explosion problem and supports high-order transformations through deep learning.
• We have conducted extensive experiments on public datasets. The experiments show that our approach significantly outperforms the existing state-of-the-art methods for automated feature engineering.
The rest of this paper is organized as follows. In Section II, we review related work on automated feature engineering and neural network controllers. In Section III, we propose NFS and use it to find the best feature transformations. In Section IV, we perform extensive experiments to show the effectiveness of NFS. Section V briefly introduces a success story of our work, and Section VI concludes this paper.
II. RELATED WORK

In this section, we briefly review related work on automated feature engineering and on neural network controllers.

A. Automated Feature Engineering

Many automated feature engineering methods are based on domain knowledge. For example, the work presented in [8] automates feature engineering by constructing features from semantic knowledge bases. Similarly, [9] utilizes the explicitly captured relations and dependencies modeled in a knowledge graph to perform feature selection and construction. These knowledge-based methods imitate traditional manual feature engineering and rely heavily on domain knowledge.

Recently, the general framework of expansion-reduction has been proposed for automated feature engineering. It first expands the feature space and then performs feature selection to reduce redundancy. FEADIS [1] randomly selects raw features and mathematical functions to generate new features. The Data Science Machine (DSM) [10] includes a Deep Feature Synthesis component, which synthesizes new features by recursively enumerating all possible combinations; DSM can generate a large number of features per entity. To reduce the size of the feature space, it performs feature selection via truncated SVD transformation. [3] adopts a similar approach and develops a system called One Button Machine for automating feature engineering in relational databases. Instead of utilizing transformation functions, AutoLearn [11] generates and selects new features by applying regularized regression models to feature pairs. Being data-driven, it requires no domain knowledge. However, learning a regression model on every feature pair is time-consuming. Also, as mentioned above, expansion-reduction based approaches cannot handle the feature space explosion problem.

Like the expansion-reduction approach, ExploreKit [2] performs all transformation functions on the whole dataset. It then chooses the most promising features by ranking them through supervised meta-learning. This method also suffers from the feature space explosion problem and is limited by the effectiveness of meta-learning, which demands a massive amount of training data. Due to the complex nature of this method, it does not allow for functional compositions. Another notable work is Learning Feature Engineering (LFE) [6], which recommends the most promising transformation function for features via offline meta-learning. This method is computationally fast but is constrained to classification problems. Most importantly, LFE does not take high-order transformations into account and thus is not able to generate complex features.

There are also search-based feature engineering methods. For example, [12] employs a genetic algorithm to determine a proper single transformation function for a given dataset. Cognito [13] introduces the notion of a tree-like exploration of the transform space and presents heuristic search strategies such as depth-first traversal. [4], a work related to Cognito, is based on performance-driven exploration of a transformation graph. It utilizes Q-learning to systematically and compactly enumerate the space of given options. It is the current state of the art and achieves better performance than other related approaches. However, it still suffers from the feature explosion problem, since the number of features grows exponentially in the hierarchically structured transformation graph, especially at the bottom of the graph.

Deep neural networks have the capability of performing complex feature transformations as well. However, deep learning methods require a massive amount of training data and face the overfitting problem when it comes to real-world small and unbalanced datasets. Most importantly, the learned implicit feature transformations lack interpretability. In contrast, NFS outputs human-readable features as insights into the data via explicit transformation functions.

Our approach explores the most promising transformation rule for each raw feature, which is time-efficient and eliminates the feature space explosion problem. The discovered transformation rule can be very complex, since new features can be generated through a series of functional compositions.

B. Neural Network Controller
NAS [14] proposes a general framework for searching for the architecture of a deep neural network; it employs a recurrent neural network controller to generate a string. Training the network specified by the string results in an accuracy on a validation set. Using this accuracy as the reward signal, NAS utilizes policy gradient to update the controller. Apart from searching for neural network architectures, Searching for Activation Functions [15] employs a similar controller to search for an activation function, named Swish. A Hierarchical Model for Device Placement [16] uses a hierarchical controller to search for optimal placements of computational graphs onto hardware devices. In this paper, we apply a neural controller to generate a transformation rule (i.e., a series of transformation functions) for each raw feature.
III. PROPOSED APPROACH

In this section, we propose and introduce Neural Feature Search (NFS), a novel neural architecture for automated feature engineering.

A. Problem Formulation

Given a dataset D = ⟨F, y⟩ that consists of raw features F = {f_1, f_2, ..., f_n} and a target vector y, we use V_L^E(F, y) to denote the performance of a machine learning algorithm L (e.g., Random Forest or Logistic Regression) constructed on F and measured by an evaluation metric E (e.g., F1-score or mean squared error).
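As an illustration of this definition, the following is a minimal sketch (assuming scikit-learn; the paper does not prescribe a particular evaluation protocol) that computes V_L^E(F, y) with L = Random Forest and E = F1-score via cross-validation:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def evaluate(F, y):
        # V_L^E(F, y): cross-validated F1-score of a random forest on feature set F
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        return cross_val_score(model, F, y, cv=5, scoring="f1").mean()

    X, y = load_breast_cancer(return_X_y=True)   # a binary classification dataset
    print(evaluate(X, y))                        # performance of the raw feature set F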
In addition, we transform the raw feature set F into F̂ through a set of transformation functions {A_1, A_2, ..., A_n}. As described in [4], [6], there are two categories of mathematical transformation functions: unary functions, which are performed on a single feature (e.g., logarithm and square), and binary functions, which are performed on a pair of features (e.g., plus and minus). Note that transformations over three or more features need no separate treatment, since they are equivalent to applying binary transformations recursively.

Generally, feature transformation should be able to perform the following two activities: (1) generating new features by transforming the raw features, where transformation functions can be composed to form a higher-order transformation (e.g., BMI = weight / height^2, a third-order transformation, is formed by transforming the feature height through a square function, then a reciprocal function, and subsequently a multiply function with the feature weight); and (2) deleting existing features through feature selection.

Formally, the goal of automated feature engineering is to search for the optimal transformed feature set F* at which the performance is maximized: F* = argmax_F V_L^E(F, y).
B. An Overview of the Proposed Approach

In this paper, we propose Neural Feature Search (NFS), a neural architecture for automated feature engineering. We employ a Recurrent Neural Network (RNN) to generate a transformation rule A (i.e., a series of transformation functions with a maximum order of T, such as reciprocal(square(height)) × weight) for each raw feature within T time steps. That is, rather than performing transformations on the whole dataset or on sampled features, we aim to find the most promising transformation rule for every feature. One big advantage of our approach is that, rather than adopting a hierarchical structure, which is likely to introduce the feature space explosion problem, our architecture performs the most promising transformation for each feature in parallel through an efficient search algorithm.

Figure 1 gives an overview of the NFS architecture. For a dataset with n raw features, we build n RNNs as our controller to generate n transformation rules {A_1, A_2, ..., A_n} in parallel, where A_i corresponds to raw feature f_i. Guided by the transformation rules, we transform the raw feature set F into F̂ and compute its performance V_L^E(F̂, y) with respect to a learning algorithm L and an evaluation metric E. Next, we utilize policy gradient [17] to train our controller via the reward signal V_L^E(F̂, y) - V_L^E(F, y), which represents the performance improvement brought by those transformation rules.

Fig. 1. An Overview of Neural Feature Search

Figure 2 illustrates how each RNN predicts a transformation rule. A transformation function is predicted at each time step, and together the predictions form a transformation rule. Every prediction is carried out by a softmax classifier and then fed into the next time step as input.

Fig. 2. The Generation of a Transformation Rule by an RNN. (Each RNN predicts transformation function a_t at time step t; the series of transformations [a_1, a_2, ..., a_T] forms a transformation rule. Every prediction is carried out by a softmax classifier and then fed into the next time step.)

In the following subsections, we first show how our transformation rules transform raw features. We then describe our RNN-based controller, which generates the transformation rules. Finally, we show how the proposed NFS architecture can be effectively trained with policy gradient to maximize the expected performance of the learning algorithm.
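This sampling-and-update loop can be sketched as follows (in PyTorch; class and variable names are our own, and the paper does not publish controller code): the controller samples one transformation function per time step, feeds each prediction into the next step, and is updated with a REINFORCE-style policy gradient whose reward is the improvement V_L^E(F̂, y) - V_L^E(F, y).

    import torch
    import torch.nn as nn

    class RuleController(nn.Module):
        # Samples a transformation rule [a_1, ..., a_T] for one raw feature.
        def __init__(self, n_actions, hidden=64, T=5):
            super().__init__()
            self.T = T
            self.embed = nn.Embedding(n_actions + 1, hidden)   # index 0 is a <start> token
            self.cell = nn.LSTMCell(hidden, hidden)
            self.head = nn.Linear(hidden, n_actions)           # softmax classifier

        def sample(self):
            h = torch.zeros(1, self.cell.hidden_size)
            c = torch.zeros(1, self.cell.hidden_size)
            inp = torch.zeros(1, dtype=torch.long)             # <start>
            actions, log_probs = [], []
            for _ in range(self.T):
                h, c = self.cell(self.embed(inp), (h, c))
                dist = torch.distributions.Categorical(logits=self.head(h))
                a = dist.sample()                              # transformation a_t
                actions.append(a.item())
                log_probs.append(dist.log_prob(a))
                inp = a + 1                                    # feed the prediction forward
            return actions, torch.stack(log_probs).sum()

    controller = RuleController(n_actions=10)   # e.g., log, sqrt, ..., delete, terminate
    optimizer = torch.optim.Adam(controller.parameters(), lr=1e-3)

    rule, log_prob = controller.sample()
    reward = 0.03                  # stand-in for V(F_hat, y) - V(F, y)
    loss = -reward * log_prob      # REINFORCE: raise probability of rewarded rules
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()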
C. Feature Reconstruction through High-order Transformations

As mentioned, there are two types of transformations: unary and binary. Our approach focuses on finding the most promising transformation rule A for each feature. However, binary transformations take two features as input and thus cannot be performed on a single feature. Therefore, we unify each binary transformation by converting it into a set of unary transformations, one per individual feature f_i. For example, we convert the binary transformation plus into a set of unary transformations addition(f_i), each of which adds the fixed feature f_i to its input.
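A minimal sketch of this unification (our own Python illustration; the helper names are hypothetical): one binary function over feature pairs becomes n unary functions, each with its second operand fixed to one raw feature.

    import numpy as np

    def plus(x, y):               # a binary transformation
        return x + y

    def unify(binary_fn, features):
        # Turn one binary transformation into n unary ones, each with its
        # second argument fixed to a raw feature (late binding guarded by f=f).
        return [lambda x, f=f: binary_fn(x, f) for f in features]

    f1 = np.array([1.0, 2.0])
    f2 = np.array([3.0, 4.0])
    unary_versions = unify(plus, [f1, f2])   # [addition(f_1), addition(f_2)]
    print(unary_versions[0](f2))             # f_2 + f_1 -> [4. 6.]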
We have two special unary transformations: delete and terminate. delete removes the corresponding feature, and terminate indicates a stop command for the current transformation. These two transformations enable our architecture to eliminate features and to stop at the most suitable order of transformation (i.e., the number of feature compositions), respectively.

For every raw feature, we utilize a series of transformation functions A_{1:T} = [a_1, a_2, ..., a_T] to represent a transformation rule, where the length of the series, T, is the maximum order of transformations it can take. Each element in the series is a unary transformation or a unified binary transformation. For a dataset with n raw features, we have n transformation rules, and the transformation process for each raw feature is independent of the others, since the correlation between features is already captured by unifying every binary transformation. As a result, we can perform the transformation of each feature in parallel.

Algorithm 1 Dataset Transformation
Input: Dataset D = {f_1, f_2, ..., f_n} with n raw features; transformation rules A = {A_1, A_2, ..., A_n}, where each A_i = [a_1, a_2, ..., a_T];
Output: New dataset D̂ after the transformation;
 1: for each f_i ∈ D do
 2:   tailor A_i to Â_i = [a_1, a_2, ..., a_t], where a_{t+1} is the first terminate transformation in A_i;
 3:   if t == 0 then
 4:     continue;
 5:   else
 6:     if ∃ a ∈ Â_i such that a = delete then
 7:       delete f_i from D;
 8:     else
 9:       f̂_i = a_t(a_{t-1}(... (a_2(a_1(f_i)))));
10:       add f̂_i to D;
11:     end if
12:   end if
13: end for

Guided by all transformation rules, NFS transforms the raw dataset as shown in Algorithm 1. In order to reduce the feature space, NFS first conducts feature selection on the raw dataset, selecting β features according to feature importance computed via random forest [18] if the number of features in the raw dataset is greater than β. Given a raw feature f_i and its corresponding transformation rule A_i, NFS first tailors A_i to Â_i by removing the first terminate transformation and all actions after it (Â_i = A_i if terminate does not occur) (line 2). This enables NFS to automatically find the proper length of a transformation rule, which is the order of the transformation. If the length of Â_i is 0, which means that the first element of A_i is terminate, then no feature is generated or eliminated (line 4). If Â_i has length t > 0 and there exists a delete element in Â_i, we delete the raw feature f_i (line 7). Otherwise, we generate a new feature f̂_i = a_t(a_{t-1}(... (a_2(a_1(f_i))))) (line 9), where t is the length of Â_i, and finally add f̂_i to the dataset (line 10).
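For concreteness, here is a compact Python sketch of Algorithm 1 (our own illustration; we represent each rule as a list of unary callables plus the string markers 'delete' and 'terminate', a representation the paper leaves unspecified):

    import numpy as np

    DELETE, TERMINATE = "delete", "terminate"

    def transform_dataset(D, rules):
        # Algorithm 1: apply one tailored transformation rule per raw feature.
        # D maps feature names to arrays; rules maps names to action lists.
        D_hat = dict(D)
        for name, A in rules.items():
            # Line 2: keep only the actions before the first terminate.
            A_hat = A[:A.index(TERMINATE)] if TERMINATE in A else A
            if len(A_hat) == 0:          # Line 4: first action was terminate
                continue
            if DELETE in A_hat:          # Line 7: feature selection
                del D_hat[name]
            else:                        # Lines 9-10: build a high-order feature
                f = D[name]
                for a in A_hat:
                    f = a(f)
                D_hat[name + "_new"] = f
        return D_hat

    # The BMI rule from Section I applied to the feature 'height':
    weight = np.array([65.0, 90.0])
    D = {"height": np.array([1.70, 1.80]), "weight": weight}
    rules = {"height": [np.square, np.reciprocal,
                        lambda x: x * weight, TERMINATE],
             "weight": [TERMINATE]}
    print(transform_dataset(D, rules)["height_new"])   # = weight / height**2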
NFS traverses and transforms every feature in parallel, which facilitates high-order transformations. We turn each binary transformation into n unary transformations, thereby reducing the search space from (n choose 2)^T = O(n^{2T}) to n × n^T = O(n^{T+1}), where n is the number of raw features and T is the maximum transformation order. For example, with n = 10 raw features and order T = 5, the space shrinks from roughly 45^5 ≈ 1.8 × 10^8 candidate rule combinations to 10^6. Within the reduced search space, our architecture explores the most promising transformation for each raw feature. Therefore, NFS does not suffer from the feature explosion problem while still capturing high-order transformations. Our architecture also supports feature selection through the special action delete. Furthermore, the terminate function enables NFS to automatically stop at a certain time step, so transformation rules of variable lengths can be generated.