Neural Feature Search: A Neural Architecture for Automated Feature Engineering

Xiangning Chen^1*, Qingwei Lin^2*,**, Chuan Luo^2*, Xudong Li^3, Hongyu Zhang^4,
Yong Xu^2, Yingnong Dang^5, Kaixin Sui^2, Xu Zhang^6, Bo Qiao^2,
Weiyi Zhang^2, Wei Wu^7, Murali Chintalapati^5 and Dongmei Zhang^2

^1 Tsinghua University, China
^2 Microsoft Research, China
^3 University of California, Los Angeles, United States
^4 The University of Newcastle, Australia
^5 Microsoft Azure, United States
^6 Nanjing University, China
^7 University of Technology Sydney, Australia

cxn15@mails.tsinghua.edu.cn, {qlin, chuan.luo, yox, yidang, kasui, boqiao, v-weiyzh, muralic, dongmeiz}@microsoft.com, lixudong@ucla.edu, hongyu.zhang@newcastle.edu.au, zhangxu037@smail.nju.edu.cn, william.third.wu@gmail.com

* Xiangning Chen, Qingwei Lin and Chuan Luo contributed equally to this work and share the first authorship of this work.
** Qingwei Lin is the corresponding author of this work.

Abstract—Feature engineering is a crucial step for developing effective machine learning models. Traditionally, feature engineering is performed manually, which requires much domain knowledge and is time-consuming. In recent years, many automated feature engineering methods have been proposed. These methods improve the accuracy of a machine learning model by automatically transforming the original features into a set of new features. However, existing methods either lack the ability to perform high-order transformations or suffer from the feature space explosion problem. In this paper, we present Neural Feature Search (NFS), a novel neural architecture for automated feature engineering. We utilize a recurrent neural network based controller to transform each raw feature through a series of transformation functions. The controller is trained through reinforcement learning to maximize the expected performance of the machine learning algorithm. Extensive experiments on public datasets illustrate that our neural architecture is effective and outperforms the existing state-of-the-art automated feature engineering methods. Our architecture can efficiently capture potentially valuable high-order transformations and mitigate the feature explosion problem.

Keywords—Feature Engineering, Automated Feature Engineering, Neural Architecture

I. INTRODUCTION

Feature engineering is a crucial step for building an effective machine learning model. However, traditional feature engineering is largely a manual process, which is time-consuming and requires much domain knowledge. The traditional practice is also prone to bias and error, as there is a lack of standard procedures for performing feature engineering.

In recent years, many automated feature engineering methods have been proposed. These methods apply transformation functions such as arithmetic operators to the raw features and generate new, engineered features. Generally, automated feature engineering can be realized by expansion-reduction, which first expands the feature set and then performs feature selection. In the expansion phase, all possible or randomly sampled transformation functions are applied to the raw features [1]-[3]. In the reduction phase, features are selected based on the improvement in model performance. However, the possible number of feature transformations is unbounded, since transformation functions can be composed and applied recursively to features generated by previous transformations. The space of constructed features can grow exponentially: for example, the number of features constructed by division over n raw features is O(n^2), and it becomes O(n^4) after applying another transformation function to the newly generated features. As a result, expansion-reduction suffers from the feature explosion problem; the sketch below illustrates this growth.
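To make the growth concrete, here is a small counting sketch (our own illustration, not from the paper) showing how recursive application of a single binary transformation such as division inflates the feature set:

# Counting the features produced by recursively applying the binary
# `division` transformation: each ordered pair (f_i, f_j), i != j,
# yields one candidate feature f_i / f_j, i.e. O(n^2) per round.
def count_division_features(num_features: int) -> int:
    return num_features * (num_features - 1)

n = 10
first_round = count_division_features(n)         # 90 candidates
pool = n + first_round                           # originals plus new features
second_round = count_division_features(pool)     # 9900 candidates: O(n^4)
print(first_round, second_round)

With only ten raw features, two rounds of division already yield nearly ten thousand candidates, which is exactly the explosion the expansion-reduction framework faces.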
Recently, [4] builds a transformation graph and employs Q-learning [5] to search for feature transformations; the machine learning model built using the transformed features can achieve higher performance. However, it also has an exponential number of features in the bottom layer of the graph despite the introduction of a feature selection operator. LFE [6] manages to eliminate this problem by recommending a few promising transformation functions. However, LFE has a performance bottleneck since it does not consider compositions of transformations, i.e., high-order transformations, which our experiments demonstrate to be valuable.

This paper presents Neural Feature Search (NFS), a neural architecture for automated feature engineering. In our architecture, we address the feature explosion problem by employing an effective search algorithm to explore the most promising transformation rule (a series of transformation functions) for each raw feature. That is, we only perform the most valuable transformations rather than expansion and reduction. We support a number of transformation functions such as logarithm, square root, and multiply. We also support two special functions, delete and terminate, which perform feature selection and terminate the transformation process, respectively. In this way, the process of finding optimal features for the whole dataset can be divided into finding a series of transformation functions for each raw feature. We design a Recurrent Neural Network (RNN) based controller to generate such a series. The controller is trained through reinforcement learning to maximize the expected performance of the machine learning algorithm.

As an example, consider the problem of predicting heart disease based on the features weight and height. Although the raw features could be useful indicators for this problem, a more effective feature is Body Mass Index (BMI) [7], which is defined as BMI = weight / height^2. Such a new feature is actually a series of transformation functions [square, reciprocal, ×weight] applied to the raw feature height: reciprocal(square(height)) × weight. NFS is able to generate this third-order feature by composing the raw features through multiple functions, as the sketch below shows.
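As a minimal sketch (our own Python illustration; the function names are ours, not the paper's), the BMI feature is reproduced by composing exactly this rule:

import numpy as np

# Unary transformations; multiplying by weight is the unified form of the
# binary multiply transformation, bound to the partner feature weight.
def square(x):
    return x ** 2

def reciprocal(x):
    return 1.0 / x

height = np.array([1.70, 1.85, 1.60])   # metres
weight = np.array([65.0, 90.0, 55.0])   # kilograms

# Transformation rule for height: [square, reciprocal, x weight]
bmi = reciprocal(square(height)) * weight
print(bmi)   # approx. [22.49, 26.30, 21.48]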
To evaluate the proposed approach, we have conducted extensive experiments on 41 public datasets, which have different numbers of features and instances and cover both classification and regression problems. The results show that NFS is effective and outperforms the existing state-of-the-art automated feature engineering methods: NFS achieves the best performance on 37 out of the 41 datasets. The experimental results also demonstrate that the performance of our approach continues to increase as the order of transformations increases.

Our main contributions can be summarized as follows:
• We propose a novel neural architecture for automated feature engineering. To the best of our knowledge, it is the first work that addresses the feature explosion problem and supports high-order transformations through deep learning.
• We have conducted extensive experiments on public datasets. The experiments show that our approach significantly outperforms the existing state-of-the-art methods for automated feature engineering.

The rest of this paper is organized as follows. In Section II, we review the related work on automated feature engineering and neural network controllers. In Section III, we propose NFS and use it to find the best feature transformations. In Section IV, we perform extensive experiments to show the effectiveness of NFS. Section V briefly introduces a success story of our work, and Section VI concludes this paper.

II. RELATED WORK

In this section, we briefly review the related work on automated feature engineering and neural network controllers.

A. Automated Feature Engineering

Many automated feature engineering methods are based on domain knowledge. For example, the work presented in [8] automates feature engineering by constructing features from semantic knowledge bases. Similarly, [9] utilizes the explicitly captured relations and dependencies modeled in a knowledge graph to perform feature selection and construction. These knowledge-based methods imitate traditional manual feature engineering and rely heavily on domain knowledge.

Recently, the general framework of expansion-reduction was proposed for automated feature engineering. It first expands the feature space and then performs feature selection to reduce redundancy. FEADIS [1] randomly selects raw features and mathematical functions to generate new features. The Data Science Machine (DSM) [10] includes a Deep Feature Synthesis component, which synthesizes new features by recursively enumerating all possible combinations. DSM can generate a large number of features per entity; to reduce the size of the feature space, it performs feature selection via truncated SVD transformation. [3] adopts a similar approach and develops a system called One Button Machine for automating feature engineering in relational databases. Instead of utilizing transformation functions, AutoLearn [11] generates and selects new features by applying regularized regression models to feature pairs. Being data-driven, it requires no domain knowledge. However, learning a regression model on every feature pair is time-consuming. Also, as mentioned above, expansion-reduction based approaches cannot handle the feature space explosion problem.

Like the expansion-reduction approach, ExploreKit [2] performs all transformation functions on the whole dataset. It then chooses the most promising features by ranking them through supervised meta-learning. This method also suffers from the feature space explosion problem and is limited by the effectiveness of meta-learning, which demands a massive amount of training data. Due to the complex nature of this method, it does not allow for functional compositions. Another plausible work is Learning Feature Engineering (LFE) [6], which recommends the most promising transformation function for features via offline meta-learning. This method is fast in computation but is constrained to classification problems. Most importantly, LFE does not take high-order transformations into account and thus is not able to generate complex features.

There are also search-based feature engineering methods. For example, [12] employs a genetic algorithm to determine a proper single transformation function for a given dataset. Cognito [13] introduces the notion of a tree-like exploration of the transform space and presents some heuristic search strategies such as depth-first traversal. [4] is related work to Cognito that is based on performance-driven exploration of a transformation graph; it utilizes Q-learning to systematically and compactly enumerate the space of given options. It is the current state of the art and achieves better performance than other related approaches. However, it still suffers from the feature explosion problem, since the number of features grows exponentially in the hierarchically structured transformation graph, especially at the bottom of the graph.

Deep neural networks have the capability of performing complex feature transformations as well. However, deep learning methods require a massive amount of training data and face the overfitting problem when it comes to real-world small and unbalanced datasets. Most importantly, the learned implicit feature transformations lack interpretability. In contrast, NFS outputs human-readable features as insights into the data via explicit transformation functions. Our approach explores the most promising transformation rule for each raw feature, which is time-efficient and eliminates the feature space explosion problem. The discovered transformation rule can be very complex, since new features can be generated through a series of functional compositions.

B. Neural Network Controller

NAS [14] proposes a general framework to search for the architecture of a deep neural network, employing a recurrent neural network controller to generate a string. Training the network specified by the string results in an accuracy on a validation set; using this accuracy as the reward signal, NAS utilizes policy gradient to update the controller. Apart from searching for neural network architectures, Searching for Activation Functions [15] employs a similar controller to search for an activation function, named Swish. A Hierarchical Model for Device Placement [16] uses a hierarchical controller to search for the optimal placement of computational graphs onto hardware devices. In this paper, we apply a neural controller to generate a transformation rule (i.e., a series of transformation functions) for each raw feature.
III. PROPOSED APPROACH

In this section, we propose Neural Feature Search (NFS), a novel neural architecture for automated feature engineering.

A. Problem Formulation

Given a dataset D = ⟨F, y⟩ that consists of raw features F = {f_1, f_2, ..., f_n} with a target vector y, we use V_L^E(F, y) to denote the performance of a machine learning algorithm L (e.g., Random Forest or Logistic Regression) constructed on F and measured by an evaluation metric E (e.g., F1-score or mean squared error).

In addition, we transform the raw feature set F into F̂ through a set of transformation rules {A_1, A_2, ..., A_n}. As described in [4], [6], there are two categories of mathematical transformation functions: unary functions, which are performed on a single feature (e.g., logarithm and square), and binary functions, which are performed on a pair of features (e.g., plus and minus). Note that transformations can also cover three or more features, since this is equivalent to applying binary transformations recursively.

Generally, feature transformation should be able to perform the following two activities: (1) generating new features by transforming the raw features; note that transformation functions can be composed to form a higher-order transformation (e.g., BMI = weight / height^2, a third-order transformation, is formed by transforming the feature height through a square function, then a reciprocal function, and subsequently a multiply function with the feature weight); and (2) deleting existing features through feature selection.

Formally, the goal of automated feature engineering is to search for the optimal transformed feature set F* for which V_L^E(F*, y) is maximized:

F* = argmax_F V_L^E(F, y).
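As a concrete instantiation (our own sketch, assuming scikit-learn; the paper leaves L and E pluggable), V_L^E can be computed with Random Forest as L and cross-validated F1-score as E:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate(F: np.ndarray, y: np.ndarray) -> float:
    # V_L^E(F, y): mean cross-validated F1 of a Random Forest trained on F.
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(model, F, y, cv=5, scoring="f1").mean()

Any other estimator and scoring metric would serve equally well; the search procedure only consumes the resulting scalar score.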
B. An Overview of the Proposed Approach

In this paper, we propose Neural Feature Search (NFS), a neural architecture for automated feature engineering. We employ a Recurrent Neural Network (RNN) to generate a transformation rule A (i.e., a series of transformation functions with a maximum order of T, such as reciprocal(square(height)) × weight) for each raw feature within T time steps. That is, rather than performing transformations on the whole dataset or on sampled features, we aim to find the most promising transformation rule for every feature. One big advantage of our approach is that, rather than adopting a hierarchical structure, which is likely to introduce the feature space explosion problem, our architecture performs the most promising transformation for each feature in parallel through an efficient search algorithm.

Figure 1 gives an overview of the NFS architecture. For a dataset with n raw features, we build n RNNs as our controller to generate n transformation rules {A_1, A_2, ..., A_n} in parallel, where A_i is related to the raw feature f_i. Guided by the transformation rules, we transform the raw feature set F into F̂ and compute its performance V_L^E(F̂, y) with respect to a learning algorithm L and an evaluation metric E. Next, we utilize policy gradient [17] to train our controller via the reward signal V_L^E(F̂, y) - V_L^E(F, y), which represents the performance improvement brought by those transformation rules.

[Fig. 1. An Overview of Neural Feature Search]

Figure 2 illustrates how each RNN predicts a transformation rule. A transformation function is predicted at each time step, and together the predictions form a transformation rule. Every prediction is carried out by a softmax classifier and then fed into the next time step as input.

[Fig. 2. The Generation of a Transformation Rule by an RNN. Each RNN predicts transformation function a_t at time step t. A series of transformations [a_1, a_2, ..., a_T] forms a transformation rule. Every prediction is carried out by a softmax classifier and then fed into the next time step.]

In the following subsections, we will first show how our transformation rules transform raw features. We then describe our RNN-based controller, which generates transformation rules. Finally, we show how the proposed NFS architecture can be effectively trained with policy gradient to maximize the expected performance of the learning algorithm.
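The following sketch shows one plausible rendering of such a controller (our own, in PyTorch; the paper does not fix a framework or hyperparameters, so the layer sizes and token scheme are assumptions). An LSTM cell predicts an action through a softmax at each of T time steps, feeds the sampled action back as the next input, and exposes the summed log-probability needed for the REINFORCE update:

import torch
import torch.nn as nn

class RuleController(nn.Module):
    # Samples a transformation rule [a_1, ..., a_T] for one raw feature.
    def __init__(self, num_actions: int, hidden: int = 64, T: int = 5):
        super().__init__()
        self.embed = nn.Embedding(num_actions + 1, hidden)  # +1: <start> token
        self.cell = nn.LSTMCell(hidden, hidden)
        self.head = nn.Linear(hidden, num_actions)          # softmax classifier
        self.num_actions, self.T = num_actions, T

    def sample(self):
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        action = torch.tensor([self.num_actions])           # <start> token
        actions, log_probs = [], []
        for _ in range(self.T):
            h, c = self.cell(self.embed(action), (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            action = dist.sample()                          # fed to next step
            actions.append(action.item())
            log_probs.append(dist.log_prob(action))
        return actions, torch.stack(log_probs).sum()

# REINFORCE: reward = V(F_hat, y) - V(F, y); loss = -reward * log_prob_sum;
# loss.backward() followed by an optimizer step updates the controller.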
C. Feature Reconstruction through High-order Transformations

As mentioned, there are two types of transformations: unary and binary. Our approach focuses on finding the most promising transformation rule A_i for each feature. However, binary transformations take two features as input and thus cannot be performed on a single feature. Therefore, we unify a binary transformation by converting it into a set of unary transformations, one for each individual feature f_i. For example, for the binary transformation plus, we convert it into a set of unary transformations addition(f_i).

We have two special unary transformations: delete and terminate. delete removes the corresponding feature, and terminate indicates a stop command for the current transformation. These two transformations enable our architecture to eliminate features and to stop at the most suitable order of transformation (i.e., the number of feature compositions), respectively.

For every raw feature, we utilize a series of transformation functions A_{1:T} = [a_1, a_2, ..., a_T] to represent a transformation rule, where the length of the series, T, is the maximum order of transformation it can take. Each element in the series is a unary transformation or a unified binary transformation. For a dataset with n raw features, we have n transformation rules, and the transformation process for each raw feature is independent of the others, since we have already captured the correlation between features by unifying every binary transformation. As a result, we can perform the transformation of each feature in parallel.

Guided by all transformation rules, NFS transforms the raw dataset as shown in Algorithm 1. In order to reduce the feature space, NFS first conducts feature selection on the raw dataset, selecting β features according to feature importance computed via random forest [18] if the number of features in the raw dataset is greater than β. Given a raw feature f_i and its corresponding transformation rule A_i, NFS first tailors A_i to Â_i by removing all actions after the first terminate transformation (Â_i = A_i if terminate does not occur) (line 2). This enables NFS to automatically find the proper length of the transformation rule, which is the order of the transformation. If the length of Â_i is 0, which means that the first element of A_i is terminate, then no feature is generated or eliminated (line 4). If Â_i has length t > 0 and there exists a delete element in Â_i, we delete the raw feature f_i (line 7). Otherwise, we generate a new feature f̂_i = a_t(a_{t-1}(... a_2(a_1(f_i)))) (line 9), where t is the length of Â_i. Finally, we add f̂_i to the dataset (line 10).

Algorithm 1 Dataset Transformation
Input: Dataset D = {f_1, f_2, ..., f_n} with n raw features; transformation rules A = {A_1, A_2, ..., A_n}, where each A_i = [a_1, a_2, ..., a_T]
Output: New dataset D̂ after the transformation
 1: for each f_i ∈ D do
 2:   tailor A_i to Â_i = [a_1, a_2, ..., a_t], where a_{t+1} is the first terminate transformation in A_i
 3:   if t == 0 then
 4:     continue
 5:   else
 6:     if ∃ a ∈ Â_i such that a = delete then
 7:       delete f_i from D
 8:     else
 9:       f̂_i = a_t(a_{t-1}(... a_2(a_1(f_i))))
10:       add f̂_i to D̂
11:     end if
12:   end if
13: end for

NFS traverses and transforms every feature in parallel, which facilitates high-order transformations. By converting each binary transformation into n unary transformations, we reduce the search space from (n choose 2)^T = O(n^(2T)) to n × n^T = O(n^(T+1)), where n is the number of raw features and T is the maximum transformation order. Within the reduced search space, our architecture explores the most promising transformation for each raw feature. Therefore, NFS does not suffer from the feature explosion problem while still capturing high-order transformations. Our architecture also supports feature selection through the special action delete. Furthermore, the terminate function enables NFS to automatically stop at a certain time step, so transformation rules of variable lengths can be generated.
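The listing below is our Python rendering of Algorithm 1 (the dictionary-based encoding and the pre-bound unified binary functions are illustrative assumptions, not the paper's implementation):

import numpy as np

TERMINATE, DELETE = "terminate", "delete"

def transform_dataset(D, rules, unary_fns):
    # D         : dict mapping feature name -> 1-D numpy array
    # rules     : dict mapping feature name -> [a_1, ..., a_T] action names
    # unary_fns : dict mapping action name -> callable; unified binary
    #             transformations are pre-bound to their partner feature.
    D_hat = dict(D)
    for name, rule in rules.items():
        # Line 2: tailor A_i at the first terminate action.
        cut = rule.index(TERMINATE) if TERMINATE in rule else len(rule)
        tailored = rule[:cut]
        if not tailored:              # Lines 3-4: empty rule, feature unchanged
            continue
        if DELETE in tailored:        # Lines 6-7: feature selection
            del D_hat[name]
            continue
        value = D[name]               # Line 9: functional composition
        for action in tailored:
            value = unary_fns[action](value)
        D_hat[name + "_new"] = value  # Line 10: add the generated feature
    return D_hat

For instance, with unary_fns = {"square": np.square, "reciprocal": np.reciprocal, "times_weight": lambda x: x * weight}, the rule ["square", "reciprocal", "times_weight", "terminate"] applied to height reproduces the BMI feature from Section I.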