Data Preparation = Data Cleansing + Feature Engineering

Data preparation is the heart of data science. It includes data cleansing and feature engineering, and domain knowledge is also very important for achieving good results. Data preparation cannot be fully automated, at least not in the beginning, and it often takes 60 to 80 percent of the whole analytical pipeline. Nevertheless, it is a mandatory task for getting the best accuracy from machine learning algorithms on your datasets.

Data Cleansing puts data into the right shape and quality for analysis. It includes many functions, for example the following (a sketch of several of them appears at the end of this section):

- Basics (select, filter, removal of duplicates, ...)
- Sampling (balanced, stratified, ...)
- Data Partitioning (create training + validation + test data sets, ...)
- Transformations (normalisation, standardisation, scaling, pivoting, ...)
- Binning (count-based, handling of missing values as its own group, ...)
- Data Replacement (cutting, splitting, merging, ...)
- Weighting and Selection (attribute weighting, automatic optimization, ...)
- Attribute Generation (ID generation, ...)
- Imputation (replacement of missing observations by using statistical algorithms)

Feature Engineering selects the right attributes to analyze. You use domain knowledge of the data to select or create attributes that make machine learning algorithms work. The feature engineering process includes (see the second sketch at the end of this section):

- Brainstorming or testing of features
- Feature selection
- Validation of how the features work with your model
- Improvement of features if needed
- Returning to brainstorming / creation of more features until the work is done

Note that feature engineering is already part of the modelling step of building an analytic model, but it also leverages data preparation functions (such as extracting parts of a string). Both data cleansing and feature engineering are part of data preparation and fundamental to the application of machine learning and deep learning. Both are also difficult and time-consuming.

Data preparation occurs in different phases of an analytics project:

- Data Preprocessing: Preparation of data directly after accessing it from a data source. Typically realized by a developer or data scientist for initial transformations, aggregations and data cleansing. This step is done before the interactive analysis of data begins, and it is executed once.
- Data Wrangling: Preparation of data during interactive data analysis and model building. Typically done by a data scientist or business analyst to change views on a dataset and for feature engineering. This step iteratively changes the shape of a dataset until it works well for finding insights or building a good analytic model.

The Need for Data Preprocessing and Data Wrangling

Let's take a look at the typical analytical pipeline when you build an analytic model:

1. Data Access
2. Data Preprocessing
3. Exploratory Data Analysis (EDA)
4. Model Building
5. Model Validation
6. Model Execution
7. Deployment

Step 2 focuses on data preprocessing before you build an analytic model, while data wrangling is used in steps 3 and 4 to adjust datasets interactively while analyzing data and building a model. Note that these three steps (2, 3 and 4) can include both data cleansing and feature engineering.
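To make the data-cleansing functions listed above concrete, here is a minimal sketch using pandas and scikit-learn. It shows one reasonable configuration, not a prescription from the article: the file name `customers.csv`, the columns `age`, `income` and `label`, the mean-imputation strategy and the 60/20/20 split are all illustrative assumptions.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical input; the file and column names are assumptions.
df = pd.read_csv("customers.csv")

# Basics: select attributes, filter implausible rows, remove duplicates.
df = df[["age", "income", "label"]]
df = df[df["age"] >= 0]
df = df.drop_duplicates()

# Imputation: replace missing observations with a statistic (here the mean).
imputer = SimpleImputer(strategy="mean")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

# Sampling / data partitioning: stratified 60/20/20 train/validation/test split.
train, rest = train_test_split(
    df, test_size=0.4, random_state=42, stratify=df["label"])
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest["label"])

# Transformations: standardise features, fitting on training data only
# so that no information leaks from validation or test data.
scaler = StandardScaler()
X_train = scaler.fit_transform(train[["age", "income"]])
X_val = scaler.transform(val[["age", "income"]])
X_test = scaler.transform(test[["age", "income"]])
```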
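The feature-engineering loop can be sketched in the same spirit. The derived attribute `income_per_age`, the made-up data and the random-forest model are illustrative assumptions; the point is the cycle of creating a feature, validating it against the model, and keeping it only if it helps.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical training data (same columns as in the previous sketch).
train = pd.DataFrame({
    "age":    [25, 32, 47, 51, 23, 38, 44, 29, 60, 35] * 3,
    "income": [30, 45, 80, 75, 28, 52, 61, 40, 90, 48] * 3,
    "label":  [0, 0, 1, 1, 0, 1, 1, 0, 1, 0] * 3,
})

# Attribute generation: a brainstormed, domain-driven feature.
train["income_per_age"] = train["income"] / train["age"]

# Validation: compare cross-validated accuracy with and without the feature.
model = RandomForestClassifier(random_state=42)
score_without = cross_val_score(
    model, train[["age", "income"]], train["label"], cv=5).mean()
score_with = cross_val_score(
    model, train[["age", "income", "income_per_age"]], train["label"], cv=5).mean()

# Keep the feature only if it improves the score;
# otherwise return to brainstorming and try another one.
print(f"without: {score_without:.3f}  with: {score_with:.3f}")
```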
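Finally, the data wrangling of steps 3 and 4 is mostly about changing views on a dataset interactively. A small sketch, again with made-up data:

```python
import pandas as pd

# Hypothetical transactions to explore during EDA and model building.
tx = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c"],
    "month":    ["Jan", "Feb", "Jan", "Feb", "Jan"],
    "amount":   [10.0, 12.5, 7.0, 9.0, 30.0],
})

# Pivoting: reshape to one row per customer, one column per month,
# a different view of the same data chosen to suit the analysis at hand.
view = tx.pivot_table(index="customer", columns="month",
                      values="amount", aggfunc="sum")

# If this view does not surface insight, reshape again, for example
# by aggregating along a different axis, until the shape works.
monthly_totals = tx.groupby("month")["amount"].sum()
```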