Data Preparation = Data Cleansing + Feature Engineering

Data preparation is the heart of data science. It includes data cleansing and feature engineering, and domain knowledge is also very important for achieving good results. Data preparation cannot be fully automated, at least not in the beginning, and it often takes 60 to 80 percent of the whole analytical pipeline. Nevertheless, it is a mandatory task for getting the best accuracy from machine learning algorithms on your datasets.

Data Cleansing puts data into the right shape and quality for analysis. It includes many functions, for example the following (a sketch of several of them appears at the end of this section):

- Basics (select, filter, removal of duplicates, ...)
- Sampling (balanced, stratified, ...)
- Data Partitioning (create training + validation + test data sets, ...)
- Transformations (normalisation, standardisation, scaling, pivoting, ...)
- Binning (count-based, handling of missing values as its own group, ...)
- Data Replacement (cutting, splitting, merging, ...)
- Weighting and Selection (attribute weighting, automatic optimization, ...)
- Attribute Generation (ID generation, ...)
- Imputation (replacement of missing observations by using statistical algorithms)

Feature Engineering selects the right attributes to analyze. You use domain knowledge of the data to select or create attributes that make machine learning algorithms work. The feature engineering process includes (see the second sketch at the end of this section):

- Brainstorming or testing of features
- Feature selection
- Validation of how the features work with your model
- Improvement of features if needed
- Returning to brainstorming / creation of more features until the work is done

Note that feature engineering is already part of the modelling step of building an analytic model, but it also leverages data preparation functions (such as extracting parts of a string). Both data cleansing and feature engineering are part of data preparation and fundamental to the application of machine learning and deep learning. Both are also difficult and time-consuming.

Data preparation occurs in different phases of an analytics project:

- Data Preprocessing: Preparation of data directly after accessing it from a data source. Typically realized by a developer or data scientist for initial transformations, aggregations and data cleansing. This step is done before the interactive analysis of data begins, and it is executed once.
- Data Wrangling: Preparation of data during interactive data analysis and model building. Typically done by a data scientist or business analyst to change views on a dataset and for feature engineering. This step iteratively changes the shape of a dataset until it works well for finding insights or building a good analytic model.

The Need for Data Preprocessing and Data Wrangling

Let's take a look at the typical analytical pipeline when you build an analytic model:

1. Data Access
2. Data Preprocessing
3. Exploratory Data Analysis (EDA)
4. Model Building
5. Model Validation
6. Model Execution
7. Deployment

Step 2 focuses on data preprocessing before you build an analytic model, while data wrangling is used in steps 3 and 4 to adjust datasets interactively while analyzing data and building a model. Note that these three steps (2, 3 and 4) can include both data cleansing and feature engineering.
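To make the data-cleansing functions listed above concrete, here is a minimal sketch using pandas and scikit-learn. It shows one reasonable configuration, not a prescription from the article: the file name `customers.csv`, the columns `age`, `income` and `label`, the mean-imputation strategy and the 60/20/20 split are all illustrative assumptions.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical input; the file and column names are assumptions.
df = pd.read_csv("customers.csv")

# Basics: select attributes, filter implausible rows, remove duplicates.
df = df[["age", "income", "label"]]
df = df[df["age"] >= 0]
df = df.drop_duplicates()

# Imputation: replace missing observations with a statistic (here the mean).
imputer = SimpleImputer(strategy="mean")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

# Sampling / data partitioning: stratified 60/20/20 train/validation/test split.
train, rest = train_test_split(
    df, test_size=0.4, random_state=42, stratify=df["label"])
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest["label"])

# Transformations: standardise features, fitting on training data only
# so that no information leaks from validation or test data.
scaler = StandardScaler()
X_train = scaler.fit_transform(train[["age", "income"]])
X_val = scaler.transform(val[["age", "income"]])
X_test = scaler.transform(test[["age", "income"]])
```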
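The feature-engineering loop can be sketched in the same spirit. The derived attribute `income_per_age`, the made-up data and the random-forest model are illustrative assumptions; the point is the cycle of creating a feature, validating it against the model, and keeping it only if it helps.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical training data (same columns as in the previous sketch).
train = pd.DataFrame({
    "age":    [25, 32, 47, 51, 23, 38, 44, 29, 60, 35] * 3,
    "income": [30, 45, 80, 75, 28, 52, 61, 40, 90, 48] * 3,
    "label":  [0, 0, 1, 1, 0, 1, 1, 0, 1, 0] * 3,
})

# Attribute generation: a brainstormed, domain-driven feature.
train["income_per_age"] = train["income"] / train["age"]

# Validation: compare cross-validated accuracy with and without the feature.
model = RandomForestClassifier(random_state=42)
score_without = cross_val_score(
    model, train[["age", "income"]], train["label"], cv=5).mean()
score_with = cross_val_score(
    model, train[["age", "income", "income_per_age"]], train["label"], cv=5).mean()

# Keep the feature only if it improves the score;
# otherwise return to brainstorming and try another one.
print(f"without: {score_without:.3f}  with: {score_with:.3f}")
```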
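Finally, the data wrangling of steps 3 and 4 is mostly about changing views on a dataset interactively. A small sketch, again with made-up data:

```python
import pandas as pd

# Hypothetical transactions to explore during EDA and model building.
tx = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c"],
    "month":    ["Jan", "Feb", "Jan", "Feb", "Jan"],
    "amount":   [10.0, 12.5, 7.0, 9.0, 30.0],
})

# Pivoting: reshape to one row per customer, one column per month,
# a different view of the same data chosen to suit the analysis at hand.
view = tx.pivot_table(index="customer", columns="month",
                      values="amount", aggfunc="sum")

# If this view does not surface insight, reshape again, for example
# by aggregating along a different axis, until the shape works.
monthly_totals = tx.groupby("month")["amount"].sum()
```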