Source: www.gbv.de
Data Wrangling with Python
Jacqueline Kazil and Katharine Jarmul
Beijing Boston Farnham Sebastopol Tokyo

Table of Contents

Preface xi

1. Introduction to Python 1
    Why Python 4
    Getting Started with Python 5
    Which Python Version 6
    Setting Up Python on Your Machine 7
    Test Driving Python 11
    Install pip 14
    Install a Code Editor 15
    Optional: Install IPython 16
    Summary 16

2. Python Basics 17
    Basic Data Types 18
    Strings 18
    Integers and Floats 19
    Data Containers 23
    Variables 23
    Lists 25
    Dictionaries 27
    What Can the Various Data Types Do? 28
    String Methods: Things Strings Can Do 30
    Numerical Methods: Things Numbers Can Do 31
    List Methods: Things Lists Can Do 32
    Dictionary Methods: Things Dictionaries Can Do 33
    Helpful Tools: type, dir, and help 34
    type 34
    dir 35
    help 37
    Putting It All Together 38
    What Does It All Mean? 38
    Summary 40

3. Data Meant to Be Read by Machines 43
    CSV Data 44
    How to Import CSV Data 46
    Saving the Code to a File; Running from Command Line 49
    JSON Data 52
    How to Import JSON Data 53
    XML Data 55
    How to Import XML Data 57
    Summary 70

4. Working with Excel Files 73
    Installing Python Packages 73
    Parsing Excel Files 75
    Getting Started with Parsing 75
    Summary 89

5. PDFs and Problem Solving in Python 91
    Avoid Using PDFs! 91
    Programmatic Approaches to PDF Parsing 92
    Opening and Reading Using slate 94
    Converting PDF to Text 96
    Parsing PDFs Using pdfminer 97
    Learning How to Solve Problems 115
    Exercise: Use Table Extraction, Try a Different Library 116
    Exercise: Clean the Data Manually 121
    Exercise: Try Another Tool 121
    Uncommon File Types 124
    Summary 124

6. Acquiring and Storing Data 127
    Not All Data Is Created Equal 128
    Fact Checking 128
    Readability, Cleanliness, and Longevity 129
    Where to Find Data 130
    Using a Telephone 130
    US Government Data 132
    Government and Civic Open Data Worldwide 133
    Organization and Non-Government Organization (NGO) Data 135
    Education and University Data 135
    Medical and Scientific Data 136
    Crowdsourced Data and APIs 136
    Case Studies: Example Data Investigation 137
    Ebola Crisis 138
    Train Safety 138
    Football Salaries 139
    Child Labor 139
    Storing Your Data: When, Why, and How? 140
    Databases: A Brief Introduction 141
    Relational Databases: MySQL and PostgreSQL 142
    Non-Relational Databases: NoSQL 144
    Setting Up Your Local Database with Python 145
    When to Use a Simple File 147
    Cloud-Storage and Python 147
    Local Storage and Python 148
    Alternative Data Storage 148
    Summary 148

7. Data Cleanup: Investigation, Matching, and Formatting 151
    Why Clean Data? 151
    Data Cleanup Basics 152
    Identifying Values for Data Cleanup 153
    Formatting Data 164
    Finding Outliers and Bad Data 169
    Finding Duplicates 175
    Fuzzy Matching 179
    RegEx Matching 183
    What to Do with Duplicate Records 188
    Summary 189

8. Data Cleanup: Standardizing and Scripting 193
    Normalizing and Standardizing Your Data 193
    Saving Your Data 194
    Determining What Data Cleanup Is Right for Your Project 197
    Scripting Your Cleanup 198
    Testing with New Data 214
    Summary 216