Introduction
  About This Book
  Foolish Assumptions
  Icons Used in This Book
  Beyond the Book
  Where to Go from Here
Part 1: Getting Started With Data Science and Python
  Chapter 1: Discovering the Match between Data Science and Python
    Defining the Sexiest Job of the 21st Century
      Considering the emergence of data science
      Outlining the core competencies of a data scientist
      Linking data science, big data, and AI
      Understanding the role of programming
    Creating the Data Science Pipeline
      Preparing the data
      Performing exploratory data analysis
      Learning from data
      Visualizing
      Obtaining insights and data products
    Understanding Python's Role in Data Science
      Considering the shifting profile of data scientists
      Working with a multipurpose, simple, and efficient language
    Learning to Use Python Fast
      Loading data
      Training a model
      Viewing a result
  Chapter 2: Introducing Python's Capabilities and Wonders
    Why Python?
    Grasping Python's Core Philosophy
      Contributing to data science
      Discovering present and future development goals
    Working with Python
      Getting a taste of the language
      Understanding the need for indentation
      Working at the command line or in the IDE
    Performing Rapid Prototyping and Experimentation
    Considering Speed of Execution
    Visualizing Power
    Using the Python Ecosystem for Data Science
      Accessing scientific tools using SciPy
      Performing fundamental scientific computing using NumPy
      Performing data analysis using pandas
      Implementing machine learning using Scikit-learn
      Going for deep learning with Keras and TensorFlow
      Plotting the data using matplotlib
      Creating graphs with NetworkX
      Parsing HTML documents using Beautiful Soup
  Chapter 3: Setting Up Python for Data Science
    Considering the Off-the-Shelf Cross-Platform Scientific Distributions
      Getting Continuum Analytics Anaconda
      Getting Enthought Canopy Express
      Getting WinPython
    Installing Anaconda on Windows
    Installing Anaconda on Linux
    Installing Anaconda on Mac OS X
    Downloading the Datasets and Example Code
      Using Jupyter Notebook
      Defining the code repository
      Understanding the datasets used in this book
  Chapter 4: Working with Google Colab
    Defining Google Colab
      Understanding what Google Colab does
      Considering the online coding difference
      Using local runtime support
    Getting a Google Account
      Creating the account
      Signing in
    Working with Notebooks
      Creating a new notebook
      Opening existing notebooks
      Saving notebooks
      Downloading notebooks
    Performing Common Tasks
      Creating code cells
      Creating text cells
      Creating special cells
      Editing cells
      Moving cells
    Using Hardware Acceleration
    Executing the Code
    Viewing Your Notebook
      Displaying the table of contents
      Getting notebook information
      Checking code execution
    Sharing Your Notebook
    Getting Help
Part 2: Getting Your Hands Dirty With Data
  Chapter 5: Understanding the Tools
    Using the Jupyter Console
      Interacting with screen text
      Changing the window appearance
      Getting Python help
      Getting IPython help
      Using magic functions
      Discovering objects
    Using Jupyter Notebook
      Working with styles
      Restarting the kernel
      Restoring a checkpoint
    Performing Multimedia and Graphic Integration
      Embedding plots and other images
      Loading examples from online sites
      Obtaining online graphics and multimedia
  Chapter 6: Working with Real Data
    Uploading, Streaming, and Sampling Data
      Uploading small amounts of data into memory
      Streaming large amounts of data into memory
      Generating variations on image data
      Sampling data in different ways
    Accessing Data in Structured Flat-File Form
      Reading from a text file
      Reading CSV delimited format
      Reading Excel and other Microsoft Office files
    Sending Data in Unstructured File Form
    Managing Data from Relational Databases
    Interacting with Data from NoSQL Databases
    Accessing Data from the Web
  Chapter 7: Conditioning Your Data
    Juggling between NumPy and pandas
      Knowing when to use NumPy
      Knowing when to use pandas
    Validating Your Data
      Figuring out what's in your data
      Removing duplicates
      Creating a data map and data plan
    Manipulating Categorical Variables
      Creating categorical variables
      Renaming levels
      Combining levels
    Dealing with Dates in Your Data
      Formatting date and time values
      Using the right time transformation
    Dealing with Missing Data
      Finding the missing data
      Encoding missingness
      Imputing missing data
    Slicing and Dicing: Filtering and Selecting Data
      Slicing rows
      Slicing columns
      Dicing
    Concatenating and Transforming
      Adding new cases and variables
      Removing data
      Sorting and shuffling
    Aggregating Data at Any Level
  Chapter 8: Shaping Data
    Working with HTML Pages
      Parsing XML and HTML
      Using XPath for data extraction
    Working with Raw Text
      Dealing with Unicode
      Stemming and removing stop words
      Introducing regular expressions
    Using the Bag of Words Model and Beyond
      Understanding the bag of words model
      Working with n-grams
      Implementing TF-IDF transformations
    Working with Graph Data
      Understanding the adjacency matrix
      Using NetworkX basics
  Chapter 9: Putting What You Know in Action
    Contextualizing Problems and Data
      Evaluating a data science problem
      Researching solutions
      Formulating a hypothesis
      Preparing your data
    Considering the Art of Feature Creation
      Defining feature creation
      Combining variables
      Understanding binning and discretization
      Using indicator variables
      Transforming distributions
    Performing Operations on Arrays
      Using vectorization
      Performing simple arithmetic on vectors and matrices
      Performing matrix vector multiplication
      Performing matrix multiplication
Part 3: Visualizing Information
  Chapter 10: Getting a Crash Course in MatPlotLib
    Starting with a Graph
      Defining the plot
      Drawing multiple lines and plots
      Saving your work to disk
    Setting the Axis, Ticks, Grids
      Getting the axes
      Formatting the axes
      Adding grids
    Defining the Line Appearance
      Working with line styles
      Using colors
      Adding markers
    Using Labels, Annotations, and Legends
      Adding labels
      Annotating the chart
      Creating a legend
  Chapter 11: Visualizing the Data
    Choosing the Right Graph
      Showing parts of a whole with pie charts
      Creating comparisons with bar charts
      Showing distributions using histograms
      Depicting groups using boxplots
      Seeing data patterns using scatterplots
    Creating Advanced Scatterplots
      Depicting groups
      Showing correlations
    Plotting Time Series
      Representing time on axes
      Plotting trends over time
    Plotting Geographical Data
      Using an environment in Notebook
      Getting the Basemap toolkit
      Dealing with deprecated library issues
      Using Basemap to plot geographic data
    Visualizing Graphs
      Developing undirected graphs
      Developing directed graphs
Part 4: Wrangling Data
  Chapter 12: Stretching Python's Capabilities
    Playing with Scikit-learn
      Understanding classes in Scikit-learn
      Defining applications for data science
    Performing the Hashing Trick
      Using hash functions
      Demonstrating the hashing trick
      Working with deterministic selection
    Considering Timing and Performance
      Benchmarking with timeit
      Working with the memory profiler
    Running in Parallel on Multiple Cores
      Performing multicore parallelism
      Demonstrating multiprocessing
  Chapter 13: Exploring Data Analysis
    The EDA Approach
    Defining Descriptive Statistics for Numeric Data
      Measuring central tendency
      Measuring variance and range
      Working with percentiles
      Defining measures of normality
    Counting for Categorical Data
      Understanding frequencies
      Creating contingency tables
    Creating Applied Visualization for EDA
      Inspecting boxplots
      Performing t-tests after boxplots
      Observing parallel coordinates
      Graphing distributions
      Plotting scatterplots
    Understanding Correlation
      Using covariance and correlation
      Using nonparametric correlation
      Considering the chi-square test for tables
    Modifying Data Distributions
      Using different statistical distributions
      Creating a Z-score standardization
      Transforming other notable distributions
  Chapter 14: Reducing Dimensionality
    Understanding SVD
      Looking for dimensionality reduction
      Using SVD to measure the invisible
    Performing Factor Analysis and PCA
      Considering the psychometric model
      Looking for hidden factors
      Using components, not factors
      Achieving dimensionality reduction
      Squeezing information with t-SNE
    Understanding Some Applications
      Recognizing faces with PCA
      Extracting topics with NMF
      Recommending movies
  Chapter 15: Clustering
    Clustering with K-means
      Understanding centroid-based algorithms
      Creating an example with image data
      Looking for optimal solutions
      Clustering big data
    Performing Hierarchical Clustering
      Using a hierarchical cluster solution
      Using a two-phase clustering solution
    Discovering New Groups with DBScan
  Chapter 16: Detecting Outliers in Data
    Considering Outlier Detection
      Finding more things that can go wrong
      Understanding anomalies and novel data
    Examining a Simple Univariate Method
      Leveraging on the Gaussian distribution
      Making assumptions and checking out
    Developing a Multivariate Approach
      Using principal component analysis
      Using cluster analysis for spotting outliers
      Automating detection with Isolation Forests
Part 5: Learning From Data
  Chapter 17: Exploring Four Simple and Effective Algorithms
    Guessing the Number: Linear Regression
      Defining the family of linear models
      Using more variables
      Understanding limitations and problems
    Moving to Logistic Regression
      Applying logistic regression
      Considering when classes are more
    Making Things as Simple as Naive Bayes
      Finding out that Naive Bayes isn't so naive
      Predicting text classifications
    Learning Lazily with Nearest Neighbors
      Predicting after observing neighbors
      Choosing your k parameter wisely
  Chapter 18: Performing Cross-Validation, Selection, and Optimization
    Pondering the Problem of Fitting a Model
      Understanding bias and variance
      Defining a strategy for picking models
      Dividing between training and test sets
    Cross-Validating
      Using cross-validation on k folds
      Sampling stratifications for complex data
    Selecting Variables Like a Pro
      Selecting by univariate measures
      Using a greedy search
    Pumping Up Your Hyperparameters
      Implementing a grid search
      Trying a randomized search
  Chapter 19: Increasing Complexity with Linear and Nonlinear Tricks
    Using Nonlinear Transformations
      Doing variable transformations
      Creating interactions between variables
    Regularizing Linear Models
      Relying on Ridge regression (L2)
      Using the Lasso (L1)
      Leveraging regularization
      Combining L1 & L2: Elasticnet
    Fighting with Big Data Chunk by Chunk
      Determining when there is too much data
      Implementing Stochastic Gradient Descent
    Understanding Support Vector Machines
      Relying on a computational method
      Fixing many new parameters
      Classifying with SVC
      Going nonlinear is easy
      Performing regression with SVR
      Creating a stochastic solution with SVM
    Playing with Neural Networks
      Understanding neural networks
      Classifying and regressing with neurons
  Chapter 20: Understanding the Power of the Many
    Starting with a Plain Decision Tree
      Understanding a decision tree
      Creating trees for different purposes
    Making Machine Learning Accessible
      Working with a Random Forest classifier
      Working with a Random Forest regressor
      Optimizing a Random Forest
    Boosting Predictions
      Knowing that many weak predictors win
      Setting a gradient boosting classifier
      Running a gradient boosting regressor
      Using GBM hyperparameters
Part 6: The Part of Tens
  Chapter 21: Ten Essential Data Resources
    Discovering the News with Subreddit
    Getting a Good Start with KDnuggets
    Locating Free Learning Resources with Quora
    Gaining Insights with Oracle's Data Science Blog
    Accessing the Huge List of Resources on Data Science Central
    Learning New Tricks from the Aspirational Data Scientist
    Obtaining the Most Authoritative Sources at Udacity
    Receiving Help with Advanced Topics at Conductrics
    Obtaining the Facts of Open Source Data Science from Masters
    Zeroing In on Developer Resources with Jonathan Bower
  Chapter 22: Ten Data Challenges You Should Take
    Meeting the Data Science London + Scikit-learn Challenge
    Predicting Survival on the Titanic
    Finding a Kaggle Competition that Suits Your Needs
    Honing Your Overfit Strategies
    Trudging Through the MovieLens Dataset
    Getting Rid of Spam E-mails
    Working with Handwritten Information
    Working with Pictures
    Analyzing Amazon.com Reviews
    Interacting with a Huge Graph
Index
John Paul Mueller is a tech editor and the author of over 100 books on topics from networking and home security to database management and heads-down programming. Follow John's blog at http://blog.johnmuellerbooks.com/.
Luca Massaron is a data scientist who specializes in organizing and interpreting big data and transforming it into smart data. He is a Google Developer Expert (GDE) in machine learning.