Introduction to Machine Learning with Python: A Guide for Data Scientists

by Andreas Müller, Sarah Guido

Paperback

$59.99 

Overview

Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination.

You'll learn the steps necessary to create a successful machine-learning application with Python and the scikit-learn library. Authors Andreas Müller and Sarah Guido focus on the practical aspects of using machine learning algorithms, rather than the math behind them. Familiarity with the NumPy and matplotlib libraries will help you get even more from this book.

With this book, you'll learn:

  • Fundamental concepts and applications of machine learning
  • Advantages and shortcomings of widely used machine learning algorithms
  • How to represent data processed by machine learning, including which data aspects to focus on
  • Advanced methods for model evaluation and parameter tuning
  • The concept of pipelines for chaining models and encapsulating your workflow
  • Methods for working with text data, including text-specific processing techniques
  • Suggestions for improving your machine learning and data science skills

Product Details

ISBN-13: 9781449369415
Publisher: O'Reilly Media, Incorporated
Publication date: 10/20/2016
Pages: 398
Sales rank: 447,006
Product dimensions: 6.90(w) x 9.10(h) x 0.60(d)

About the Authors

Andreas Müller received his PhD in machine learning from the University of Bonn. After working as a machine learning researcher on computer vision applications at Amazon for a year, he recently joined the Center for Data Science at New York University. For the last four years, he has been a maintainer and one of the core contributors of scikit-learn, a machine learning toolkit widely used in industry and academia, and an author of and contributor to several other widely used machine learning packages. His mission is to create open tools that lower the barrier to entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Sarah Guido is a data scientist who has spent a lot of time working in start-ups. She loves Python, machine learning, large quantities of data, and the tech world. She is an accomplished conference speaker, currently resides in New York City, and attended the University of Michigan for grad school.

Table of Contents

Preface

1 Introduction

Why Machine Learning?

Problems Machine Learning Can Solve

Knowing Your Task and Knowing Your Data

Why Python?

scikit-learn

Installing scikit-learn

Essential Libraries and Tools

Jupyter Notebook

NumPy

SciPy

matplotlib

pandas

mglearn

Python 2 Versus Python 3

Versions Used in this Book

A First Application: Classifying Iris Species

Meet the Data

Measuring Success: Training and Testing Data

First Things First: Look at Your Data

Building Your First Model: k-Nearest Neighbors

Making Predictions

Evaluating the Model

Summary and Outlook

2 Supervised Learning

Classification and Regression

Generalization, Overfitting, and Underfitting

Relation of Model Complexity to Dataset Size

Supervised Machine Learning Algorithms

Some Sample Datasets

k-Nearest Neighbors

Linear Models

Naive Bayes Classifiers

Decision Trees

Ensembles of Decision Trees

Kernelized Support Vector Machines

Neural Networks (Deep Learning)

Uncertainty Estimates from Classifiers

The Decision Function

Predicting Probabilities

Uncertainty in Multiclass Classification

Summary and Outlook

3 Unsupervised Learning and Preprocessing

Types of Unsupervised Learning

Challenges in Unsupervised Learning

Preprocessing and Scaling

Different Kinds of Preprocessing

Applying Data Transformations

Scaling Training and Test Data the Same Way

The Effect of Preprocessing on Supervised Learning

Dimensionality Reduction, Feature Extraction, and Manifold Learning

Principal Component Analysis (PCA)

Non-Negative Matrix Factorization (NMF)

Manifold Learning with t-SNE

Clustering

k-Means Clustering

Agglomerative Clustering

DBSCAN

Comparing and Evaluating Clustering Algorithms

Summary of Clustering Methods

Summary and Outlook

4 Representing Data and Engineering Features

Categorical Variables

One-Hot-Encoding (Dummy Variables)

Numbers Can Encode Categoricals

Binning, Discretization, Linear Models, and Trees

Interactions and Polynomials

Univariate Nonlinear Transformations

Automatic Feature Selection

Univariate Statistics

Model-Based Feature Selection

Iterative Feature Selection

Utilizing Expert Knowledge

Summary and Outlook

5 Model Evaluation and Improvement

Cross-Validation

Cross-Validation in scikit-learn

Benefits of Cross-Validation

Stratified k-Fold Cross-Validation and Other Strategies

Grid Search

Simple Grid Search

The Danger of Overfitting the Parameters and the Validation Set

Grid Search with Cross-Validation

Evaluation Metrics and Scoring

Keep the End Goal in Mind

Metrics for Binary Classification

Metrics for Multiclass Classification

Regression Metrics

Using Evaluation Metrics in Model Selection

Summary and Outlook

6 Algorithm Chains and Pipelines

Parameter Selection with Preprocessing

Building Pipelines

Using Pipelines in Grid Searches

The General Pipeline Interface

Convenient Pipeline Creation with make_pipeline

Accessing Step Attributes

Accessing Attributes in a Grid-Searched Pipeline

Grid-Searching Preprocessing Steps and Model Parameters

Grid-Searching Which Model To Use

Summary and Outlook

7 Working with Text Data

Types of Data Represented as Strings

Example Application: Sentiment Analysis of Movie Reviews

Representing Text Data as a Bag of Words

Applying Bag-of-Words to a Toy Dataset

Bag-of-Words for Movie Reviews

Stopwords

Rescaling the Data with tf-idf

Investigating Model Coefficients

Bag-of-Words with More Than One Word (n-Grams)

Advanced Tokenization, Stemming, and Lemmatization

Topic Modeling and Document Clustering

Latent Dirichlet Allocation

Summary and Outlook

8 Wrapping Up

Approaching a Machine Learning Problem

Humans in the Loop

From Prototype to Production

Testing Production Systems

Building Your Own Estimator

Where to Go from Here

Theory

Other Machine Learning Frameworks and Packages

Ranking, Recommender Systems, and Other Kinds of Learning

Probabilistic Modeling, Inference, and Probabilistic Programming

Neural Networks

Scaling to Larger Datasets

Honing Your Skills

Conclusion

Index
