Practical Machine Learning with Python: A Problem-Solver’s Guide to Building Real-World Intelligent Systems

Dipanjan Sarkar, Raghav Bali, Tushar Sharma


Book Details
Price: 3.00
Pages: 545
File Size: 19,858 KB
File Type: PDF
ISBN-13: 978-1-4842-3206-4 (pbk), 978-1-4842-3207-1 (electronic)
Copyright © 2018 by Dipanjan Sarkar, Raghav Bali and Tushar Sharma

About the Authors
Dipanjan Sarkar is a data scientist at Intel, on a mission to make the
world more connected and productive. He primarily works on Data
Science, analytics, business intelligence, application development, and
building large-scale intelligent systems. He holds a master of technology
degree in Information Technology with specializations in Data Science
and Software Engineering from the International Institute of Information
Technology, Bangalore. He is also an avid supporter of self-learning,
especially Massive Open Online Courses, and also holds a Data Science
Specialization from Johns Hopkins University on Coursera.
Dipanjan has been an analytics practitioner for several years,
specializing in statistical, predictive, and text analytics. Having a
passion for Data Science and education, he is a Data Science Mentor
at Springboard, helping people up-skill on areas like Data Science and
Machine Learning. Dipanjan has also authored several books on R,
Python, Machine Learning, and analytics, including Text Analytics with
Python, Apress 2016. Besides this, he occasionally reviews technical books
and acts as a course beta tester for Coursera. Dipanjan’s interests include learning about new technology, financial markets, disruptive start-ups, Data Science, and more recently, artificial intelligence and Deep Learning.
Raghav Bali is a data scientist at Intel, enabling proactive and data-driven
IT initiatives. He primarily works on Data Science, analytics, business
intelligence, and development of scalable Machine Learning-based
solutions. He has also worked in domains such as ERP and finance with
some of the leading organizations in the world. Raghav has a master’s
degree (gold medalist) in Information Technology from the International
Institute of Information Technology, Bangalore.
Raghav is a technology enthusiast who loves reading and playing
around with new gadgets and technologies. He has also authored
several books on R, Machine Learning, and Analytics. He is a shutterbug,
capturing moments when he isn’t busy solving problems.
Tushar Sharma has a master’s degree from the International Institute of
Information Technology, Bangalore. He works as a Data Scientist with
Intel. His work involves developing analytical solutions at scale using
enormous volumes of infrastructure data. In his previous role, he worked
in the financial domain developing scalable Machine Learning solutions
for major financial organizations. He is proficient in Python, R, and Big
Data frameworks like Spark and Hadoop.
Apart from work, Tushar enjoys watching movies and playing badminton,
and is an avid reader. He has also authored a book on R and social media analytics.

About the Technical Reviewer
Jojo Moolayil is an Artificial Intelligence professional and published
author of the book: Smarter Decisions – The Intersection of IoT and
Decision Science. With over five years of industrial experience in A.I.,
Machine Learning, Decision Science, and IoT, he has worked with
industry leaders on high impact and critical projects across multiple
verticals. He is currently working with General Electric, the pioneer and
leader in Data Science for Industrial IoT, and lives in Bengaluru—the
Silicon Valley of India.
He was born and raised in Pune, India and graduated from University
of Pune with a major in Information Technology Engineering. He started
his career with Mu Sigma Inc., the world’s largest pure play analytics
provider and then Flutura, an IoT Analytics startup. He has also worked
with the leaders of many Fortune 50 clients.
In his present role with General Electric, he focuses on solving A.I.
and decision science problems for Industrial IoT use cases and developing
Data Science products and platforms for Industrial IoT.
Apart from authoring books on decision science and IoT, Jojo has also been a technical reviewer for
various books on Machine Learning and Business Analytics with Apress. He is an active Data Science tutor and maintains a blog at http://www.jojomoolayil.com/web/blog/.
You can reach out to Jojo at:
I would like to thank my family, 
friends, and mentors for their kind support and constant motivation throughout my life.

Foreword
The availability of affordable compute power enabled by Moore’s law has been enabling rapid advances in Machine Learning solutions and driving adoption across diverse segments of the industry. The ability to learn complex models underlying the real-world processes from observed (training) data through systemic, easy-to-apply Machine Learning solution stacks has been of tremendous attraction to businesses to harness meaningful business value. The appeal and opportunities of Machine Learning have resulted in the availability of many resources—books, tutorials, online training, and courses for solution developers, analysts, engineers, and scientists to learn the algorithms and implement platforms and methodologies. It is not uncommon for someone just starting out to get overwhelmed by the abundance of the material. In addition, not following a structured workflow might not yield consistent and relevant results with Machine Learning solutions.
Key requirements for building robust Machine Learning applications and getting consistent, actionable results involve investing significant time and effort in understanding the objectives and key value of the project, establishing robust data pipelines, analyzing and visualizing data, and feature engineering, selection, and modeling. The iterative nature of these projects involves several Select → Apply → Validate → Tune cycles before coming up with a suitable Machine Learning-based model. A final and important step is to integrate the solution (Machine Learning model) into existing (or new) organization systems or business processes to sustain actionable and relevant results.
Hence, the broad requirements of the ingredients for a robust Machine Learning solution require a development platform that is suited not just for interactive modeling of Machine Learning, but also excels in data ingestion, processing, visualization, systems integration, and strong ecosystem support for runtime deployment and maintenance. Python is an excellent choice of language because it fits the need of the hour with its multi-purpose capabilities, ease of implementation and integration, active developer community, and ever-growing Machine Learning ecosystem, leading to its adoption for Machine Learning growing rapidly.
The authors of this book have leveraged their hands-on experience with solving real-world problems using Python and its Machine Learning ecosystem to help the readers gain the solid knowledge needed to apply essential concepts, methodologies, tools, and techniques for solving their own real-world problems and use-cases. Practical Machine Learning with Python aims to cater to readers with varying skill levels ranging from beginners to experts and enable them in structuring and building practical Machine Learning solutions.
—Ram R. Varra, Senior Principal Engineer, Intel


Table of Contents
About the Authors
About the Technical Reviewer
Acknowledgments
Foreword
Introduction
■■Part I: Understanding Machine Learning
■■Chapter 1: Machine Learning Basics
The Need for Machine Learning
  Making Data-Driven Decisions
  Efficiency and Scale
  Traditional Programming Paradigm
  Why Machine Learning?
Understanding Machine Learning
  Why Make Machines Learn?
  Formal Definition
  A Multi-Disciplinary Field
Computer Science
  Theoretical Computer Science
  Practical Computer Science
  Important Concepts
Data Science
Mathematics
  Important Concepts
Statistics
Data Mining
Artificial Intelligence
Natural Language Processing
Deep Learning
  Important Concepts
Machine Learning Methods
Supervised Learning
  Classification
  Regression
Unsupervised Learning
  Clustering
  Dimensionality Reduction
  Anomaly Detection
  Association Rule-Mining
Semi-Supervised Learning
Reinforcement Learning
Batch Learning
Online Learning
Instance Based Learning
Model Based Learning
The CRISP-DM Process Model
  Business Understanding
  Data Understanding
  Data Preparation
  Modeling
  Evaluation
  Deployment
Building Machine Intelligence
  Machine Learning Pipelines
  Supervised Machine Learning Pipeline
  Unsupervised Machine Learning Pipeline
Real-World Case Study: Predicting Student Grant Recommendations
  Objective
  Data Retrieval
  Data Preparation
  Modeling
  Model Evaluation
  Model Deployment
  Prediction in Action
Challenges in Machine Learning
Real-World Applications of Machine Learning
Summary
■■Chapter 2: The Python Machine Learning Ecosystem
Python: An Introduction
  Strengths
  Pitfalls
  Setting Up a Python Environment
  Why Python for Data Science?
Introducing the Python Machine Learning Ecosystem
  Jupyter Notebooks
  NumPy
  Pandas
  Scikit-learn
  Neural Networks and Deep Learning
  Text Analytics and Natural Language Processing
  Statsmodels
Summary
■■Part II: The Machine Learning Pipeline
■■Chapter 3: Processing, Wrangling, and Visualizing Data
Data Collection
  CSV
  JSON
  XML
  HTML and Scraping
  SQL
Data Description
  Numeric
  Text
  Categorical
Data Wrangling
  Understanding Data
  Filtering Data
  Typecasting
  Transformations
  Imputing Missing Values
  Handling Duplicates
  Handling Categorical Data
  Normalizing Values
  String Manipulations
Data Summarization
Data Visualization
  Visualizing with Pandas
  Visualizing with Matplotlib
  Python Visualization Ecosystem
Summary
■■Chapter 4: Feature Engineering and Selection
Features: Understand Your Data Better
  Data and Datasets
  Features
  Models
Revisiting the Machine Learning Pipeline
Feature Extraction and Engineering
  What Is Feature Engineering?
  Why Feature Engineering?
  How Do You Engineer Features?
Feature Engineering on Numeric Data
  Raw Measures
  Binarization
  Rounding
  Interactions
  Binning
  Statistical Transformations
Feature Engineering on Categorical Data
  Transforming Nominal Features
  Transforming Ordinal Features
  Encoding Categorical Features
Feature Engineering on Text Data
  Text Pre-Processing
  Bag of Words Model
  Bag of N-Grams Model
  TF-IDF Model
  Document Similarity
  Topic Models
  Word Embeddings
Feature Engineering on Temporal Data
  Date-Based Features
  Time-Based Features
Feature Engineering on Image Data
  Image Metadata Features
  Raw Image and Channel Pixels
  Grayscale Image Pixels
  Binning Image Intensity Distribution
  Image Aggregation Statistics
  Edge Detection
  Object Detection
  Localized Feature Extraction
  Visual Bag of Words Model
  Automated Feature Engineering with Deep Learning
Feature Scaling
  Standardized Scaling
  Min-Max Scaling
  Robust Scaling
Feature Selection
  Threshold-Based Methods
  Statistical Methods
  Recursive Feature Elimination
  Model-Based Selection
Dimensionality Reduction
  Feature Extraction with Principal Component Analysis
Summary
■■Chapter 5: Building, Tuning, and Deploying Models
Building Models
  Model Types
  Learning a Model
  Model Building Examples
Model Evaluation
  Evaluating Classification Models
  Evaluating Clustering Models
  Evaluating Regression Models
Model Tuning
  Introduction to Hyperparameters
  The Bias-Variance Tradeoff
  Cross Validation
  Hyperparameter Tuning Strategies
Model Interpretation
  Understanding Skater
  Model Interpretation in Action
Model Deployment
  Model Persistence
  Custom Development
  In-House Model Deployment
  Model Deployment as a Service
Summary
■■Part III: Real-World Case Studies ................................................... 305
■■Chapter 6: Analyzing Bike Sharing Trends ........................................................ 307
The Bike Sharing Dataset ............................................................................................. 307
Problem Statement ...................................................................................................... 308
Exploratory Data Analysis ............................................................................................. 308
Preprocessing .....................................................................................................................................308
Distribution and Trends .......................................................................................................................310
Outliers ...............................................................................................................................................312
Correlations ........................................................................................................................................314
Regression Analysis ..................................................................................................... 315
Types of Regression ...........................................................................................................................315
Assumptions .......................................................................................................................................316
Evaluation Criteria ..............................................................................................................................316
Modeling ...................................................................................................................... 317
Linear Regression ...............................................................................................................................319
Decision Tree Based Regression .........................................................................................................323
Next Steps .................................................................................................................... 330
Summary ...................................................................................................................... 330
■■Chapter 7: Analyzing Movie Reviews Sentiment ............................................... 331
Problem Statement ...................................................................................................... 332
Setting Up Dependencies ............................................................................................. 332
Getting the Data ........................................................................................................... 333
Text Pre-Processing and Normalization ....................................................................... 333
Unsupervised Lexicon-Based Models .......................................................................... 336
Bing Liu’s Lexicon ...............................................................................................................................337
MPQA Subjectivity Lexicon .................................................................................................................337
Pattern Lexicon ...................................................................................................................................338
AFINN Lexicon ....................................................................................................................................338
SentiWordNet Lexicon ........................................................................................................................340
VADER Lexicon ....................................................................................................................................342
Classifying Sentiment with Supervised Learning ......................................................... 345
Traditional Supervised Machine Learning Models ....................................................... 346
Newer Supervised Deep Learning Models ................................................................... 349
Advanced Supervised Deep Learning Models .............................................................. 355
Analyzing Sentiment Causation .................................................................................... 363
Interpreting Predictive Models ...........................................................................................................363
Analyzing Topic Models ......................................................................................................................368
Summary ...................................................................................................................... 372
■■Chapter 8: Customer Segmentation and Effective Cross Selling ....................... 373
Online Retail Transactions Dataset ............................................................................... 374
Exploratory Data Analysis ............................................................................................. 374
Customer Segmentation ............................................................................................... 378
Objectives ...........................................................................................................................................378
Strategies ...........................................................................................................................................379
Clustering Strategy .............................................................................................................................380
Cross Selling ................................................................................................................ 392
Market Basket Analysis with Association Rule-Mining ....................................................................... 393
Association Rule-Mining Basics .........................................................................................................394
Association Rule-Mining in Action ......................................................................................................396
Summary ...................................................................................................................... 405
■■Chapter 9: Analyzing Wine Types and Quality ................................................... 407
Problem Statement ...................................................................................................... 407
Setting Up Dependencies ............................................................................................. 408
Getting the Data ........................................................................................................... 408
Exploratory Data Analysis ............................................................................................. 409
Process and Merge Datasets ..............................................................................................................409
Understanding Dataset Features ........................................................................................................410
Descriptive Statistics ..........................................................................................................................413
Inferential Statistics ............................................................................................................................414
Univariate Analysis .............................................................................................................................416
Multivariate Analysis ..........................................................................................................................419
Predictive Modeling ...................................................................................................... 426
Predicting Wine Types .................................................................................................. 427
Predicting Wine Quality ................................................................................................ 433
Summary ...................................................................................................................... 446
■■Chapter 10: Analyzing Music Trends and Recommendations ........................... 447
The Million Song Dataset Taste Profile ......................................................................... 448
Exploratory Data Analysis ............................................................................................. 448
Loading and Trimming Data ................................................................................................................448
Enhancing the Data ............................................................................................................................451
Visual Analysis ....................................................................................................................................452
Recommendation Engines ............................................................................................ 456
Types of Recommendation Engines ....................................................................................................457
Utility of Recommendation Engines ....................................................................................................457
Popularity-Based Recommendation Engine ....................................................................................... 458
Item Similarity Based Recommendation Engine .................................................................................459
Matrix Factorization Based Recommendation Engine ........................................................................ 461
A Note on Recommendation Engine Libraries .............................................................. 466
Summary ...................................................................................................................... 466
■■Chapter 11: Forecasting Stock and Commodity Prices ..................................... 467
Time Series Data and Analysis ..................................................................................... 467
Time Series Components ....................................................................................................................469
Smoothing Techniques .......................................................................................................................471
Forecasting Gold Price ................................................................................................. 474
Problem Statement .............................................................................................................................474
Dataset ...............................................................................................................................................474
Traditional Approaches .......................................................................................................................474
Modeling .............................................................................................................................................476
Stock Price Prediction .................................................................................................. 483
Problem Statement .............................................................................................................................484
Dataset ...............................................................................................................................................484
Recurrent Neural Networks: LSTM .....................................................................................................485
Upcoming Techniques: Prophet ..........................................................................................................495
Summary ...................................................................................................................... 497
■■Chapter 12: Deep Learning for Computer Vision ............................................... 499
Convolutional Neural Networks .................................................................................... 499
Image Classification with CNNs ................................................................................... 501
Problem Statement .............................................................................................................................501
Dataset ...............................................................................................................................................501
CNN Based Deep Learning Classifier from Scratch ............................................................................ 502
CNN Based Deep Learning Classifier with Pretrained Models ............................................................ 505
Artistic Style Transfer with CNNs ................................................................................. 509
Background ........................................................................................................................................510
Preprocessing .....................................................................................................................................511
Loss Functions ....................................................................................................................................513
Custom Optimizer ...............................................................................................................................515
Style Transfer in Action .......................................................................................................................516
Summary ...................................................................................................................... 520
Index ..................................................................................................................... 521



Introduction
Data is the new oil, and Machine Learning is a powerful concept and framework for making the best of it. In this age of automation and intelligent systems, it is hardly a surprise that Machine Learning and Data Science are some of the top buzzwords. The tremendous interest and renewed investments in the field of Data Science across industries, enterprises, and domains are clear indicators of its enormous potential. Intelligent systems and data-driven organizations are becoming a reality, and advancements in tools and techniques are only helping the field expand further. With data being of paramount importance, there has never been a higher demand for Machine Learning and Data Science practitioners than there is now. Indeed, the world is facing a shortage of data scientists. The role has been called “The sexiest job in the 21st Century,” which makes it all the more worthwhile to build some valuable expertise in this domain.
Practical Machine Learning with Python is a problem solver’s guide to building real-world intelligent systems. It follows a comprehensive three-tiered approach packed with concepts, methodologies, hands-on examples, and code. This book helps its readers master the essential skills needed to recognize and solve complex problems with Machine Learning and Deep Learning by following a data-driven mindset. Using real-world case studies that leverage the popular Python Machine Learning ecosystem, this book is your perfect companion for learning the art and science of Machine Learning to become a successful practitioner.
The concepts, techniques, tools, frameworks, and methodologies used in this book will teach you how to think, design, build, and execute Machine Learning systems and projects successfully.
This book will get you started on the ways to leverage the Python Machine Learning ecosystem with its diverse set of frameworks and libraries. The three-tiered approach of this book starts by building a strong foundation around the basics of Machine Learning and relevant tools and frameworks. The next part emphasizes the core processes around building Machine Learning pipelines, and the final part applies this knowledge to real-world case studies from diverse domains, including retail, transportation, movies, music, computer vision, art, and finance. We also cover a wide range of Machine Learning models, including regression, classification, forecasting, rule-mining, and clustering. This book also touches on cutting-edge methodologies and research from the field of Deep Learning, including concepts like transfer learning and case studies relevant to computer vision, such as image classification and neural style transfer. Each chapter consists of detailed concepts with complete hands-on examples, code, and detailed discussions. The main intent of this book is to give a wide range of readers (including IT professionals, analysts, developers, data scientists, engineers, and graduate students) a structured approach to gaining essential skills pertaining to Machine Learning and enough knowledge about leveraging state-of-the-art Machine Learning techniques and frameworks so that they can start solving their own real-world problems.
This book is application-focused, so it’s not a replacement for gaining deep conceptual and theoretical knowledge about Machine Learning algorithms, methods, and their internal implementations. We strongly recommend you supplement the practical knowledge gained through this book with some standard books on data mining, statistical analysis, and theoretical aspects of Machine Learning algorithms and methods to gain deeper insights into the world of Machine Learning.

A Guide For Data Scientists

by Andreas C. Müller & Sarah Guido





Book Details
 Price
 3.00 USD
 Pages
 392 p
 File Size
 32,382 KB
 File Type
 PDF format
 ISBN
 978-1-449-36941-5
 Copyright   
 2017 Sarah Guido, Andreas Müller 

Preface
Machine learning is an integral part of many commercial applications and research
projects today, in areas ranging from medical diagnosis and treatment to finding your
friends on social networks. Many people think that machine learning can only be
applied by large companies with extensive research teams. In this book, we want to
show you how easy it can be to build machine learning solutions yourself, and how to
best go about it. With the knowledge in this book, you can build your own system for
finding out how people feel on Twitter, or making predictions about global warming.
The applications of machine learning are endless and, with the amount of data available
today, mostly limited by your imagination.

Who Should Read This Book
This book is for current and aspiring machine learning practitioners looking to
implement solutions to real-world machine learning problems. This is an introductory
book requiring no previous knowledge of machine learning or artificial intelligence
(AI). We focus on using Python and the scikit-learn library, and work
through all the steps to create a successful machine learning application. The methods
we introduce will be helpful for scientists and researchers, as well as data scientists
working on commercial applications. You will get the most out of the book if you
are somewhat familiar with Python and the NumPy and matplotlib libraries.

We made a conscious effort not to focus too much on the math, but rather on the
practical aspects of using machine learning algorithms. Although mathematics (probability
theory, in particular) is the foundation upon which machine learning is built, we
won’t go into the analysis of the algorithms in great detail. If you are interested in the
mathematics of machine learning algorithms, we recommend the book The Elements
of Statistical Learning (Springer) by Trevor Hastie, Robert Tibshirani, and Jerome
Friedman, which is available for free at the authors’ website. We will also not describe
how to write machine learning algorithms from scratch, and will instead focus on
how to use the large array of models already implemented in scikit-learn and other libraries.

Navigating This Book
This book is organized roughly as follows:
• Chapter 1 introduces the fundamental concepts of machine learning and its
applications, and describes the setup we will be using throughout the book.
• Chapters 2 and 3 describe the actual machine learning algorithms that are most
widely used in practice, and discuss their advantages and shortcomings.
• Chapter 4 discusses the importance of how we represent data that is processed by
machine learning, and what aspects of the data to pay attention to.
• Chapter 5 covers advanced methods for model evaluation and parameter tuning,
with a particular focus on cross-validation and grid search.
• Chapter 6 explains the concept of pipelines for chaining models and encapsulating
your workflow.
• Chapter 7 shows how to apply the methods described in earlier chapters to text
data, and introduces some text-specific processing techniques.
• Chapter 8 offers a high-level overview, and includes references to more advanced topics.
While Chapters 2 and 3 provide the actual algorithms, understanding all of these
algorithms might not be necessary for a beginner. If you need to build a machine
learning system ASAP, we suggest starting with Chapter 1 and the opening sections of
Chapter 2, which introduce all the core concepts. You can then skip to “Summary and
Outlook” on page 127 in Chapter 2, which includes a list of all the supervised models
that we cover. Choose the model that best fits your needs and flip back to read the
section devoted to it for details. Then you can use the techniques in Chapter 5 to evaluate
and tune your model.
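
The quick path described above (pick a model, then evaluate and tune it) can be sketched in a few lines of scikit-learn. This is only an illustrative sketch, not code taken from the book: the choice of the iris dataset and k-nearest neighbors echoes Chapter 1, while the scaling step, the candidate values for n_neighbors, and the 5-fold setting are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset and hold out a test set for the final check.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Chain preprocessing and the model so scaling is refit inside each CV fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate values are an assumption for illustration only.
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9]}

# Grid search with 5-fold cross-validation on the training data only.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(X_test, y_test)
print("best n_neighbors:", best_k)
print("held-out accuracy:", round(test_accuracy, 3))
```

Keeping the scaler inside the pipeline means the test data and the validation folds never influence the preprocessing, which is the reason pipelines (Chapter 6) and grid search (Chapter 5) are usually combined.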

Table of Contents
Preface vii
1. Introduction 1
Why Machine Learning? 1
Problems Machine Learning Can Solve 2
Knowing Your Task and Knowing Your Data 4
Why Python? 5
scikit-learn 5
Installing scikit-learn 6
Essential Libraries and Tools 7
Jupyter Notebook 7
NumPy 7
SciPy 8
matplotlib 9
pandas 10
mglearn 11
Python 2 Versus Python 3 12
Versions Used in this Book 12
A First Application: Classifying Iris Species 13
Meet the Data 14
Measuring Success: Training and Testing Data 17
First Things First: Look at Your Data 19
Building Your First Model: k-Nearest Neighbors 20
Making Predictions 22
Evaluating the Model 22
Summary and Outlook 23
2. Supervised Learning 25
Classification and Regression 25
Generalization, Overfitting, and Underfitting 26
Relation of Model Complexity to Dataset Size 29
Supervised Machine Learning Algorithms 29
Some Sample Datasets 30
k-Nearest Neighbors 35
Linear Models 45
Naive Bayes Classifiers 68
Decision Trees 70
Ensembles of Decision Trees 83
Kernelized Support Vector Machines 92
Neural Networks (Deep Learning) 104
Uncertainty Estimates from Classifiers 119
The Decision Function 120
Predicting Probabilities 122
Uncertainty in Multiclass Classification 124
Summary and Outlook 127
3. Unsupervised Learning and Preprocessing 131
Types of Unsupervised Learning 131
Challenges in Unsupervised Learning 132
Preprocessing and Scaling 132
Different Kinds of Preprocessing 133
Applying Data Transformations 134
Scaling Training and Test Data the Same Way 136
The Effect of Preprocessing on Supervised Learning 138
Dimensionality Reduction, Feature Extraction, and Manifold Learning 140
Principal Component Analysis (PCA) 140
Non-Negative Matrix Factorization (NMF) 156
Manifold Learning with t-SNE 163
Clustering 168
k-Means Clustering 168
Agglomerative Clustering 182
DBSCAN 187
Comparing and Evaluating Clustering Algorithms 191
Summary of Clustering Methods 207
Summary and Outlook 208
4. Representing Data and Engineering Features 211
Categorical Variables 212
One-Hot-Encoding (Dummy Variables) 213
Numbers Can Encode Categoricals 218
Binning, Discretization, Linear Models, and Trees 220
Interactions and Polynomials 224
Univariate Nonlinear Transformations 232
Automatic Feature Selection 236
Univariate Statistics 236
Model-Based Feature Selection 238
Iterative Feature Selection 240
Utilizing Expert Knowledge 242
Summary and Outlook 250
5. Model Evaluation and Improvement 251
Cross-Validation 252
Cross-Validation in scikit-learn 253
Benefits of Cross-Validation 254
Stratified k-Fold Cross-Validation and Other Strategies 254
Grid Search 260
Simple Grid Search 261
The Danger of Overfitting the Parameters and the Validation Set 261
Grid Search with Cross-Validation 263
Evaluation Metrics and Scoring 275
Keep the End Goal in Mind 275
Metrics for Binary Classification 276
Metrics for Multiclass Classification 296
Regression Metrics 299
Using Evaluation Metrics in Model Selection 300
Summary and Outlook 302
6. Algorithm Chains and Pipelines 305
Parameter Selection with Preprocessing 306
Building Pipelines 308
Using Pipelines in Grid Searches 309
The General Pipeline Interface 312
Convenient Pipeline Creation with make_pipeline 313
Accessing Step Attributes 314
Accessing Attributes in a Grid-Searched Pipeline 315
Grid-Searching Preprocessing Steps and Model Parameters 317
Grid-Searching Which Model To Use 319
Summary and Outlook 320
7. Working with Text Data 323
Types of Data Represented as Strings 323
Example Application: Sentiment Analysis of Movie Reviews 325
Representing Text Data as a Bag of Words 327
Applying Bag-of-Words to a Toy Dataset 329
Bag-of-Words for Movie Reviews 330
Stopwords 334
Rescaling the Data with tf–idf 336
Investigating Model Coefficients 338
Bag-of-Words with More Than One Word (n-Grams) 339
Advanced Tokenization, Stemming, and Lemmatization 344
Topic Modeling and Document Clustering 347
Latent Dirichlet Allocation 348
Summary and Outlook 355
8. Wrapping Up 357
Approaching a Machine Learning Problem 357
Humans in the Loop 358
From Prototype to Production 359
Testing Production Systems 359
Building Your Own Estimator 360
Where to Go from Here 361
Theory 361
Other Machine Learning Frameworks and Packages 362
Ranking, Recommender Systems, and Other Kinds of Learning 363
Probabilistic Modeling, Inference, and Probabilistic Programming 363
Neural Networks 364
Scaling to Larger Datasets 364
Honing Your Skills 365
Conclusion 366
Index 367



Introduction
There are many books on machine learning and AI. However, all of them are meant
for graduate students or PhD students in computer science, and they’re full of
advanced mathematics. This is in stark contrast with how machine learning is being
used, as a commodity tool in research and commercial applications. Today, applying
machine learning does not require a PhD. However, there are few resources out there
that fully cover all the important aspects of implementing machine learning in practice,
without requiring you to take advanced math courses. We hope this book will
help people who want to apply machine learning without reading up on years’ worth
of calculus, linear algebra, and probability theory.