Table of Contents
About the Author ...................................................................................................xvii
About the Technical Reviewer ................................................................................xix
Chapter 1: An Introduction to Data Analysis ........................................................... 1
Data Analysis ................................................................................................................................ 1
Knowledge Domains of the Data Analyst ...................................................................................... 3
Computer Science ................................................................................................................... 3
Mathematics and Statistics ..................................................................................................... 4
Machine Learning and Artificial Intelligence ........................................................................... 5
Professional Fields of Application ........................................................................................... 5
Understanding the Nature of the Data .......................................................................................... 5
When the Data Become Information ........................................................................................ 6
When the Information Becomes Knowledge ........................................................................... 6
Types of Data ........................................................................................................................... 6
The Data Analysis Process ............................................................................................................ 6
Problem Definition ................................................................................................................... 8
Data Extraction ........................................................................................................................ 9
Data Preparation .................................................................................................................... 10
Data Exploration/Visualization ............................................................................................... 10
Predictive Modeling ............................................................................................................... 12
Model Validation .................................................................................................................... 13
Deployment ........................................................................................................................... 13
Quantitative and Qualitative Data Analysis ................................................................................. 14
Open Data .................................................................................................................................. 15
Python and Data Analysis ............................................................................................................ 17
Conclusions ................................................................................................................................. 17
Chapter 2: Introduction to the Python World ......................................................... 19
Python—The Programming Language ........................................................................................ 19
Python—The Interpreter ....................................................................................................... 21
Python 2 and Python 3 ................................................................................................................ 23
Installing Python .................................................................................................................... 23
Python Distributions .............................................................................................................. 24
Using Python .......................................................................................................................... 26
Writing Python Code .............................................................................................................. 28
IPython ................................................................................................................................... 35
PyPI—The Python Package Index ............................................................................................... 39
The IDEs for Python ............................................................................................................... 40
SciPy ........................................................................................................................................... 46
NumPy ................................................................................................................................... 47
Pandas ................................................................................................................................... 47
matplotlib .............................................................................................................................. 48
Conclusions ................................................................................................................................. 48
Chapter 3: The NumPy Library ................................................................................ 49
NumPy: A Little History ............................................................................................................... 49
The NumPy Installation ............................................................................................................... 50
Ndarray: The Heart of the Library ................................................................................................ 50
Create an Array ...................................................................................................................... 52
Types of Data ......................................................................................................................... 53
The dtype Option ................................................................................................................... 54
Intrinsic Creation of an Array ................................................................................................. 55
Basic Operations ......................................................................................................................... 57
Arithmetic Operators ............................................................................................................. 57
The Matrix Product ................................................................................................................ 59
Increment and Decrement Operators .................................................................................... 60
Universal Functions (ufunc) ................................................................................................... 61
Aggregate Functions ............................................................................................................. 62
Indexing, Slicing, and Iterating .................................................................................................... 62
Indexing ................................................................................................................................. 63
Slicing .................................................................................................................................... 65
Iterating an Array ................................................................................................................... 67
Conditions and Boolean Arrays ................................................................................................... 69
Shape Manipulation .................................................................................................................... 70
Array Manipulation ...................................................................................................................... 71
Joining Arrays ........................................................................................................................ 71
Splitting Arrays ...................................................................................................................... 72
General Concepts ........................................................................................................................ 74
Copies or Views of Objects .................................................................................................... 75
Vectorization .......................................................................................................................... 76
Broadcasting ......................................................................................................................... 76
Structured Arrays ........................................................................................................................ 79
Reading and Writing Array Data on Files ..................................................................................... 82
Loading and Saving Data in Binary Files ............................................................................... 82
Reading Files with Tabular Data ............................................................................................ 83
Conclusions ................................................................................................................................. 84
Chapter 4: The pandas Library—An Introduction ................................................... 87
pandas: The Python Data Analysis Library .................................................................................. 87
Installation of pandas .................................................................................................................. 88
Installation from Anaconda .................................................................................................... 88
Installation from PyPI ............................................................................................................ 89
Installation on Linux .............................................................................................................. 90
Installation from Source ........................................................................................................ 90
A Module Repository for Windows ......................................................................................... 90
Testing Your pandas Installation ................................................................................................. 91
Getting Started with pandas ....................................................................................................... 92
Introduction to pandas Data Structures ...................................................................................... 92
The Series .............................................................................................................................. 93
The DataFrame .................................................................................................................... 102
The Index Objects ................................................................................................................ 112
Other Functionalities on Indexes ............................................................................................... 114
Reindexing ........................................................................................................................... 114
Dropping ............................................................................................................................. 117
Arithmetic and Data Alignment ............................................................................................ 118
Operations Between Data Structures ........................................................................................ 120
Flexible Arithmetic Methods ................................................................................................ 120
Operations Between DataFrame and Series ........................................................................ 121
Function Application and Mapping ............................................................................................ 122
Functions by Element .......................................................................................................... 123
Functions by Row or Column ............................................................................................... 123
Statistics Functions ............................................................................................................. 125
Sorting and Ranking ................................................................................................................. 126
Correlation and Covariance ....................................................................................................... 129
“Not a Number” Data ................................................................................................................ 131
Assigning a NaN Value ......................................................................................................... 131
Filtering Out NaN Values ...................................................................................................... 132
Filling in NaN Occurrences .................................................................................................. 133
Hierarchical Indexing and Leveling ........................................................................................... 134
Reordering and Sorting Levels ............................................................................................ 137
Summary Statistic by Level ................................................................................................. 138
Conclusions ............................................................................................................................... 139
Chapter 5: pandas: Reading and Writing Data ...................................................... 141
I/O API Tools .............................................................................................................................. 141
CSV and Textual Files ................................................................................................................ 142
Reading Data in CSV or Text Files ............................................................................................. 143
Using RegExp to Parse TXT Files ......................................................................................... 146
Reading TXT Files into Parts ................................................................................................ 148
Writing Data in CSV ............................................................................................................. 150
Reading and Writing HTML Files ............................................................................................... 152
Writing Data in HTML ........................................................................................................... 153
Reading Data from an HTML File ......................................................................................... 155
Reading Data from XML ............................................................................................................ 157
Reading and Writing Data on Microsoft Excel Files .................................................................. 159
JSON Data ................................................................................................................................. 162
The Format HDF5 ...................................................................................................................... 166
Pickle—Python Object Serialization ......................................................................................... 168
Serialize a Python Object with cPickle ................................................................................ 168
Pickling with pandas ........................................................................................................... 169
Interacting with Databases ....................................................................................................... 170
Loading and Writing Data with SQLite3 ............................................................................... 171
Loading and Writing Data with PostgreSQL ......................................................................... 174
Reading and Writing Data with a NoSQL Database: MongoDB .................................................. 178
Conclusions ............................................................................................................................... 180
Chapter 6: pandas in Depth: Data Manipulation ................................................... 181
Data Preparation ....................................................................................................................... 181
Merging ............................................................................................................................... 182
Concatenating ........................................................................................................................... 188
Combining ........................................................................................................................... 191
Pivoting ................................................................................................................................ 193
Removing ............................................................................................................................. 196
Data Transformation .................................................................................................................. 197
Removing Duplicates ........................................................................................................... 198
Mapping ............................................................................................................................... 199
Discretization and Binning ........................................................................................................ 204
Detecting and Filtering Outliers ........................................................................................... 209
Permutation .............................................................................................................................. 210
Random Sampling ............................................................................................................... 211
String Manipulation ................................................................................................................... 212
Built-in Methods for String Manipulation ............................................................................ 212
Regular Expressions ............................................................................................................ 214
Data Aggregation ...................................................................................................................... 217
GroupBy ............................................................................................................................... 218
A Practical Example ............................................................................................................. 219
Hierarchical Grouping .......................................................................................................... 220
Group Iteration .......................................................................................................................... 222
Chain of Transformations ..................................................................................................... 222
Functions on Groups ............................................................................................................ 224
Advanced Data Aggregation ...................................................................................................... 225
Conclusions ............................................................................................................................... 229
Chapter 7: Data Visualization with matplotlib ...................................................... 231
The matplotlib Library ............................................................................................................... 231
Installation ................................................................................................................................ 233
The IPython and IPython QtConsole .......................................................................................... 233
The matplotlib Architecture ....................................................................................................... 235
Backend Layer ..................................................................................................................... 236
Artist Layer .......................................................................................................................... 236
Scripting Layer (pyplot) ....................................................................................................... 238
pylab and pyplot .................................................................................................................. 238
pyplot ........................................................................................................................................ 239
A Simple Interactive Chart ................................................................................................... 239
The Plotting Window ................................................................................................................. 241
Set the Properties of the Plot .............................................................................................. 243
matplotlib and NumPy ......................................................................................................... 246
Using the kwargs ...................................................................................................................... 248
Working with Multiple Figures and Axes ............................................................................. 249
Adding Elements to the Chart ................................................................................................... 251
Adding Text .......................................................................................................................... 251
Adding a Grid ....................................................................................................................... 256
Adding a Legend .................................................................................................................. 257
Saving Your Charts .................................................................................................................... 260
Saving the Code ................................................................................................................... 260
Converting Your Session to an HTML File ............................................................................ 262
Saving Your Chart Directly as an Image ............................................................................... 264
Handling Date Values ............................................................................................................... 264
Chart Typology ........................................................................................................................... 267
Line Charts ................................................................................................................................ 267
Line Charts with pandas ...................................................................................................... 276
Histograms ................................................................................................................................ 277
Bar Charts ................................................................................................................................. 278
Horizontal Bar Charts .......................................................................................................... 281
Multiseries Bar Charts .......................................................................................................... 282
Multiseries Bar Charts with pandas DataFrame ................................................................... 285
Multiseries Stacked Bar Charts ........................................................................................... 286
Stacked Bar Charts with a pandas DataFrame .................................................................... 290
Other Bar Chart Representations ......................................................................................... 291
Pie Charts .................................................................................................................................. 292
Pie Charts with a pandas DataFrame ................................................................................... 296
Advanced Charts ....................................................................................................................... 297
Contour Plots ....................................................................................................................... 297
Polar Charts ......................................................................................................................... 299
The mplot3d Toolkit ................................................................................................................... 302
3D Surfaces ......................................................................................................................... 302
Scatter Plots in 3D ............................................................................................................... 304
Bar Charts in 3D .................................................................................................................. 306
Multi-Panel Plots ....................................................................................................................... 307
Display Subplots Within Other Subplots .............................................................................. 307
Grids of Subplots ................................................................................................................. 309
Conclusions ............................................................................................................................... 312
Chapter 8: Machine Learning with scikit-learn .................................................... 313
The scikit-learn Library ............................................................................................................. 313
Machine Learning ..................................................................................................................... 313
Supervised and Unsupervised Learning .............................................................................. 314
Training Set and Testing Set ................................................................................................ 315
Supervised Learning with scikit-learn ...................................................................................... 315
The Iris Flower Dataset ............................................................................................................. 316
The PCA Decomposition ...................................................................................................... 320
K-Nearest Neighbors Classifier ................................................................................................. 322
Diabetes Dataset ....................................................................................................................... 327
Linear Regression: The Least Squares Regression .................................................................... 328
Support Vector Machines (SVMs) .............................................................................................. 334
Support Vector Classification (SVC) ..................................................................................... 334
Nonlinear SVC ...................................................................................................................... 339
Plotting Different SVM Classifiers Using the Iris Dataset .................................................... 342
Support Vector Regression (SVR) ........................................................................................ 345
Conclusions ............................................................................................................................... 347
Chapter 9: Deep Learning with TensorFlow .......................................................... 349
Artificial Intelligence, Machine Learning, and Deep Learning ................................................... 349
Artificial Intelligence ............................................................................................................ 350
Machine Learning Is a Branch of Artificial Intelligence ....................................................... 351
Deep Learning Is a Branch of Machine Learning ................................................................. 351
The Relationship Between Artificial Intelligence, Machine Learning, and Deep Learning ... 351
Deep Learning ........................................................................................................................... 352
Neural Networks and GPUs ................................................................................................. 352
Data Availability: Open Data Source, Internet of Things, and Big Data ................................ 353
Python ................................................................................................................................. 354
Deep Learning Python Frameworks .................................................................................... 354
Artificial Neural Networks ......................................................................................................... 355
How Artificial Neural Networks Are Structured ................................................................... 355
Single Layer Perceptron (SLP) ............................................................................................. 357
Multi Layer Perceptron (MLP) .............................................................................................. 360
Correspondence Between Artificial and Biological Neural Networks .................................. 361
TensorFlow ................................................................................................................................ 362
TensorFlow: Google’s Framework ........................................................................................ 362
TensorFlow: Data Flow Graph .............................................................................................. 362
Start Programming with TensorFlow ......................................................................................... 363
Installing TensorFlow ........................................................................................................... 363
Programming with the IPython QtConsole ........................................................................... 364
The Model and Sessions in TensorFlow ............................................................................... 364
Tensors ................................................................................................................................ 366
Operation on Tensors ........................................................................................................... 370
Single Layer Perceptron with TensorFlow ................................................................................. 371
Before Starting .................................................................................................................... 372
Data to Be Analyzed ............................................................................................................ 372
The SLP Model Definition .................................................................................................... 374
Learning Phase .................................................................................................................... 378
Test Phase and Accuracy Calculation .................................................................................. 383
Multi Layer Perceptron (with One Hidden Layer) with TensorFlow ........................................... 386
The MLP Model Definition .................................................................................................... 387
Learning Phase .................................................................................................................... 389
Test Phase and Accuracy Calculation .................................................................................. 395
Multi Layer Perceptron (with Two Hidden Layers) with TensorFlow .......................................... 397
Test Phase and Accuracy Calculation .................................................................................. 402
Evaluation of Experimental Data ......................................................................................... 404
Conclusions ............................................................................................................................... 407
Chapter 10: An Example—Meteorological Data .................................................. 409
A Hypothesis to Be Tested: The Influence of the Proximity of the Sea ...................................... 409
The System in the Study: The Adriatic Sea and the Po Valley ............................................. 410
Finding the Data Source ............................................................................................................ 414
Data Analysis on Jupyter Notebook .......................................................................................... 415
Analysis of Processed Meteorological Data .............................................................................. 421
The RoseWind ........................................................................................................................... 436
Calculating the Mean Distribution of the Wind Speed ......................................................... 441
Conclusions ............................................................................................................................... 443
Chapter 11: Embedding the JavaScript D3 Library in the IPython Notebook ....... 445
The Open Data Source for Demographics ................................................................................. 445
The JavaScript D3 Library ......................................................................................................... 449
Drawing a Clustered Bar Chart ................................................................................................. 454
The Choropleth Maps ................................................................................................................ 459
The Choropleth Map of the U.S. Population in 2014 .................................................................. 464
Conclusions ............................................................................................................................... 471
Chapter 12: Recognizing Handwritten Digits ........................................................ 473
Handwriting Recognition ........................................................................................................... 473
Recognizing Handwritten Digits with scikit-learn ..................................................................... 474
The Digits Dataset ..................................................................................................................... 475
Learning and Predicting ............................................................................................................ 478
Recognizing Handwritten Digits with TensorFlow ..................................................................... 480
Learning and Predicting ............................................................................................................ 482
Conclusions ............................................................................................................................... 486
Chapter 13: Textual Data Analysis with NLTK ....................................................... 487
Text Analysis Techniques .......................................................................................................... 487
The Natural Language Toolkit (NLTK) ................................................................................... 488
Import the NLTK Library and the NLTK Downloader Tool ..................................................... 489
Search for a Word with NLTK ............................................................................................... 493
Analyze the Frequency of Words ......................................................................................... 494
Selection of Words from Text ............................................................................................... 497
Bigrams and Collocations .................................................................................................... 498
Use Text on the Network ........................................................................................................... 500
Extract the Text from the HTML Pages ................................................................................ 501
Sentiment Analysis ........................................................................................................... 502
Conclusions ............................................................................................................................... 506
Chapter 14: Image Analysis and Computer Vision with OpenCV .......................... 507
Image Analysis and Computer Vision ........................................................................................ 507
OpenCV and Python ................................................................................................................... 508
OpenCV and Deep Learning ...................................................................................................... 509
Installing OpenCV ...................................................................................................................... 509
First Approaches to Image Processing and Analysis ................................................................ 509
Before Starting .................................................................................................................... 510
Load and Display an Image ................................................................................................. 510
Working with Images ........................................................................................................... 512
Save the New Image ........................................................................................................... 514
Elementary Operations on Images ...................................................................................... 514
Image Blending .................................................................................................................... 520
Image Analysis .......................................................................................................................... 521
Edge Detection and Image Gradient Analysis ........................................................................... 522
Edge Detection .................................................................................................................... 522
The Image Gradient Theory ................................................................................................. 523
A Practical Example of Edge Detection with the Image Gradient Analysis .......................... 525
A Deep Learning Example: The Face Detection ......................................................................... 532
Conclusions ............................................................................................................................... 535
Appendix A: Writing Mathematical Expressions with LaTeX ................................ 537
With matplotlib .......................................................................................................................... 537
With IPython Notebook in a Markdown Cell .............................................................................. 537
With IPython Notebook in a Python 2 Cell ................................................................................. 538
Subscripts and Superscripts ..................................................................................................... 538
Fractions, Binomials, and Stacked Numbers ............................................................................ 538
Radicals .................................................................................................................................... 539
Fonts ......................................................................................................................................... 539
Accents ..................................................................................................................................... 540
Appendix B: Open Data Sources ........................................................................... 549
Political and Government Data .................................................................................................. 549
Health Data ............................................................................................................................... 550
Social Data ................................................................................................................................ 550
Miscellaneous and Public Data Sets ......................................................................................... 551
Financial Data ........................................................................................................................... 552
Climatic Data ............................................................................................................................. 552
Sports Data ............................................................................................................................... 553
Publications, Newspapers, and Books ...................................................................................... 553
Musical Data ............................................................................................................................. 553
Index ..................................................................................................................... 555