Python Data Analytics: With Pandas, NumPy, and Matplotlib

Fabio Nelli


Book Details
Price: 3.00
Pages: 576
File Size: 14,283 KB
File Type: PDF
ISBN: 978-1-4842-3912-4 (pbk), 978-1-4842-3913-1 (electronic)
Copyright © 2018 by Fabio Nelli

About the Author
Fabio Nelli is a data scientist and Python consultant, designing and developing Python
applications for data analysis and visualization. He has experience with the scientific
world, having performed various data analysis roles in pharmaceutical chemistry for
private research companies and universities. He has been a computer consultant for
many years at IBM, EDS, and Hewlett-Packard, along with several banks and insurance
companies. He has an organic chemistry master’s degree and a bachelor’s degree in
information technologies and automation systems, with many years of experience in
life sciences (as a Tech Specialist at Beckman Coulter, Tecan, Sciex).
For further info and other examples, 

About the Technical Reviewer
Raul Samayoa is a senior software developer and machine
learning specialist with many years of experience in the
financial industry. An MSc graduate from the Georgia
Institute of Technology, he's never met a neural network or
dataset he did not like. He's fond of evangelizing the use of
DevOps tools for data science and software development.
Raul enjoys the energy of his hometown of Toronto,
Canada, where he runs marathons, volunteers as a
technology instructor with the University of Toronto
coders, and likes to work with data in Python and R.

Table of Contents
About the Author
About the Technical Reviewer
Chapter 1: An Introduction to Data Analysis
  Data Analysis
  Knowledge Domains of the Data Analyst
    Computer Science
    Mathematics and Statistics
    Machine Learning and Artificial Intelligence
    Professional Fields of Application
  Understanding the Nature of the Data
    When the Data Become Information
    When the Information Becomes Knowledge
    Types of Data
  The Data Analysis Process
    Problem Definition
    Data Extraction
    Data Preparation
    Data Exploration/Visualization
    Predictive Modeling
    Model Validation
    Deployment
  Quantitative and Qualitative Data Analysis
  Open Data
  Python and Data Analysis
  Conclusions
Chapter 2: Introduction to the Python World
  Python—The Programming Language
    Python—The Interpreter
  Python 2 and Python 3
    Installing Python
    Python Distributions
    Using Python
    Writing Python Code
    IPython
  PyPI—The Python Package Index
    The IDEs for Python
  SciPy
    NumPy
    Pandas
    matplotlib
  Conclusions
Chapter 3: The NumPy Library
  NumPy: A Little History
  The NumPy Installation
  Ndarray: The Heart of the Library
    Create an Array
    Types of Data
    The dtype Option
    Intrinsic Creation of an Array
  Basic Operations
    Arithmetic Operators
    The Matrix Product
    Increment and Decrement Operators
    Universal Functions (ufunc)
    Aggregate Functions
  Indexing, Slicing, and Iterating
    Indexing
    Slicing
    Iterating an Array
  Conditions and Boolean Arrays
  Shape Manipulation
  Array Manipulation
    Joining Arrays
    Splitting Arrays
  General Concepts
    Copies or Views of Objects
    Vectorization
    Broadcasting
  Structured Arrays
  Reading and Writing Array Data on Files
    Loading and Saving Data in Binary Files
    Reading Files with Tabular Data
  Conclusions
Chapter 4: The pandas Library—An Introduction
  pandas: The Python Data Analysis Library
  Installation of pandas
    Installation from Anaconda
    Installation from PyPI
    Installation on Linux
    Installation from Source
    A Module Repository for Windows
  Testing Your pandas Installation
  Getting Started with pandas
  Introduction to pandas Data Structures
    The Series
    The DataFrame
    The Index Objects
  Other Functionalities on Indexes
    Reindexing
    Dropping
    Arithmetic and Data Alignment
  Operations Between Data Structures
    Flexible Arithmetic Methods
    Operations Between DataFrame and Series
  Function Application and Mapping
    Functions by Element
    Functions by Row or Column
    Statistics Functions
  Sorting and Ranking
  Correlation and Covariance
  “Not a Number” Data
    Assigning a NaN Value
    Filtering Out NaN Values
    Filling in NaN Occurrences
  Hierarchical Indexing and Leveling
    Reordering and Sorting Levels
    Summary Statistic by Level
  Conclusions
Chapter 5: pandas: Reading and Writing Data
  I/O API Tools
  CSV and Textual Files
  Reading Data in CSV or Text Files
    Using RegExp to Parse TXT Files
    Reading TXT Files Into Parts
    Writing Data in CSV
  Reading and Writing HTML Files
    Writing Data in HTML
    Reading Data from an HTML File
  Reading Data from XML
  Reading and Writing Data on Microsoft Excel Files
  JSON Data
  The Format HDF5
  Pickle—Python Object Serialization
    Serialize a Python Object with cPickle
    Pickling with pandas
  Interacting with Databases
    Loading and Writing Data with SQLite3
    Loading and Writing Data with PostgreSQL
  Reading and Writing Data with a NoSQL Database: MongoDB
  Conclusions
Chapter 6: pandas in Depth: Data Manipulation
  Data Preparation
    Merging
  Concatenating
    Combining
    Pivoting
    Removing
  Data Transformation
    Removing Duplicates
    Mapping
  Discretization and Binning
    Detecting and Filtering Outliers
  Permutation
    Random Sampling
  String Manipulation
    Built-in Methods for String Manipulation
    Regular Expressions
  Data Aggregation
    GroupBy
    A Practical Example
    Hierarchical Grouping
  Group Iteration
    Chain of Transformations
    Functions on Groups
  Advanced Data Aggregation
  Conclusions
Chapter 7: Data Visualization with matplotlib
  The matplotlib Library
  Installation
  The IPython and IPython QtConsole
  The matplotlib Architecture
    Backend Layer
    Artist Layer
    Scripting Layer (pyplot)
    pylab and pyplot
  pyplot
    A Simple Interactive Chart
  The Plotting Window
    Set the Properties of the Plot
    matplotlib and NumPy
  Using the kwargs
    Working with Multiple Figures and Axes
  Adding Elements to the Chart
    Adding Text
    Adding a Grid
    Adding a Legend
  Saving Your Charts
    Saving the Code
    Converting Your Session to an HTML File
    Saving Your Chart Directly as an Image
  Handling Date Values
  Chart Typology
  Line Charts
    Line Charts with pandas
  Histograms
  Bar Charts
    Horizontal Bar Charts
    Multiserial Bar Charts
    Multiseries Bar Charts with pandas Dataframe
    Multiseries Stacked Bar Charts
    Stacked Bar Charts with a pandas Dataframe
    Other Bar Chart Representations
  Pie Charts
    Pie Charts with a pandas Dataframe
  Advanced Charts
    Contour Plots
    Polar Charts
  The mplot3d Toolkit
    3D Surfaces
    Scatter Plots in 3D
    Bar Charts in 3D
  Multi-Panel Plots
    Display Subplots Within Other Subplots
    Grids of Subplots
  Conclusions
Chapter 8: Machine Learning with scikit-learn
  The scikit-learn Library
  Machine Learning
    Supervised and Unsupervised Learning
    Training Set and Testing Set
  Supervised Learning with scikit-learn
  The Iris Flower Dataset
    The PCA Decomposition
  K-Nearest Neighbors Classifier
  Diabetes Dataset
  Linear Regression: The Least Square Regression
  Support Vector Machines (SVMs)
    Support Vector Classification (SVC)
    Nonlinear SVC
    Plotting Different SVM Classifiers Using the Iris Dataset
    Support Vector Regression (SVR)
  Conclusions
Chapter 9: Deep Learning with TensorFlow
  Artificial Intelligence, Machine Learning, and Deep Learning
    Artificial intelligence
    Machine Learning Is a Branch of Artificial Intelligence
    Deep Learning Is a Branch of Machine Learning
    The Relationship Between Artificial Intelligence, Machine Learning, and Deep Learning
  Deep Learning
    Neural Networks and GPUs
    Data Availability: Open Data Source, Internet of Things, and Big Data
    Python
    Deep Learning Python Frameworks
  Artificial Neural Networks
    How Artificial Neural Networks Are Structured
    Single Layer Perceptron (SLP)
    Multi Layer Perceptron (MLP)
    Correspondence Between Artificial and Biological Neural Networks
  TensorFlow
    TensorFlow: Google’s Framework
    TensorFlow: Data Flow Graph
  Start Programming with TensorFlow
    Installing TensorFlow
    Programming with the IPython QtConsole
    The Model and Sessions in TensorFlow
Tensors ................................................................................................................................ 366
Operation on Tensors ........................................................................................................... 370
Single Layer Perceptron with TensorFlow ................................................................................. 371
Before Starting .................................................................................................................... 372
Data To Be Analyzed ............................................................................................................ 372
The SLP Model Definition .................................................................................................... 374
Learning Phase .................................................................................................................... 378
Test Phase and Accuracy Calculation .................................................................................. 383
Multi Layer Perceptron (with One Hidden Layer) with TensorFlow ........................................... 386
The MLP Model Definition .................................................................................................... 387
Learning Phase .................................................................................................................... 389
Test Phase and Accuracy Calculation .................................................................................. 395
Multi Layer Perceptron (with Two Hidden Layers) with TensorFlow .......................................... 397
Test Phase and Accuracy Calculation .................................................................................. 402
Evaluation of Experimental Data ......................................................................................... 404
Conclusions ............................................................................................................................... 407
Chapter 10: An Example— Meteorological Data .................................................. 409
A Hypothesis to Be Tested: The Influence of the Proximity of the Sea ...................................... 409
The System in the Study: The Adriatic Sea and the Po Valley ............................................. 410
Finding the Data Source ............................................................................................................ 414
Data Analysis on Jupyter Notebook .......................................................................................... 415
Analysis of Processed Meteorological Data .............................................................................. 421
The RoseWind ........................................................................................................................... 436
Calculating the Mean Distribution of the Wind Speed ......................................................... 441
Conclusions ............................................................................................................................... 443
Chapter 11: Embedding the JavaScript D3 Library in the IPython Notebook ....... 445
The Open Data Source for Demographics ................................................................................. 445
The JavaScript D3 Library ......................................................................................................... 449
Drawing a Clustered Bar Chart ................................................................................................. 454
The Choropleth Maps ................................................................................................................ 459
The Choropleth Map of the U.S. Population in 2014 .................................................................. 464
Conclusions ............................................................................................................................... 471
Chapter 12: Recognizing Handwritten Digits ........................................................ 473
Handwriting Recognition ........................................................................................................... 473
Recognizing Handwritten Digits with scikit-learn ..................................................................... 474
The Digits Dataset ..................................................................................................................... 475
Learning and Predicting ............................................................................................................ 478
Recognizing Handwritten Digits with TensorFlow ..................................................................... 480
Learning and Predicting ............................................................................................................ 482
Conclusions ............................................................................................................................... 486
Chapter 13: Textual Data Analysis with NLTK ....................................................... 487
Text Analysis Techniques .......................................................................................................... 487
The Natural Language Toolkit (NLTK) ................................................................................... 488
Import the NLTK Library and the NLTK Downloader Tool ..................................................... 489
Search for a Word with NLTK ............................................................................................... 493
Analyze the Frequency of Words ......................................................................................... 494
Selection of Words from Text ............................................................................................... 497
Bigrams and Collocations .................................................................................................... 498
Use Text on the Network ........................................................................................................... 500
Extract the Text from the HTML Pages ................................................................................ 501
Sentimental Analysis ........................................................................................................... 502
Conclusions ............................................................................................................................... 506
Chapter 14: Image Analysis and Computer Vision with OpenCV .......................... 507
Image Analysis and Computer Vision ........................................................................................ 507
OpenCV and Python ................................................................................................................... 508
OpenCV and Deep Learning ...................................................................................................... 509
Installing OpenCV ...................................................................................................................... 509
First Approaches to Image Processing and Analysis ................................................................ 509
Before Starting .................................................................................................................... 510
Load and Display an Image ................................................................................................. 510
Working with Images ........................................................................................................... 512
Save the New Image ........................................................................................................... 514
Elementary Operations on Images ...................................................................................... 514
Image Blending .................................................................................................................... 520
Image Analysis .......................................................................................................................... 521
Edge Detection and Image Gradient Analysis ........................................................................... 522
Edge Detection .................................................................................................................... 522
The Image Gradient Theory ................................................................................................. 523
A Practical Example of Edge Detection with the Image Gradient Analysis .......................... 525
A Deep Learning Example: The Face Detection ......................................................................... 532
Conclusions ............................................................................................................................... 535
Appendix A: Writing Mathematical Expressions with LaTeX ................................ 537
With matplotlib .......................................................................................................................... 537
With IPython Notebook in a Markdown Cell .............................................................................. 537
With IPython Notebook in a Python 2 Cell ................................................................................. 538
Subscripts and Superscripts ..................................................................................................... 538
Fractions, Binomials, and Stacked Numbers ............................................................................ 538
Radicals .................................................................................................................................... 539
Fonts ......................................................................................................................................... 539
Accents ..................................................................................................................................... 540
Appendix B: Open Data Sources ........................................................................... 549
Political and Government Data .................................................................................................. 549
Health Data ............................................................................................................................... 550
Social Data ................................................................................................................................ 550
Miscellaneous and Public Data Sets ......................................................................................... 551
Financial Data ........................................................................................................................... 552
Climatic Data ............................................................................................................................. 552
Sports Data ............................................................................................................................... 553
Publications, Newspapers, and Books ...................................................................................... 553
Musical Data ............................................................................................................................. 553
Index ..................................................................................................................... 555



Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Todd Green
Development Editor: James Markham
Coordinating Editor: Jill Balzano
Cover image designed by Freepik (www.freepik.com)

by Luca Massaron and John Paul Mueller





Book Details
 Price
 3.00
 Pages
 435 p
 File Size 
 9,952 KB
 File Type
 PDF format
 ISBN
 978-1-118-84418-2
 978-1-118-84398-7 (ebk)
 978-1-118-84414-4 (ePDF)
 Copyright©   
 2015 by John Wiley & Sons, Inc.

Introduction
You rely on data science absolutely every day to perform an amazing
array of tasks or to obtain services from someone else. In fact, you’ve
probably used data science in ways that you never expected. For example,
when you used your favorite search engine this morning to look for something,
it suggested alternative search terms. Those terms are
supplied by data science. When you went to the doctor last week and
discovered the lump you found wasn’t cancer, it’s likely the doctor made the
diagnosis with the help of data science. In fact, you might work with data
science every day and not even know it. Python for Data Science For Dummies
not only gets you started using data science to perform a wealth of practical
tasks but also helps you realize just how many places data science is used.
By knowing how to solve data science problems and where to employ data
science, you gain a significant advantage over everyone else, increasing your
chances of promotion or of landing that new job you really want.

About This Book
The main purpose of Python for Data Science For Dummies is to take the scare
factor out of data science by showing you that data science is not only really
interesting but also quite doable using Python. You might assume that you
need to be a computer science genius to perform the complex tasks normally
associated with data science, but that’s far from the truth. Python comes
with a host of useful libraries that do all the heavy lifting for you in the background.
You don’t even realize how much is going on, and you don’t need to
care. All you really need to know is that you want to perform specific tasks
and that Python makes these tasks quite accessible.

Part of the emphasis of this book is on using the right tools. You start with
Anaconda, a product that includes IPython and IPython Notebook — two
tools that take the sting out of working with Python. You experiment with
IPython in a fully interactive environment. The code you place in IPython
Notebook is presentation quality, and you can mix a number of presentation
elements right there in your document. It’s not really like using a development
environment at all.

You also discover some interesting techniques in this book. For example,
you can create plots of all your data science experiments using matplotlib,
for which this book provides you with all the details. This book also spends
considerable time showing you just what is available and how you can use
it to perform some really interesting calculations. Many people would like to
know how to perform handwriting recognition — and if you’re one of them,
you can use this book to get a leg up on the process.

Of course, you might still be worried about the whole programming environment
issue, and this book doesn’t leave you in the dark there, either. At the
beginning, you find complete installation instructions for Anaconda and a
quick primer (with references) to the basic Python programming you need
to perform. The emphasis is on getting you up and running as quickly as
possible and on keeping the examples straightforward and simple so that the code
doesn’t become a stumbling block to learning.

To make absorbing the concepts even easier, this book uses the following
conventions:
✓ Text that you’re meant to type just as it appears in the book is in bold.
The exception is when you’re working through a step list: Because each
step is bold, the text to type is not bold.
✓ When you see words in italics as part of a typing sequence, you need to
replace that value with something that works for you. For example, if
you see “Type Your Name and press Enter,” you need to replace Your
Name with your actual name.
✓ Web addresses and programming code appear in monofont. If you’re
reading a digital version of this book on a device connected to the
Internet, note that you can click the web address to visit that website.
✓ When you need to type command sequences, you see them separated by
a special arrow, like this: File➪New File. In this case, you go to the File
menu first and then select the New File entry on that menu. The result is
that you see a new file created.

Table of Contents
Introduction.................................................................. 1
About This Book...............................................................................................1
Foolish Assumptions........................................................................................2
Icons Used in This Book..................................................................................3
Beyond the Book..............................................................................................4
Where to Go from Here....................................................................................5
Part I: Getting Started with Python for Data Science....... 7
Chapter 1: Discovering the Match between Data Science and Python.......... 9
Defining the Sexiest Job of the 21st Century...............................................11
Considering the emergence of data science.....................................11
Outlining the core competencies of a data scientist........................12
Linking data science and big data......................................................13
Understanding the role of programming...........................................13
Creating the Data Science Pipeline...............................................................14
Preparing the data................................................................................14
Performing exploratory data analysis................................................15
Learning from data...............................................................................15
Visualizing..............................................................................................15
Obtaining insights and data products................................................15
Understanding Python’s Role in Data Science............................................16
Considering the shifting profile of data scientists............................16
Working with a multipurpose, simple, and efficient language........17
Learning to Use Python Fast.........................................................................18
Loading data..........................................................................................18
Training a model...................................................................................18
Viewing a result.....................................................................................20
Chapter 2: Introducing Python’s Capabilities and Wonders.......... 21
Why Python?...................................................................................................22
Grasping Python’s core philosophy...................................................23
Discovering present and future development goals........................23
Working with Python.....................................................................................24
Getting a taste of the language............................................................24
Understanding the need for indentation...........................................25
Working at the command line or in the IDE......................................25
Performing Rapid Prototyping and Experimentation................................29
Considering Speed of Execution...................................................................30
Visualizing Power...........................................................................................32
Using the Python Ecosystem for Data Science...........................................33
Accessing scientific tools using SciPy................................................33
Performing fundamental scientific computing using NumPy..........34
Performing data analysis using pandas.............................................34
Implementing machine learning using Scikit‐learn...........................35
Plotting the data using matplotlib......................................................35
Parsing HTML documents using Beautiful Soup...............................35
Chapter 3: Setting Up Python for Data Science.......... 37
Considering the Off‐the‐Shelf Cross‐Platform Scientific Distributions..........38
Getting Continuum Analytics Anaconda............................................39
Getting Enthought Canopy Express...................................................40
Getting pythonxy..................................................................................40
Getting WinPython................................................................................41
Installing Anaconda on Windows.................................................................41
Installing Anaconda on Linux........................................................................45
Installing Anaconda on Mac OS X.................................................................46
Downloading the Datasets and Example Code...........................................47
Using IPython Notebook......................................................................47
Defining the code repository...............................................................48
Understanding the datasets used in this book.................................54
Chapter 4: Reviewing Basic Python.......... 57
Working with Numbers and Logic................................................................59
Performing variable assignments.......................................................60
Doing arithmetic...................................................................................61
Comparing data using Boolean expressions.....................................62
Creating and Using Strings............................................................................65
Interacting with Dates....................................................................................66
Creating and Using Functions.......................................................................68
Creating reusable functions................................................................68
Calling functions in a variety of ways.................................................70
Using Conditional and Loop Statements.....................................................73
Making decisions using the if statement............................................73
Choosing between multiple options using nested decisions..........74
Performing repetitive tasks using for.................................................75
Using the while statement...................................................................76
Storing Data Using Sets, Lists, and Tuples..................................................77
Performing operations on sets............................................................77
Working with lists.................................................................................78
Creating and using Tuples...................................................................80
Defining Useful Iterators................................................................................81
Indexing Data Using Dictionaries..................................................................82
Part II: Getting Your Hands Dirty with Data.................. 83
Chapter 5: Working with Real Data.......... 85
Uploading, Streaming, and Sampling Data..................................................86
Uploading small amounts of data into memory................................87
Streaming large amounts of data into memory.................................88
Sampling data........................................................................................89
Accessing Data in Structured Flat‐File Form...............................................90
Reading from a text file........................................................................91
Reading CSV delimited format............................................................92
Reading Excel and other Microsoft Office files.................................94
Sending Data in Unstructured File Form.....................................................95
Managing Data from Relational Databases..................................................98
Interacting with Data from NoSQL Databases..........................................100
Accessing Data from the Web.....................................................................101
Chapter 6: Conditioning Your Data.......... 105
Juggling between NumPy and pandas.......................................................106
Knowing when to use NumPy............................................................106
Knowing when to use pandas............................................................106
Validating Your Data....................................................................................107
Figuring out what’s in your data.......................................................108
Removing duplicates..........................................................................109
Creating a data map and data plan...................................................110
Manipulating Categorical Variables...........................................................112
Creating categorical variables..........................................................113
Renaming levels..................................................................................114
Combining levels.................................................................................115
Dealing with Dates in Your Data.................................................................116
Formatting date and time values......................................................117
Using the right time transformation.................................................117
Dealing with Missing Data...........................................................................118
Finding the missing data....................................................................119
Encoding missingness........................................................................119
Imputing missing data........................................................................120
Slicing and Dicing: Filtering and Selecting Data........................................122
Slicing rows..........................................................................................122
Slicing columns...................................................................................123
Dicing....................................................................................................123
Concatenating and Transforming...............................................................124
Adding new cases and variables.......................................................125
Removing data.....................................................................................126
Sorting and shuffling...........................................................................127
Aggregating Data at Any Level....................................................................128
Chapter 7: Shaping Data.......... 131
Working with HTML Pages..........................................................................132
Parsing XML and HTML.....................................................................132
Using XPath for data extraction........................................................133
Working with Raw Text................................................................................134
Dealing with Unicode.........................................................................134
Stemming and removing stop words................................................136
Introducing regular expressions.......................................................137
Using the Bag of Words Model and Beyond..............................................140
Understanding the bag of words model...........................................141
Working with n‐grams........................................................................142
Implementing TF‐IDF transformations.............................................144
Working with Graph Data............................................................................145
Understanding the adjacency matrix...............................................146
Using NetworkX basics......................................................................146
Chapter 8: Putting What You Know in Action.......... 149
Contextualizing Problems and Data...........................................................150
Evaluating a data science problem...................................................151
Researching solutions........................................................................151
Formulating a hypothesis..................................................................152
Preparing your data............................................................................153
Considering the Art of Feature Creation...................................................153
Defining feature creation...................................................................153
Combining variables...........................................................................154
Understanding binning and discretization......................................155
Using indicator variables...................................................................155
Transforming distributions...............................................................156
Performing Operations on Arrays..............................................................156
Using vectorization.............................................................................157
Performing simple arithmetic on vectors and matrices................157
Performing matrix vector multiplication.........................................158
Performing matrix multiplication.....................................................159
Part III: Visualizing the Invisible
Chapter 9: Getting a Crash Course in MatPlotLib
Starting with a Graph
Defining the plot
Drawing multiple lines and plots
Saving your work
Setting the Axis, Ticks, Grids
Getting the axes
Formatting the axes
Adding grids
Defining the Line Appearance
Working with line styles
Using colors
Adding markers
Using Labels, Annotations, and Legends
Adding labels
Annotating the chart
Creating a legend
Chapter 10: Visualizing the Data
Choosing the Right Graph
Showing parts of a whole with pie charts
Creating comparisons with bar charts
Showing distributions using histograms
Depicting groups using box plots
Seeing data patterns using scatterplots
Creating Advanced Scatterplots
Depicting groups
Showing correlations
Plotting Time Series
Representing time on axes
Plotting trends over time
Plotting Geographical Data
Visualizing Graphs
Developing undirected graphs
Developing directed graphs
Chapter 11: Understanding the Tools
Using the IPython Console
Interacting with screen text
Changing the window appearance
Getting Python help
Getting IPython help
Using magic functions
Discovering objects
Using IPython Notebook
Working with styles
Restarting the kernel
Restoring a checkpoint
Performing Multimedia and Graphic Integration
Embedding plots and other images
Loading examples from online sites
Obtaining online graphics and multimedia
Part IV: Wrangling Data
Chapter 12: Stretching Python’s Capabilities
Playing with Scikit‐learn
Understanding classes in Scikit‐learn
Defining applications for data science
Performing the Hashing Trick
Using hash functions
Demonstrating the hashing trick
Working with deterministic selection
Considering Timing and Performance
Benchmarking with timeit
Working with the memory profiler
Running in Parallel
Performing multicore parallelism
Demonstrating multiprocessing
Chapter 13: Exploring Data Analysis
The EDA Approach
Defining Descriptive Statistics for Numeric Data
Measuring central tendency
Measuring variance and range
Working with percentiles
Defining measures of normality
Counting for Categorical Data
Understanding frequencies
Creating contingency tables
Creating Applied Visualization for EDA
Inspecting boxplots
Performing t‐tests after boxplots
Observing parallel coordinates
Graphing distributions
Plotting scatterplots
Understanding Correlation
Using covariance and correlation
Using nonparametric correlation
Considering chi‐square for tables
Modifying Data Distributions
Using the normal distribution
Creating a Z‐score standardization
Transforming other notable distributions
Chapter 14: Reducing Dimensionality
Understanding SVD
Looking for dimensionality reduction
Using SVD to measure the invisible
Performing Factor and Principal Component Analysis
Considering the psychometric model
Looking for hidden factors
Using components, not factors
Achieving dimensionality reduction
Understanding Some Applications
Recognizing faces with PCA
Extracting Topics with NMF
Recommending movies
Chapter 15: Clustering
Clustering with K‐means
Understanding centroid‐based algorithms
Creating an example with image data
Looking for optimal solutions
Clustering big data
Performing Hierarchical Clustering
Moving Beyond the Round-Shaped Clusters: DBScan
Chapter 16: Detecting Outliers in Data
Considering Detection of Outliers
Finding more things that can go wrong
Understanding anomalies and novel data
Examining a Simple Univariate Method
Leveraging on the Gaussian distribution
Making assumptions and checking out
Developing a Multivariate Approach
Using principal component analysis
Using cluster analysis
Automating outliers detection with SVM
Part V: Learning from Data
Chapter 17: Exploring Four Simple and Effective Algorithms
Guessing the Number: Linear Regression
Defining the family of linear models
Using more variables
Understanding limitations and problems
Moving to Logistic Regression
Applying logistic regression
Considering when classes are more
Making Things as Simple as Naïve Bayes
Finding out that Naïve Bayes isn’t so naïve
Predicting text classifications
Learning Lazily with Nearest Neighbors
Predicting after observing neighbors
Choosing your k parameter wisely
Chapter 18: Performing Cross‐Validation, Selection, and Optimization
Pondering the Problem of Fitting a Model
Understanding bias and variance
Defining a strategy for picking models
Dividing between training and test sets
Cross‐Validating
Using cross‐validation on k folds
Sampling stratifications for complex data
Selecting Variables Like a Pro
Selecting by univariate measures
Using a greedy search
Pumping Up Your Hyperparameters
Implementing a grid search
Trying a randomized search
Chapter 19: Increasing Complexity with Linear and Nonlinear Tricks
Using Nonlinear Transformations
Doing variable transformations
Creating interactions between variables
Regularizing Linear Models
Relying on Ridge regression (L2)
Using the Lasso (L1)
Leveraging regularization
Combining L1 & L2: Elasticnet
Fighting with Big Data Chunk by Chunk
Determining when there is too much data
Implementing Stochastic Gradient Descent
Understanding Support Vector Machines
Relying on a computational method
Fixing many new parameters
Classifying with SVC
Going nonlinear is easy
Performing regression with SVR
Creating a stochastic solution with SVM
Chapter 20: Understanding the Power of the Many
Starting with a Plain Decision Tree
Understanding a decision tree
Creating classification and regression trees
Making Machine Learning Accessible
Working with a Random Forest classifier
Working with a Random Forest regressor
Optimizing a Random Forest
Boosting Predictions
Knowing that many weak predictors win
Creating a gradient boosting classifier
Creating a gradient boosting regressor
Using GBM hyper‐parameters
Part VI: The Part of Tens
Chapter 21: Ten Essential Data Science Resource Collections
Gaining Insights with Data Science Weekly
Obtaining a Resource List at U Climb Higher
Getting a Good Start with KDnuggets
Accessing the Huge List of Resources on Data Science Central
Obtaining the Facts of Open Source Data Science from Masters
Locating Free Learning Resources with Quora
Receiving Help with Advanced Topics at Conductrics
Learning New Tricks from the Aspirational Data Scientist
Finding Data Intelligence and Analytics Resources at AnalyticBridge
Zeroing In on Developer Resources with Jonathan Bower
Chapter 22: Ten Data Challenges You Should Take
Meeting the Data Science London + Scikit‐learn Challenge
Predicting Survival on the Titanic
Finding a Kaggle Competition that Suits Your Needs
Honing Your Overfit Strategies
Trudging Through the MovieLens Dataset
Getting Rid of Spam Emails
Working with Handwritten Information
Working with Pictures
Analyzing Amazon.com Reviews
Interacting with a Huge Graph
Index




Beyond the Book
This book isn’t the end of your Python or data science experience — it’s
really just the beginning. We provide online content to make this book more
flexible and better able to meet your needs. That way, as we receive email
from you, we can address questions and tell you how updates to either
Python or its associated add‐ons affect book content. In fact, you gain access
to all these cool additions:
✓ Cheat sheet: You remember using crib notes in school to make a better
mark on a test, don’t you? You do? Well, a cheat sheet is sort of like
that. It provides you with some special notes about tasks that you can
do with Python, IPython, IPython Notebook, and data science that not
every other person knows. You can find the cheat sheet for this book at
It contains really neat information such as the most common programming
mistakes that cause people woe when using Python.
✓ Dummies.com online articles: A lot of readers were skipping past the
parts pages in For Dummies books, so the publisher decided to remedy
that. You now have a really good reason to read the parts pages —
online content. Every parts page has an article associated with it that
provides additional interesting information that wouldn’t fit in the book.
You can find the articles for this book at
http://www.dummies.com/extras/pythonfordatascience.
✓ Updates: Sometimes changes happen. For example, we might not
have seen an upcoming change when we looked into our crystal ball
during the writing of this book. In the past, this possibility simply
meant that the book became outdated and less useful, but you can now
In addition to these updates, check out the blog posts with answers to
reader questions and demonstrations of useful book‐related techniques
✓ Companion files: Hey! Who really wants to type all the code in the book
and reconstruct all those plots manually? Most readers would prefer
to spend their time actually working with Python, performing data science
tasks, and seeing the interesting things they can do, rather than
typing. Fortunately for you, the examples used in the book are available
for download, so all you need to do is read the book to learn Python for
data science usage techniques. 
You can find these files at http://www.dummies.com/extras/matlab.