Description
Wiley India The Data Science Handbook by Field Cady
Finding a good data scientist has been likened to hunting for a unicorn. The required combination of software engineering skills, mathematical fluency, and business savvy is simply very hard to find in one person. On top of that, good data science is not just rote application of trainable skillsets, but rather requires the ability to think critically in all these areas. This book provides a crash course in data science, combining all the necessary skills into a unified discipline. The author describes the classic machine learning algorithms, including the mathematics needed to understand what's really going on. Classical statistics is taught so that readers learn to think critically about the interpretation of data and its common pitfalls. In addition, basic software engineering and computer science skillsets often lacking in data scientists are given a central place in the book. Visualization tools are reviewed, and their central importance in data science is highlighted.
About the Author
Field Cady is Senior Data Scientist at Think Big Analytics where he delivers statistically rigorous, business-relevant insights based on client data and writes production software to capitalize on these insights. He received his BA in Physics from Stanford University, an MS in Applied Mathematics from the University of Washington, and an MS in Computer Science from Carnegie Mellon University.
TABLE OF CONTENTS
Preface
1 Introduction: Becoming a Unicorn
1.1 Aren't Data Scientists Just Overpaid Statisticians?
1.2 How Is This Book Organized?
1.3 How to Use This Book?
1.4 Why Is It All in Python, Anyway?
1.5 Example Code and Datasets
1.6 Parting Words
Part I The Stuff You'll Always Use
2 The Data Science Road Map
2.1 Frame the Problem
2.2 Understand the Data: Basic Questions
2.3 Understand the Data: Data Wrangling
2.4 Understand the Data: Exploratory Analysis
2.5 Extract Features
2.6 Model
2.7 Present Results
2.8 Deploy Code
2.9 Iterating
2.10 Glossary
3 Programming Languages
3.1 Why Use a Programming Language? What Are the Other Options?
3.2 A Survey of Programming Languages for Data Science
3.3 Python Crash Course
3.4 Strings
3.5 Defining Functions
3.6 Python's Technical Libraries
3.7 Other Python Resources
3.8 Further Reading
3.9 Glossary
3a Interlude: My Personal Toolkit
4 Data Munging: String Manipulation, Regular Expressions and Data Cleaning
4.1 The Worst Dataset in the World
4.2 How to Identify Pathologies
4.3 Problems with Data Content
4.4 Formatting Issues
4.5 Example Formatting Script
4.6 Regular Expressions
4.7 Life in the Trenches
4.8 Glossary
5 Visualizations and Simple Metrics
5.1 A Note on Python's Visualization Tools
5.2 Example Code
5.3 Pie Charts
5.4 Bar Charts
5.5 Histograms
5.6 Means, Standard Deviations, Medians and Quantiles
5.7 Boxplots
5.8 Scatterplots
5.9 Scatterplots with Logarithmic Axes
5.10 Scatter Matrices
5.11 Heatmaps
5.12 Correlations
5.13 Anscombe's Quartet and the Limits of Numbers
5.14 Time Series
5.15 Further Reading
5.16 Glossary
6 Machine Learning Overview
6.1 Historical Context
6.2 Supervised versus Unsupervised
6.3 Training Data, Testing Data and the Great Boogeyman of Overfitting
6.4 Further Reading
6.5 Glossary
7 Interlude: Feature Extraction Ideas
7.1 Standard Features
7.2 Features That Involve Grouping
7.3 Preview of More Sophisticated Features
7.4 Defining the Feature You Want to Predict
8 Machine Learning Classification
8.1 What Is a Classifier and What Can You Do with It?
8.2 A Few Practical Concerns
8.3 Binary versus Multiclass
8.4 Example Script
8.5 Specific Classifiers
8.6 Evaluating Classifiers
8.7 Selecting Classification Cutoffs
8.8 Further Reading
8.9 Glossary
9 Technical Communication and Documentation
9.1 Several Guiding Principles
9.2 Slide Decks
9.3 Written Reports
9.4 Speaking: What Has Worked for Me
9.5 Code Documentation
9.6 Further Reading
9.7 Glossary
Part II Stuff You Still Need to Know
10 Unsupervised Learning: Clustering and Dimensionality Reduction
10.1 The Curse of Dimensionality
10.2 Example: Eigenfaces for Dimensionality Reduction
10.3 Principal Component Analysis and Factor Analysis
10.4 Scree Plots and Understanding Dimensionality
10.5 Factor Analysis
10.6 Limitations of PCA
10.7 Clustering
10.8 Further Reading
10.9 Glossary
11 Regression
11.1 Example: Predicting Diabetes Progression
11.2 Least Squares
11.3 Fitting Nonlinear Curves
11.4 Goodness of Fit: R² and Correlation
11.5 Correlation of Residuals
11.6 Linear Regression
11.7 LASSO Regression and Feature Selection
11.8 Further Reading
11.9 Glossary
12 Data Encodings and File Formats
12.1 Typical File Format Categories
12.2 CSV Files
12.3 JSON Files
12.4 XML Files
12.5 HTML Files
12.6 Tar Files
12.7 GZip Files
12.8 Zip Files
12.9 Image Files: Rasterized, Vectorized, and/or Compressed
12.10 It's All Bytes at the End of the Day
12.11 Integers
12.12 Floats
12.13 Text Data
12.14 Further Reading
12.15 Glossary
13 Big Data
13.1 What Is Big Data?
13.2 Hadoop: The File System and the Processor
13.3 Using HDFS
13.4 Example PySpark Script
13.5 Spark Overview
13.6 Spark Operations
13.7 Two Ways to Run PySpark
13.8 Configuring Spark
13.9 Under the Hood
13.10 Spark Tips and Gotchas
13.11 The MapReduce Paradigm
13.12 Performance Considerations
13.13 Further Reading
13.14 Glossary
14 Databases
14.1 Relational Databases and MySQL
14.2 Key-Value Stores
14.3 Wide Column Stores
14.4 Document Stores
14.5 Further Reading
14.6 Glossary
15 Software Engineering Best Practices
15.1 Coding Style
15.2 Version Control and Git for Data Scientists
15.3 Testing Code
15.4 Test-Driven Development
15.5 AGILE Methodology
15.6 Further Reading
15.7 Glossary
16 Natural Language Processing
16.1 Do I Even Need NLP?
16.2 The Great Divide: Language versus Statistics
16.3 Example: Sentiment Analysis on Stock Market Articles
16.4 Software and Datasets
16.5 Tokenization
16.6 Central Concept: Bag of Words
16.7 Word Weighting: TFIDF
16.8 nGrams
16.9 Stop Words
16.10 Lemmatization and Stemming
16.11 Synonyms
16.12 Part of Speech Tagging
16.13 Common Problems
16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding
16.15 Further Reading
16.16 Glossary
17 Time Series Analysis
17.1 Example: Predicting Wikipedia Page Views
17.2 A Typical Workflow
17.3 Time Series versus Time-Stamped Events
17.4 Resampling and Interpolation
17.5 Smoothing Signals
17.6 Logarithms and Other Transformations
17.7 Trends and Periodicity
17.8 Windowing
17.9 Brainstorming Simple Features
17.10 Better Features: Time Series as Vectors
17.11 Fourier Analysis: Sometimes a Magic Bullet
17.12 Time Series in Context: The Whole Suite of Features
17.13 Further Reading
17.14 Glossary
18 Probability
18.1 Flipping Coins: Bernoulli Random Variables
18.2 Throwing Darts: Uniform Random Variables
18.3 The Uniform Distribution and Pseudorandom Numbers
18.4 Nondiscrete, Noncontinuous Random Variables
18.5 Notation, Expectations, and Standard Deviation
18.6 Dependence, Marginal and Conditional Probability
18.7 Understanding the Tails
18.8 Binomial Distribution
18.9 Poisson Distribution
18.10 Normal Distribution
18.11 Multivariate Gaussian
18.12 Exponential Distribution
18.13 Log-Normal Distribution
18.14 Entropy
18.15 Further Reading
18.16 Glossary
19 Statistics
19.1 Statistics in Perspective
19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies
19.3 Hypothesis Testing: Key Idea and Example
19.4 Multiple Hypothesis Testing
19.5 Parameter Estimation
19.6 Hypothesis Testing: t-Test
19.7 Confidence Intervals
19.8 Bayesian Statistics
19.9 Naive Bayesian Statistics
19.10 Bayesian Networks
19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
19.12 Further Reading
19.13 Glossary
20 Programming Language Concepts
20.1 Programming Paradigms
20.2 Compilation and Interpretation
20.3 Type Systems
20.4 Further Reading
20.5 Glossary
21 Performance and Computer Memory
21.1 Example Script
21.2 Algorithm Performance and Big O Notation
21.3 Some Classic Problems: Sorting a List and Binary Search
21.4 Amortized Performance and Average Performance
21.5 Two Principles: Reducing Overhead and Managing Memory
21.6 Performance Tip: Use Numerical Libraries When Applicable
21.7 Performance Tip: Delete Large Structures You Don't Need
21.8 Performance Tip: Use Built In Functions When Possible
21.9 Performance Tip: Avoid Superfluous Function Calls
21.10 Performance Tip: Avoid Creating Large New Objects
21.11 Further Reading
21.12 Glossary
Part III Specialized or Advanced Topics
22 Computer Memory and Data Structures
22.1 Virtual Memory, the Stack, and the Heap
22.2 Example C Program
22.3 Data Types and Arrays in Memory
22.4 Structs
22.5 Pointers, the Stack, and the Heap
22.6 Key Data Structures
22.7 Further Reading
22.8 Glossary
23 Maximum Likelihood Estimation and Optimization
23.1 Maximum Likelihood Estimation
23.2 A Simple Example: Fitting a Line
23.3 Another Example: Logistic Regression
23.4 Optimization
23.5 Gradient Descent and Convex Optimization
23.6 Convex Optimization
23.7 Stochastic Gradient Descent
23.8 Further Reading
23.9 Glossary
24 Advanced Classifiers
24.1 A Note on Libraries
24.2 Basic Deep Learning
24.3 Convolutional Neural Networks
24.4 Different Types of Layers. What the Heck Is a Tensor?
24.5 Example: The MNIST Handwriting Dataset
24.6 Recurrent Neural Networks
24.7 Bayesian Networks
24.8 Training and Prediction
24.9 Markov Chain Monte Carlo
24.10 PyMC Example
24.11 Further Reading
24.12 Glossary
25 Stochastic Modeling
25.1 Markov Chains
25.2 Two Kinds of Markov Chain, Two Kinds of Questions
25.3 Markov Chain Monte Carlo
25.4 Hidden Markov Models and the Viterbi Algorithm
25.5 The Viterbi Algorithm
25.6 Random Walks
25.7 Brownian Motion
25.8 ARIMA Models
25.9 Continuous Time Markov Processes
25.10 Poisson Processes
25.11 Further Reading
25.12 Glossary
25a Parting Words: Your Future as a Data Scientist
Index