Description

Finding a good data scientist has been likened to hunting for a unicorn. The required combination of software engineering skills, mathematical fluency, and business savvy is simply very hard to find in one person. On top of that, good data science is not just rote application of trainable skillsets, but rather requires the ability to think critically in all these areas. This book provides a crash course in data science, combining all the necessary skills into a unified discipline. The author describes the classic machine learning algorithms, including the mathematics needed to understand what's really going on. Classical statistics is taught so that readers learn to think critically about the interpretation of data and its common pitfalls. In addition, basic software engineering and computer science skillsets often lacking in data scientists are given a central place in the book. Visualization tools are reviewed, and their central importance in data science is highlighted.

About the Author

Field Cady is Senior Data Scientist at Think Big Analytics, where he delivers statistically rigorous, business-relevant insights based on client data and writes production software to capitalize on these insights. He received his BA in Physics from Stanford University, an MS in Applied Mathematics from the University of Washington, and an MS in Computer Science from Carnegie Mellon University.

TABLE OF CONTENTS

Preface
1 Introduction: Becoming a Unicorn
1.1 Aren't Data Scientists Just Overpaid Statisticians?
1.2 How Is This Book Organized?
1.3 How to Use This Book?
1.4 Why Is It All in Python, Anyway?
1.5 Example Code and Datasets
1.6 Parting Words

Part I The Stuff You'll Always Use

2 The Data Science Road Map
2.1 Frame the Problem
2.2 Understand the Data: Basic Questions
2.3 Understand the Data: Data Wrangling
2.4 Understand the Data: Exploratory Analysis
2.5 Extract Features
2.6 Model
2.7 Present Results
2.8 Deploy Code
2.9 Iterating
2.10 Glossary
3 Programming Languages
3.1 Why Use a Programming Language? What Are the Other Options?
3.2 A Survey of Programming Languages for Data Science
3.3 Python Crash Course
3.4 Strings
3.5 Defining Functions
3.6 Python's Technical Libraries
3.7 Other Python Resources
3.8 Further Reading
3.9 Glossary
3a Interlude: My Personal Toolkit
4 Data Munging: String Manipulation, Regular Expressions and Data Cleaning
4.1 The Worst Dataset in the World
4.2 How to Identify Pathologies
4.3 Problems with Data Content
4.4 Formatting Issues
4.5 Example Formatting Script
4.6 Regular Expressions
4.7 Life in the Trenches
4.8 Glossary
5 Visualizations and Simple Metrics
5.1 A Note on Python's Visualization Tools
5.2 Example Code
5.3 Pie Charts
5.4 Bar Charts
5.5 Histograms
5.6 Means, Standard Deviations, Medians and Quantiles
5.7 Boxplots
5.8 Scatterplots
5.9 Scatterplots with Logarithmic Axes
5.10 Scatter Matrices
5.11 Heatmaps
5.12 Correlations
5.13 Anscombe's Quartet and the Limits of Numbers
5.14 Time Series
5.15 Further Reading
5.16 Glossary
6 Machine Learning Overview
6.1 Historical Context
6.2 Supervised versus Unsupervised
6.3 Training Data, Testing Data and the Great Boogeyman of Overfitting
6.4 Further Reading
6.5 Glossary
7 Interlude: Feature Extraction Ideas
7.1 Standard Features
7.2 Features That Involve Grouping
7.3 Preview of More Sophisticated Features
7.4 Defining the Feature You Want to Predict
8 Machine Learning Classification
8.1 What Is a Classifier and What Can You Do with It?
8.2 A Few Practical Concerns
8.3 Binary versus Multiclass
8.4 Example Script
8.5 Specific Classifiers
8.6 Evaluating Classifiers
8.7 Selecting Classification Cutoffs
8.8 Further Reading
8.9 Glossary
9 Technical Communication and Documentation
9.1 Several Guiding Principles
9.2 Slide Decks
9.3 Written Reports
9.4 Speaking: What Has Worked for Me
9.5 Code Documentation
9.6 Further Reading
9.7 Glossary

Part II Stuff You Still Need to Know

10 Unsupervised Learning: Clustering and Dimensionality Reduction
10.1 The Curse of Dimensionality
10.2 Example: Eigenfaces for Dimensionality Reduction
10.3 Principal Component Analysis and Factor Analysis
10.4 Skree Plots and Understanding Dimensionality
10.5 Factor Analysis
10.6 Limitations of PCA
10.7 Clustering
10.8 Further Reading
10.9 Glossary
11 Regression
11.1 Example: Predicting Diabetes Progression
11.2 Least Squares
11.3 Fitting Nonlinear Curves
11.4 Goodness of Fit: R2 and Correlation
11.5 Correlation of Residuals
11.6 Linear Regression
11.7 LASSO Regression and Feature Selection
11.8 Further Reading
11.9 Glossary
12 Data Encodings and File Formats
12.1 Typical File Format Categories
12.2 CSV Files
12.3 JSON Files
12.4 XML Files
12.5 HTML Files
12.6 Tar Files
12.7 GZip Files
12.8 Zip Files
12.9 Image Files: Rasterized, Vectorized, and/or Compressed
12.10 It's All Bytes at the End of the Day
12.11 Integers
12.12 Floats
12.13 Text Data
12.14 Further Reading
12.15 Glossary
13 Big Data
13.1 What Is Big Data?
13.2 Hadoop: The File System and the Processor
13.3 Using HDFS
13.4 Example PySpark Script
13.5 Spark Overview
13.6 Spark Operations
13.7 Two Ways to Run PySpark
13.8 Configuring Spark
13.9 Under the Hood
13.10 Spark Tips and Gotchas
13.11 The MapReduce Paradigm
13.12 Performance Considerations
13.13 Further Reading
13.14 Glossary
14 Databases
14.1 Relational Databases and MySQL
14.2 Key-Value Stores
14.3 Wide Column Stores
14.4 Document Stores
14.5 Further Reading
14.6 Glossary
15 Software Engineering Best Practices
15.1 Coding Style
15.2 Version Control and Git for Data Scientists
15.3 Testing Code
15.4 Test-Driven Development
15.5 AGILE Methodology
15.6 Further Reading
15.7 Glossary
16 Natural Language Processing
16.1 Do I Even Need NLP?
16.2 The Great Divide: Language versus Statistics
16.3 Example: Sentiment Analysis on Stock Market Articles
16.4 Software and Datasets
16.5 Tokenization
16.6 Central Concept: Bag of Words
16.7 Word Weighting: TFIDF
16.8 nGrams
16.9 Stop Words
16.10 Lemmatization and Stemming
16.11 Synonyms
16.12 Part of Speech Tagging
16.13 Common Problems
16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding
16.15 Further Reading
16.16 Glossary
17 Time Series Analysis
17.1 Example: Predicting Wikipedia Page Views
17.2 A Typical Workflow
17.3 Time Series versus Time-Stamped Events
17.4 Resampling and Interpolation
17.5 Smoothing Signals
17.6 Logarithms and Other Transformations
17.7 Trends and Periodicity
17.8 Windowing
17.9 Brainstorming Simple Features
17.10 Better Features: Time Series as Vectors
17.11 Fourier Analysis: Sometimes a Magic Bullet
17.12 Time Series in Context: The Whole Suite of Features
17.13 Further Reading
17.14 Glossary
18 Probability
18.1 Flipping Coins: Bernoulli Random Variables
18.2 Throwing Darts: Uniform Random Variables
18.3 The Uniform Distribution and Pseudorandom Numbers
18.4 Nondiscrete, Noncontinuous Random Variables
18.5 Notation, Expectations, and Standard Deviation
18.6 Dependence, Marginal and Conditional Probability
18.7 Understanding the Tails
18.8 Binomial Distribution
18.9 Poisson Distribution
18.10 Normal Distribution
18.11 Multivariate Gaussian
18.12 Exponential Distribution
18.13 Log-Normal Distribution
18.14 Entropy
18.15 Further Reading
18.16 Glossary
19 Statistics
19.1 Statistics in Perspective
19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies
19.3 Hypothesis Testing: Key Idea and Example
19.4 Multiple Hypothesis Testing
19.5 Parameter Estimation
19.6 Hypothesis Testing: t-Test
19.7 Confidence Intervals
19.8 Bayesian Statistics
19.9 Naive Bayesian Statistics
19.10 Bayesian Networks
19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
19.12 Further Reading
19.13 Glossary
20 Programming Language Concepts
20.1 Programming Paradigms
20.2 Compilation and Interpretation
20.3 Type Systems
20.4 Further Reading
20.5 Glossary
21 Performance and Computer Memory
21.1 Example Script
21.2 Algorithm Performance and Big O Notation
21.3 Some Classic Problems: Sorting a List and Binary Search
21.4 Amortized Performance and Average Performance
21.5 Two Principles: Reducing Overhead and Managing Memory
21.6 Performance Tip: Use Numerical Libraries When Applicable
21.7 Performance Tip: Delete Large Structures You Don't Need
21.8 Performance Tip: Use Built In Functions When Possible
21.9 Performance Tip: Avoid Superfluous Function Calls
21.10 Performance Tip: Avoid Creating Large New Objects
21.11 Further Reading
21.12 Glossary

Part III Specialized or Advanced Topics

22 Computer Memory and Data Structures
22.1 Virtual Memory, the Stack, and the Heap
22.2 Example C Program
22.3 Data Types and Arrays in Memory
22.4 Structs
22.5 Pointers, the Stack, and the Heap
22.6 Key Data Structures
22.7 Further Reading
22.8 Glossary
23 Maximum Likelihood Estimation and Optimization
23.1 Maximum Likelihood Estimation
23.2 A Simple Example: Fitting a Line
23.3 Another Example: Logistic Regression
23.4 Optimization
23.5 Gradient Descent and Convex Optimization
23.6 Convex Optimization
23.7 Stochastic Gradient Descent
23.8 Further Reading
23.9 Glossary
24 Advanced Classifiers
24.1 A Note on Libraries
24.2 Basic Deep Learning
24.3 Convolutional Neural Networks
24.4 Different Types of Layers. What the Heck Is a Tensor?
24.5 Example: The MNIST Handwriting Dataset
24.6 Recurrent Neural Networks
24.7 Bayesian Networks
24.8 Training and Prediction
24.9 Markov Chain Monte Carlo
24.10 PyMC Example
24.11 Further Reading
24.12 Glossary
25 Stochastic Modeling
25.1 Markov Chains
25.2 Two Kinds of Markov Chain, Two Kinds of Questions
25.3 Markov Chain Monte Carlo
25.4 Hidden Markov Models and the Viterbi Algorithm
25.5 The Viterbi Algorithm
25.6 Random Walks
25.7 Brownian Motion
25.8 ARIMA Models
25.9 Continuous Time Markov Processes
25.10 Poisson Processes
25.11 Further Reading
25.12 Glossary
25a Parting Words: Your Future as a Data Scientist
Index

More Details about The Data Science Handbook

General Information  
Author(s): Field Cady
Publisher: Wiley India
ISBN: 9788126573332
Pages: 416
Binding: Paperback
Language: English
Publish Year: November 2018