The Data Science Handbook

Fr. 102.00

incl. statutory VAT, free shipping


Product details

Binding: Hardcover
Publication date: 5 December 2024
Publisher: Wiley
Number of pages: 368
Dimensions (L/W/H): 26.1/18.5/2.7 cm
Weight: 826 g
Edition: 2nd edition
Language: English
ISBN: 978-1-394-23449-3

Manufacturer address

Libri GmbH
Europaallee 1
36244 Bad Hersfeld
DE

Email: gpsr@libri.de

  • Preface to the First Edition

    Preface to the Second Edition

    1 Introduction

    1.1 What Data Science Is and Isn't

    1.2 This Book's Slogan: Simple Models Are Easier to Work With

    1.3 How Is This Book Organized?

    1.4 How to Use This Book?

    1.5 Why Is It All in Python, Anyway?

    1.6 Example Code and Datasets

    1.7 Parting Words

    Part I The Stuff You'll Always Use

    2 The Data Science Road Map

    2.1 Frame the Problem

    2.2 Understand the Data: Basic Questions

    2.3 Understand the Data: Data Wrangling

    2.4 Understand the Data: Exploratory Analysis

    2.5 Extract Features

    2.6 Model

    2.7 Present Results

    2.8 Deploy Code

    2.9 Iterating

    2.10 Glossary

    3 Programming Languages

    3.1 Why Use a Programming Language? What Are the Other Options?

    3.2 A Survey of Programming Languages for Data Science

    3.3 Where to Write Code

    3.4 Python Overview and Example Scripts

    3.5 Python Data Types

    3.6 GOTCHA: Hashable and Unhashable Types

    3.7 Functions and Control Structures

    3.8 Other Parts of Python

    3.9 Python's Technical Libraries

    3.10 Other Python Resources

    3.11 Further Reading

    3.12 Glossary

    3a Interlude: My Personal Toolkit

    4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning

    4.1 The Worst Dataset in the World

    4.2 How to Identify Pathologies

    4.3 Problems with Data Content

    4.4 Formatting Issues

    4.5 Example Formatting Script

    4.6 Regular Expressions

    4.7 Life in the Trenches

    4.8 Glossary

    5 Visualizations and Simple Metrics

    5.1 A Note on Python's Visualization Tools

    5.2 Example Code

    5.3 Pie Charts

    5.4 Bar Charts

    5.5 Histograms

    5.6 Means, Standard Deviations, Medians, and Quantiles

    5.7 Boxplots

    5.8 Scatterplots

    5.9 Scatterplots with Logarithmic Axes

    5.10 Scatter Matrices

    5.11 Heatmaps

    5.12 Correlations

    5.13 Anscombe's Quartet and the Limits of Numbers

    5.14 Time Series

    5.15 Further Reading

    5.16 Glossary

    6 Overview: Machine Learning and Artificial Intelligence

    6.1 Historical Context

    6.2 The Central Paradigm: Learning a Function from Example

    6.3 Machine Learning Data: Vectors and Feature Extraction

    6.4 Supervised, Unsupervised, and In-Between

    6.5 Training Data, Testing Data, and the Great Boogeyman of Overfitting

    6.6 Reinforcement Learning

    6.7 ML Models as Building Blocks for AI Systems

    6.8 ML Engineering as a New Job Role

    6.9 Further Reading

    6.10 Glossary

    7 Interlude: Feature Extraction Ideas

    7.1 Standard Features

    7.2 Features that Involve Grouping

    7.3 Preview of More Sophisticated Features

    7.4 You Get What You Measure: Defining the Target Variable

    8 Machine-Learning Classification

    8.1 What Is a Classifier, and What Can You Do with It?

    8.2 A Few Practical Concerns

    8.3 Binary Versus Multiclass

    8.4 Example Script

    8.5 Specific Classifiers

    8.6 Evaluating Classifiers

    8.7 Selecting Classification Cutoffs

    8.8 Further Reading

    8.9 Glossary

    9 Technical Communication and Documentation

    9.1 Several Guiding Principles

    9.2 Slide Decks

    9.3 Written Reports

    9.4 Speaking: What Has Worked for Me

    9.5 Code Documentation

    9.6 Further Reading

    9.7 Glossary

    Part II Stuff You Still Need to Know

    10 Unsupervised Learning: Clustering and Dimensionality Reduction

    10.1 The Curse of Dimensionality

    10.2 Example: Eigenfaces for Dimensionality Reduction

    10.3 Principal Component Analysis and Factor Analysis

    10.4 Skree Plots and Understanding Dimensionality

    10.5 Factor Analysis

    10.6 Limitations of PCA

    10.7 Clustering

    10.8 Further Reading

    10.9 Glossary

    11 Regression

    11.1 Example: Predicting Diabetes Progression

    11.2 Fitting a Line with Least Squares

    11.3 Alternatives to Least Squares

    11.4 Fitting Nonlinear Curves

    11.5 Goodness of Fit: R² and Correlation

    11.6 Correlation of Residuals

    11.7 Linear Regression

    11.8 LASSO Regression and Feature Selection

    11.9 Further Reading

    11.10 Glossary

    12 Data Encodings and File Formats

    12.1 Typical File Format Categories

    12.2 CSV Files

    12.3 JSON Files

    12.4 XML Files

    12.5 HTML Files

    12.6 Tar Files

    12.7 GZip Files

    12.8 Zip Files

    12.9 Image Files: Rasterized, Vectorized, and/or Compressed

    12.10 It's All Bytes at the End of the Day

    12.11 Integers

    12.12 Floats

    12.13 Text Data

    12.14 Further Reading

    12.15 Glossary

    13 Big Data

    13.1 What Is Big Data?

    13.2 When to Use - And not Use - Big Data

    13.3 Hadoop: The File System and the Processor

    13.4 Example PySpark Script

    13.5 Spark Overview

    13.6 Spark Operations

    13.7 PySpark Data Frames

    13.8 Two Ways to Run PySpark

    13.9 Configuring Spark

    13.10 Under the Hood

    13.11 Spark Tips and Gotchas

    13.12 The MapReduce Paradigm

    13.13 Performance Considerations

    13.14 Further Reading

    13.15 Glossary

    14 Databases

    14.1 Relational Databases and MySQL®

    14.2 Key-Value Stores

    14.3 Wide-Column Stores

    14.4 Document Stores

    14.5 Further Reading

    14.6 Glossary

    15 Software Engineering Best Practices

    15.1 Coding Style

    15.2 Version Control and Git for Data Scientists

    15.3 Testing Code

    15.4 Test-Driven Development

    15.5 AGILE Methodology

    15.6 Further Reading

    15.7 Glossary

    16 Traditional Natural Language Processing

    16.1 Do I Even Need NLP?

    16.2 The Great Divide: Language Versus Statistics

    16.3 Example: Sentiment Analysis on Stock Market Articles

    16.4 Software and Datasets

    16.5 Tokenization

    16.6 Central Concept: Bag-of-Words

    16.7 Word Weighting: TF-IDF

    16.8 n-Grams

    16.9 Stop Words

    16.10 Lemmatization and Stemming

    16.11 Synonyms

    16.12 Part of Speech Tagging

    16.13 Common Problems

    16.14 Advanced Linguistic NLP: Syntax Trees, Knowledge, and Understanding

    16.15 Further Reading

    16.16 Glossary

    17 Time Series Analysis

    17.1 Example: Predicting Wikipedia Page Views

    17.2 A Typical Workflow

    17.3 Time Series Versus Time-Stamped Events

    17.4 Resampling and Interpolation

    17.5 Smoothing Signals

    17.6 Logarithms and Other Transformations

    17.7 Trends and Periodicity

    17.8 Windowing

    17.9 Brainstorming Simple Features

    17.10 Better Features: Time Series as Vectors

    17.11 Fourier Analysis: Sometimes a Magic Bullet

    17.12 Time Series in Context: The Whole Suite of Features

    17.13 Further Reading

    17.14 Glossary

    18 Probability

    18.1 Flipping Coins: Bernoulli Random Variables

    18.2 Throwing Darts: Uniform Random Variables

    18.3 The Uniform Distribution and Pseudorandom Numbers

    18.4 Nondiscrete, Noncontinuous Random Variables

    18.5 Notation, Expectations, and Standard Deviation

    18.6 Dependence, Marginal, and Conditional Probability

    18.7 Understanding the Tails

    18.8 Binomial Distribution

    18.9 Poisson Distribution

    18.10 Normal Distribution

    18.11 Multivariate Gaussian

    18.12 Exponential Distribution

    18.13 Log-Normal Distribution

    18.14 Entropy

    18.15 Further Reading

    18.16 Glossary

    19 Statistics

    19.1 Statistics in Perspective

    19.2 Bayesian Versus Frequentist: Practical Tradeoffs and Differing Philosophies

    19.3 Hypothesis Testing: Key Idea and Example

    19.4 Multiple Hypothesis Testing

    19.5 Parameter Estimation

    19.6 Hypothesis Testing: t-Test

    19.7 Confidence Intervals

    19.8 Bayesian Statistics

    19.9 Naive Bayesian Statistics

    19.10 Bayesian Networks

    19.11 Choosing Priors: Maximum Entropy or Domain Knowledge

    19.12 Further Reading

    19.13 Glossary

    20 Programming Language Concepts

    20.1 Programming Paradigms

    20.2 Compilation and Interpretation

    20.3 Type Systems

    20.4 Further Reading

    20.5 Glossary

    21 Performance and Computer Memory

    21.1 A Word of Caution

    21.2 Example Script

    21.3 Algorithm Performance and Big-O Notation

    21.4 Some Classic Problems: Sorting a List and Binary Search

    21.5 Amortized Performance and Average Performance

    21.6 Two Principles: Reducing Overhead and Managing Memory

    21.7 Performance Tip: Use Numerical Libraries When Applicable

    21.8 Performance Tip: Delete Large Structures You Don't Need

    21.9 Performance Tip: Use Built-In Functions When Possible

    21.10 Performance Tip: Avoid Superfluous Function Calls

    21.11 Performance Tip: Avoid Creating Large New Objects

    21.12 Further Reading

    21.13 Glossary

    Part III Specialized or Advanced Topics

    22 Computer Memory and Data Structures

    22.1 Virtual Memory, the Stack, and the Heap

    22.2 Example C Program

    22.3 Data Types and Arrays in Memory

    22.4 Structs

    22.5 Pointers, the Stack, and the Heap

    22.6 Key Data Structures

    22.7 Further Reading

    22.8 Glossary

    23 Maximum-Likelihood Estimation and Optimization

    23.1 Maximum-Likelihood Estimation

    23.2 A Simple Example: Fitting a Line

    23.3 Another Example: Logistic Regression

    23.4 Optimization

    23.5 Gradient Descent

    23.6 Convex Optimization

    23.7 Stochastic Gradient Descent

    23.8 Further Reading

    23.9 Glossary

    24 Deep Learning and AI

    24.1 A Note on Libraries and Hardware

    24.2 A Note on Training Data

    24.3 Simple Deep Learning: Perceptrons

    24.4 What Is a Tensor?

    24.5 Convolutional Neural Networks

    24.6 Example: The MNIST Handwriting Dataset

    24.7 Autoencoders and Latent Vectors

    24.8 Generative AI and GANs

    24.9 Diffusion Models

    24.10 RNNs, Hidden State, and the Encoder-Decoder

    24.11 Attention and Transformers

    24.12 Stable Diffusion: Bringing the Parts Together

    24.13 Large Language Models and Prompt Engineering

    24.14 Further Reading

    24.15 Glossary

    25 Stochastic Modeling

    25.1 Markov Chains

    25.2 Two Kinds of Markov Chain, Two Kinds of Questions

    25.3 Hidden Markov Models and the Viterbi Algorithm

    25.4 The Viterbi Algorithm

    25.5 Random Walks

    25.6 Brownian Motion

    25.7 ARIMA Models

    25.8 Continuous-Time Markov Processes

    25.9 Poisson Processes

    25.10 Further Reading

    25.11 Glossary

    26 Parting Words: Your Future as a Data Scientist

    Index