Statistics for Data Science

  1. Regression Analysis for Prediction
    1. Correlation Analysis
      1. Correlation vs. Causation
        1. Fundamental Distinction
          1. Association vs. Causal Relationship
          2. Third Variable Problem
          3. Temporal Sequence Importance
        2. Common Fallacies
          1. Post Hoc Ergo Propter Hoc
          2. Confounding Variables
          3. Reverse Causation
        3. Establishing Causation
          1. Experimental Design
          2. Longitudinal Studies
          3. Causal Inference Methods
      2. Pearson Correlation Coefficient
        1. Definition and Formula
          1. Covariance Standardization
          2. Range [-1, 1]
          3. Linear Relationship Measure
        2. Calculation Methods
          1. Computational Formula
          2. Definitional Formula
          3. Software Implementation
        3. Interpretation Guidelines
          1. Strength Categories
          2. Direction (Positive/Negative)
          3. Perfect Correlation Cases
        4. Assumptions
          1. Linear Relationship
          2. Bivariate Normality
          3. Continuous Variables
        5. Limitations
          1. Non-Linear Relationships
          2. Outlier Sensitivity
          3. Restriction of Range
      3. Other Correlation Measures
        1. Spearman's Rank Correlation
        2. Kendall's Tau
        3. Point-Biserial Correlation
      4. Correlation Matrix
        1. Construction and Interpretation
        2. Multicollinearity Detection
        3. Visualization Methods
    2. Simple Linear Regression
      1. Model Formulation
        1. Population Regression Model
          1. y = β₀ + β₁x + ε
          2. Parameter Interpretation
          3. Error Term Assumptions
        2. Sample Regression Model
          1. ŷ = b₀ + b₁x
          2. Fitted Values vs. Residuals
          3. Prediction Equation
      2. Method of Least Squares
        1. Principle of Least Squares
          1. Minimizing Sum of Squared Residuals
          2. Optimization Objective
        2. Normal Equations
          1. Derivative-Based Solution
          2. Closed-Form Formulas
        3. Coefficient Estimation
          1. Slope Estimate (b₁)
          2. Intercept Estimate (b₀)
          3. Computational Formulas
      3. Interpretation of Coefficients
        1. Slope (β₁)
          1. Rate of Change Interpretation
          2. Units and Scaling
          3. Practical Meaning
        2. Intercept (β₀)
          1. Y-value When X=0
          2. Extrapolation Concerns
          3. Centering for Interpretation
      4. Assumptions of Linear Regression
        1. Linearity
          1. Linear Relationship Between Variables
          2. Functional Form Specification
        2. Independence
          1. Independent Observations
          2. No Autocorrelation
        3. Homoscedasticity
          1. Constant Error Variance
          2. Equal Spread Across X Values
        4. Normality
          1. Normal Distribution of Errors
          2. Importance for Inference
        5. No Perfect Multicollinearity
          1. Relevant for Multiple Regression
      5. Inference in Simple Linear Regression
        1. Standard Errors
          1. Sampling Variability
          2. Confidence Intervals
        2. Hypothesis Testing
          1. Testing Slope Significance
          2. t-tests for Coefficients
        3. Confidence and Prediction Intervals
          1. Confidence Intervals for Mean Response
          2. Prediction Intervals for Individual Values
          3. Width Differences
    3. Multiple Linear Regression
      1. Model Extension
        1. Multiple Predictor Variables
          1. y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + ε
          2. Partial Regression Coefficients
          3. Holding Other Variables Constant
      2. Matrix Formulation
        1. Design Matrix
        2. Vector Notation
        3. Normal Equations in Matrix Form
      3. Coefficient Interpretation
        1. Partial Effects
          1. Ceteris Paribus Interpretation
          2. Controlling for Other Variables
        2. Multicollinearity Impact
          1. Coefficient Instability
          2. Standard Error Inflation
      4. Handling Categorical Predictors
        1. Dummy Variables
          1. Binary Encoding
          2. Reference Category
          3. Interpretation of Dummy Coefficients
        2. Multiple Categories
          1. k-1 Dummy Variables for k Categories
          2. Avoiding Dummy Variable Trap
        3. Interaction Terms
          1. Product Terms
          2. Moderation Effects
          3. Interpretation Complexity
      5. Model Selection Techniques
        1. Forward Selection
          1. Starting with No Variables
          2. Adding Variables Sequentially
          3. Stopping Criteria
        2. Backward Elimination
          1. Starting with All Variables
          2. Removing Variables Sequentially
          3. Significance Thresholds
        3. Stepwise Selection
          1. Combination of Forward and Backward
          2. Bidirectional Process
        4. Best Subsets Regression
          1. Information Criteria
        5. Regularization Methods
          1. Ridge Regression Preview
          2. Lasso Regression Preview
          3. Elastic Net Preview
    4. Model Evaluation and Diagnostics
      1. Goodness of Fit Measures
        1. R-squared (Coefficient of Determination)
          1. Proportion of Variance Explained
          2. Calculation and Interpretation
          3. Range [0, 1]
          4. Limitations and Misconceptions
        2. Adjusted R-squared
          1. Penalty for Additional Variables
          2. Model Comparison Tool
          3. Formula and Interpretation
          4. When to Use vs. R-squared
        3. Root Mean Squared Error (RMSE)
          1. Prediction Error Measure
          2. Units of Original Variable
          3. Calculation and Interpretation
          4. Comparison Across Models
        4. Mean Absolute Error (MAE)
          1. Alternative Error Measure
          2. Robustness to Outliers
          3. Interpretation Advantages
      2. Residual Analysis
        1. Residual Definition
          1. Observed vs. Predicted Values
          2. Error Term Estimates
          3. Types of Residuals
        2. Checking Linearity
          1. Residuals vs. Fitted Values Plot
          2. Patterns Indicating Non-linearity
          3. Transformation Considerations
        3. Checking Homoscedasticity
          1. Constant Variance Assessment
          2. Fan-Shaped Patterns
          3. Breusch-Pagan Test
          4. White Test
        4. Checking Normality of Residuals
          1. Q-Q Plots
          2. Histogram of Residuals
          3. Shapiro-Wilk Test
          4. Kolmogorov-Smirnov Test
        5. Checking Independence
          1. Durbin-Watson Test
          2. Autocorrelation Plots
          3. Time Series Considerations
        6. Identifying Influential Points
          1. Outliers vs. Influential Points
          2. Leverage Values
          3. Cook's Distance
          4. DFBETAS and DFFITS
          5. Studentized Residuals
      3. Multicollinearity Assessment
        1. Variance Inflation Factor (VIF)
          1. Definition and Calculation
          2. Interpretation Guidelines
          3. Threshold Values (VIF > 5 or 10)
        2. Condition Index
          1. Eigenvalue-Based Measure
                                                                                                                                                                                                                                                                                          1. Multicollinearity Severity
                                                                                                                                                                                                                                                                                          2. Correlation Matrix Examination
                                                                                                                                                                                                                                                                                            1. High Pairwise Correlations
                                                                                                                                                                                                                                                                                              1. Limitations of Pairwise Approach
                                                                                                                                                                                                                                                                                              2. Remedies for Multicollinearity
                                                                                                                                                                                                                                                                                                1. Variable Removal
                                                                                                                                                                                                                                                                                                  1. Principal Components Regression
                                                                                                                                                                                                                                                                                                    1. Ridge Regression
                                                                                                                                                                                                                                                                                                      1. Data Collection Strategies
                                                                                                                                                                                                                                                                                                    2. Model Validation
                                                                                                                                                                                                                                                                                                      1. Cross-Validation
                                                                                                                                                                                                                                                                                                        1. Hold-Out Validation
                                                                                                                                                                                                                                                                                                          1. k-Fold Cross-Validation
                                                                                                                                                                                                                                                                                                            1. Leave-One-Out Cross-Validation
                                                                                                                                                                                                                                                                                                            2. Training vs. Validation vs. Test Sets
                                                                                                                                                                                                                                                                                                              1. Data Splitting Strategies
                                                                                                                                                                                                                                                                                                                1. Overfitting Prevention
                                                                                                                                                                                                                                                                                                                  1. Generalization Assessment
                                                                                                                                                                                                                                                                                                                  2. Out-of-Sample Prediction
                                                                                                                                                                                                                                                                                                                    1. Model Performance on New Data
                                                                                                                                                                                                                                                                                                                      1. Prediction Intervals
                                                                                                                                                                                                                                                                                                                        1. Model Robustness