Machine Learning with Apache Spark

  1. Data Preparation and Feature Engineering
    1. Loading and Saving Data
      1. Supported Data Sources
        1. Parquet Files
        2. ORC Files
        3. JSON Files
        4. CSV Files
        5. JDBC Databases
        6. Text Files
        7. Avro Files
        8. Delta Lake
      2. Reading Data into DataFrames
        1. Data Source API
        2. Schema Inference
        3. Custom Schema Definition
        4. Read Options and Configurations
      3. Writing DataFrames to Storage
        1. Write Modes
        2. Partitioning Strategies
        3. Compression Options
      4. Handling Large Datasets
        1. Memory Management
        2. Partitioning Considerations
        3. Performance Optimization
      5. Data Quality and Validation
        1. Schema Validation
        2. Data Profiling
        3. Constraint Checking
    2. Data Exploration and Manipulation with Spark SQL
      1. Querying DataFrames with SQL
        1. Registering Temporary Views
        2. Global Temporary Views
        3. Executing SQL Queries
        4. Complex Query Patterns
      2. DataFrame API Operations
        1. Selection Operations
          1. Selecting Columns
          2. Column Expressions
          3. Conditional Selection
        2. Filtering Operations
          1. Basic Filtering
          2. Complex Conditions
          3. Null Handling
        3. Grouping and Aggregation
          1. GroupBy Operations
          2. Aggregation Functions
          3. Window Functions
        4. Joining DataFrames
          1. Inner Joins
          2. Outer Joins
          3. Cross Joins
          4. Join Optimization
        5. Sorting and Ordering
          1. Single Column Sorting
          2. Multiple Column Sorting
          3. Custom Ordering
        6. Handling Missing Data
          1. Detecting Missing Values
          2. Dropping Missing Values
          3. Filling Missing Values
          4. Imputation Strategies
        7. Data Type Operations
          1. Casting Data Types
          2. Type Conversion
          3. Schema Evolution
        8. User-Defined Functions
          1. Creating UDFs
          2. Registering UDFs
          3. Performance Considerations
          4. Vectorized UDFs
    3. Feature Engineering with Spark ML
      1. Feature Extraction
        1. Text Feature Extraction
          1. TF-IDF
          2. Word2Vec
          3. CountVectorizer
          4. HashingTF
          5. N-gram Extraction
          6. StopWordsRemover
        2. Numerical Feature Extraction
          1. Polynomial Features
          2. Interaction Features
      2. Feature Transformation
        1. Categorical Transformations
          1. StringIndexer
          2. IndexToString
          3. OneHotEncoder
        2. Vector Operations
          1. VectorAssembler
          2. VectorIndexer
          3. VectorSlicer
        3. Scaling Transformations
          1. Normalizer
          2. StandardScaler
          3. MinMaxScaler
          4. MaxAbsScaler
          5. RobustScaler
        4. Discretization
          1. Bucketizer
          2. QuantileDiscretizer
        5. Dimensionality Reduction
          1. PCA
          2. SVD
      3. Feature Selection
        1. Statistical Selection
          1. ChiSqSelector
          2. Univariate Feature Selection
          3. Variance Threshold
        2. Model-Based Selection
          1. Feature Importance
          2. Recursive Feature Elimination
      4. Advanced Feature Engineering
        1. Handling Categorical Variables
          1. High Cardinality Categories
          2. Rare Category Handling
          3. Target Encoding
        2. Handling Imbalanced Data
          1. Sampling Techniques
          2. Class Weight Adjustment
          3. Synthetic Data Generation
        3. Dealing with Outliers
          1. Outlier Detection
          2. Outlier Treatment
          3. Robust Statistics
        4. Time Series Features
          1. Date/Time Extraction
          2. Lag Features
          3. Rolling Statistics