Machine Learning with Apache Spark

  1. Core Apache Spark Concepts for Machine Learning
    1. Spark Architecture
      1. Driver Program
        1. Role and Responsibilities
        2. SparkContext and SparkSession
        3. Communication with Cluster Manager
        4. Resource Coordination
      2. Cluster Manager
        1. Standalone Cluster Manager
        2. YARN Integration
        3. Mesos Integration
        4. Kubernetes Integration
        5. Resource Allocation Strategies
      3. Executor Nodes
        1. Task Execution Environment
        2. Memory Management
        3. Storage Management
        4. JVM Configuration
      4. Spark Jobs and Execution Model
        1. Job Lifecycle
        2. Stage Division
        3. Task Scheduling
          1. DAG Scheduler
          2. Task Scheduler
    2. Spark's Data Abstractions
      1. Resilient Distributed Datasets
        1. RDD Fundamentals
          1. RDD Creation Methods
            1. Parallelizing Collections
            2. Loading External Datasets
            3. Transforming Existing RDDs
        2. RDD Operations
          1. Transformations
            1. map
            2. filter
            3. flatMap
            4. union
            5. intersection
            6. distinct
          2. Actions
            1. collect
            2. count
            3. reduce
            4. take
            5. saveAsTextFile
        3. RDD Properties
          1. Laziness and Evaluation
          2. Immutability
          3. Fault Tolerance via Lineage
          4. Partitioning
      2. DataFrames
        1. DataFrame Fundamentals
          1. Structure and Schema
          2. Creating DataFrames
            1. From RDDs
            2. From Data Sources
            3. Programmatically
        2. Advantages over RDDs for ML
          1. Catalyst Optimizer Integration
          2. Interoperability with RDDs
      3. Datasets
        1. Type-Safety Benefits
        2. Performance Optimizations
        3. Dataset API Features
        4. Encoder Usage
        5. Use Cases for Datasets
    3. Interacting with Spark
      1. Spark Shell
        1. Scala Shell
          1. Starting and Configuration
          2. Interactive Commands
        2. Python Shell (PySpark)
          1. Setup and Configuration
          2. Interactive Data Exploration
      2. Spark Submit
        1. Application Submission
        2. Resource Configuration
        3. Deployment Modes
        4. Command Line Options
      3. Notebook Environments
        1. Jupyter Notebooks
          1. Setting Up Spark with Jupyter
          2. Kernel Configuration
          3. Interactive Development Workflow
        2. Apache Zeppelin
          1. Features and Capabilities
          2. Spark Integration
          3. Visualization Support
        3. Databricks Notebooks
          1. Collaborative Features
          2. Managed Spark Environment
          3. Built-in Visualizations