Big Data Technologies

  1. Modern Data Processing with Apache Spark
    1. Introduction to Apache Spark
      1. Advantages over MapReduce
        1. In-Memory Processing
        2. Ease of Use
        3. Performance Improvements
        4. Iterative Algorithm Support
      2. Unified Analytics Engine
        1. Batch Processing
        2. Stream Processing
        3. Machine Learning Integration
        4. Graph Processing
      3. Spark Programming Languages
        1. Scala
        2. Python (PySpark)
        3. Java
        4. R (SparkR)
        5. SQL
    2. Spark Core Concepts
      1. Resilient Distributed Datasets (RDDs)
        1. Creation of RDDs
          1. From Files
          2. From Collections
          3. From Other RDDs
        2. Transformations
          1. Map
          2. Filter
          3. FlatMap
          4. GroupBy
          5. Join
          6. Union
        3. Actions
          1. Collect
          2. Count
          3. Save
          4. Reduce
          5. Foreach
        4. Lazy Evaluation
          1. Execution Plan
          2. Optimization
          3. Lineage Graph
        5. RDD Persistence
          1. Caching Strategies
          2. Storage Levels
      2. Spark Architecture
        1. Driver Program
          1. Job Coordination
          2. Task Scheduling
          3. SparkContext
        2. Cluster Manager
          1. Standalone
          2. YARN
          3. Mesos
          4. Kubernetes
        3. Executors
          1. Task Execution
          2. Memory Management
          3. JVM Processes
      3. Spark Execution Model
        1. Jobs and Stages
        2. Tasks and Partitions
        3. Shuffle Operations
        4. Dynamic Allocation
    3. Structured APIs
      1. DataFrames
        1. Schema Definition
        2. Operations on DataFrames
        3. Catalyst Optimizer
      2. Datasets
        1. Type Safety
        2. Encoders
        3. Performance Benefits
      3. Spark SQL
        1. SQL Queries
        2. Integration with DataFrames
        3. Hive Integration
        4. JDBC/ODBC Support
    4. Spark Ecosystem Components
      1. Spark Streaming (Legacy)
        1. DStream Abstraction
        2. Micro-Batch Processing
        3. Input Sources
        4. Output Operations
      2. Structured Streaming
        1. Event-Time Processing
        2. Continuous Applications
        3. Watermarking
        4. Triggers
      3. MLlib (Machine Learning Library)
        1. Algorithms
          1. Classification
          2. Regression
          3. Clustering
          4. Collaborative Filtering
        2. Pipelines
          1. Transformers
          2. Estimators
          3. Pipeline Construction
        3. Feature Engineering
        4. Model Selection
      4. GraphX (Graph Processing)
        1. Graph Representation
        2. Graph Algorithms
          1. PageRank
          2. Connected Components
          3. Triangle Counting
        3. Graph Operations
    5. Spark Performance Optimization
      1. Memory Management
        1. Heap Memory
        2. Off-Heap Storage
        3. Garbage Collection Tuning
      2. Partitioning Strategies
        1. Data Partitioning
        2. Custom Partitioners
      3. Caching and Persistence
        1. Storage Levels
        2. Cache Management
      4. Broadcast Variables
        1. Efficient Data Distribution
      5. Accumulators
        1. Distributed Counters