Big Data Technologies

  1. The Hadoop Ecosystem
    1. Overview of Apache Hadoop
      1. Core Components
        1. HDFS
        2. MapReduce
        3. YARN
      2. History and Evolution
        1. Origins and Development
        2. Major Releases and Features
        3. Community and Governance
      3. Hadoop Distributions
        1. Apache Hadoop
        2. Cloudera Distribution
        3. Hortonworks Data Platform
        4. MapR Platform
    2. Hadoop Distributed File System (HDFS)
      1. HDFS Architecture
        1. NameNode
          1. Metadata Management
          2. Single Point of Failure
          3. High Availability Solutions
        2. DataNode
          1. Data Storage
          2. Heartbeats and Block Reports
          3. Block Management
        3. Secondary NameNode
          1. Checkpointing
          2. Differences from NameNode
          3. Limitations
      2. Data Blocks and Replication
        1. Block Size
          1. Default Block Size
          2. Block Size Optimization
        2. Replication Factor
          1. Default Replication
          2. Rack Awareness
        3. Data Placement Policy
          1. Replica Placement Strategy
          2. Network Topology
      3. Reading and Writing Files
        1. Write Path
          1. Client Write Process
          2. Pipeline Replication
        2. Read Path
          1. Client Read Process
          2. Block Location Discovery
        3. Data Consistency
          1. Write-Once-Read-Many Model
          2. Consistency Guarantees
      4. HDFS Commands and API
        1. File Operations
          1. Basic File Commands
          2. Directory Operations
        2. Permissions and Access Control
          1. POSIX-Style Permissions
          2. Access Control Lists
      5. HDFS Federation
        1. Multiple NameNodes
        2. Namespace Isolation
      6. HDFS Snapshots
        1. Point-in-Time Copies
        2. Snapshot Management
    3. MapReduce Processing Paradigm
      1. Core Concepts
        1. The Map Function
          1. Input Splits
          2. Key-Value Pairs
          3. Mapper Implementation
        2. The Reduce Function
          1. Aggregation
          2. Output Generation
          3. Reducer Implementation
        3. Shuffling and Sorting
          1. Data Movement
          2. Sorting Mechanisms
          3. Partitioning
      2. MapReduce Execution Flow
        1. Job Submission
        2. Task Scheduling
        3. Task Execution
        4. Output Collection
      3. Writing a MapReduce Job
        1. Job Configuration
        2. Mapper and Reducer Implementation
        3. Input and Output Formats
        4. Job Submission and Monitoring
      4. Advanced MapReduce Concepts
        1. Combiners
        2. Partitioners
        3. Counters
        4. Distributed Cache
      5. Limitations of MapReduce
        1. Iterative Processing Challenges
        2. Latency Issues
        3. Programming Complexity
        4. Disk I/O Overhead
    4. Yet Another Resource Negotiator (YARN)
      1. YARN Architecture
        1. ResourceManager
          1. Resource Allocation
          2. Scheduler
          3. Application Management
        2. NodeManager
          1. Node Health Monitoring
          2. Container Management
          3. Local Resource Management
        3. ApplicationMaster
          1. Application Lifecycle Management
          2. Resource Negotiation
        4. Container
          1. Resource Isolation
          2. Process Execution
      2. Resource Allocation and Job Scheduling
        1. Scheduling Policies
          1. FIFO Scheduler
          2. Capacity Scheduler
          3. Fair Scheduler
        2. Multi-Tenancy Support
          1. Resource Preemption
      3. YARN Applications
        1. MapReduce on YARN
        2. Spark on YARN
        3. Custom Applications