Big Data Technologies

  1. Supporting Ecosystem and Tools
    1. Data Ingestion and Integration
      1. Apache Sqoop
        1. Importing Data from RDBMS
        2. Exporting Data to RDBMS
        3. Incremental Imports
        4. Parallel Processing
      2. Apache Flume
        1. Log Data Collection
        2. Event Delivery
        3. Agent Configuration
        4. Reliability Mechanisms
      3. Logstash
        1. Data Pipeline Configuration
        2. Plugin Ecosystem
        3. Input, Filter, Output Plugins
      4. Apache NiFi
        1. Data Flow Management
        2. Visual Interface
        3. Provenance Tracking
      5. Talend
        1. ETL Tool
        2. Data Integration Platform
    2. Workflow Orchestration and Scheduling
      1. Apache Airflow
        1. Directed Acyclic Graphs (DAGs)
        2. Task Scheduling
        3. Operators
        4. Sensors
        5. XComs
      2. Oozie
        1. Workflow Definition
        2. Integration with Hadoop
        3. Coordinator Jobs
      3. Luigi
        1. Python-Based Workflow
        2. Dependency Resolution
      4. Prefect
        1. Modern Workflow Engine
        2. Dynamic Workflows
    3. Cluster Management and Monitoring
      1. Apache Ambari
        1. Cluster Provisioning
        2. Service Monitoring
        3. Configuration Management
      2. Cloudera Manager
        1. Configuration Management
        2. Performance Monitoring
        3. Health Checks
      3. Ganglia
        1. Distributed Monitoring
        2. Metrics Collection
      4. Nagios
        1. Infrastructure Monitoring
        2. Alerting
    4. Data Catalogs and Metadata Management
      1. Apache Atlas
        1. Metadata Collection
        2. Data Lineage Tracking
        3. Data Classification
      2. LinkedIn DataHub
        1. Metadata Platform
        2. Data Discovery
      3. AWS Glue Data Catalog
        1. Managed Metadata Repository
      4. Apache Hive Metastore
        1. Schema Repository
        2. Table Metadata