Data Cleaning

  1. Tools and Technologies for Data Cleaning
    1. Spreadsheet Applications
      1. Microsoft Excel
        1. Built-in Functions
          1. Text Functions
          2. Date Functions
          3. Lookup Functions
          4. Statistical Functions
        2. Data Validation Tools
          1. Input Restrictions
          2. Custom Validation Rules
          3. Error Alerts
          4. Data Lists
        3. Advanced Features
          1. Conditional Formatting
          2. Pivot Tables
          3. Power Query
          4. VBA Macros
      2. Google Sheets
        1. Native Functions
        2. Add-ons and Extensions
        3. Google Apps Script
        4. Collaboration Features
      3. LibreOffice Calc
        1. Open Source Alternative
        2. Macro Capabilities
        3. Extension Support
    2. Programming Languages and Libraries
      1. Python Ecosystem
        1. Core Libraries
          1. Pandas
            1. DataFrame Operations
            2. Data Manipulation
            3. Missing Data Handling
            4. Grouping and Aggregation
          2. NumPy
            1. Array Operations
            2. Mathematical Functions
            3. Broadcasting
            4. Linear Algebra
        2. Specialized Libraries
          1. Polars
            1. High-Performance DataFrames
            2. Lazy Evaluation
            3. Memory Efficiency
          2. Dask
            1. Parallel Computing
            2. Out-of-Core Processing
            3. Scalable Analytics
          3. Modin
            1. Pandas Acceleration
            2. Distributed Computing
        3. Text Processing
          1. Regular Expressions (re)
          2. String Methods
          3. Natural Language Toolkit (NLTK)
          4. spaCy
        4. Data Quality Libraries
          1. Great Expectations
          2. Pandera
          3. Cerberus
          4. Schema
        5. Utility Libraries
          1. Pyjanitor
          2. Missingno
          3. Fuzzywuzzy
          4. Dedupe
      2. R Programming
        1. Core Packages
          1. dplyr
            1. Data Manipulation Grammar
            2. Pipe Operations
            3. Grouping Functions
          2. tidyr
            1. Data Reshaping
            2. Missing Data Tools
            3. Nested Data Handling
          3. data.table
            1. High-Performance Operations
            2. Memory Efficiency
            3. Fast Aggregations
        2. String Processing
          1. stringr
          2. stringi
          3. Regular Expressions
        3. Data Quality Packages
          1. janitor
          2. VIM (Visualization and Imputation of Missing values)
          3. mice (Multiple Imputation)
          4. Hmisc
        4. Specialized Packages
          1. lubridate (Date/Time)
          2. forcats (Categorical Data)
          3. readr (Data Import)
    3. Database Systems and SQL
      1. SQL Data Manipulation
        1. Data Definition Language (DDL)
        2. Data Manipulation Language (DML)
        3. Data Query Language (DQL)
        4. Data Control Language (DCL)
      2. Advanced SQL Features
        1. Window Functions
        2. Common Table Expressions (CTEs)
        3. Stored Procedures
        4. User-Defined Functions
      3. String and Date Functions
        1. Text Processing Functions
        2. Pattern Matching
        3. Date Arithmetic
        4. Format Conversion
      4. Data Quality Constraints
        1. Primary Key Constraints
        2. Foreign Key Constraints
        3. Unique Constraints
        4. Check Constraints
        5. Not Null Constraints
      5. Database-Specific Features
        1. PostgreSQL
          1. Advanced Data Types
          2. Full-Text Search
          3. JSON Support
        2. MySQL
          1. String Functions
          2. Date Functions
          3. Regular Expressions
        3. SQL Server
          1. T-SQL Extensions
          2. Data Quality Services
          3. Integration Services
        4. Oracle
          1. PL/SQL
          2. Advanced Analytics
          3. Data Mining
    4. Specialized Data Cleaning Tools
      1. Open Source Tools
        1. OpenRefine
          1. Interactive Data Cleaning
          2. Faceting and Filtering
          3. Clustering and Reconciliation
          4. Expression Language
        2. Apache Spark
          1. Distributed Processing
          2. MLlib for Data Quality
          3. Structured Streaming
        3. Talend Open Studio
          1. ETL Processes
          2. Data Integration
          3. Job Design
      2. Commercial Tools
        1. Trifacta Wrangler
          1. Visual Data Preparation
          2. Machine Learning Suggestions
          3. Collaboration Features
        2. Alteryx Designer
          1. Drag-and-Drop Interface
          2. Predictive Analytics
          3. Spatial Analytics
        3. Informatica Data Quality
          1. Enterprise Data Quality
          2. Data Profiling
          3. Data Standardization
        4. IBM InfoSphere QualityStage
          1. Data Standardization
          2. Matching and Deduplication
          3. Data Investigation
      3. Cloud-Based Solutions
        1. AWS Glue DataBrew
        2. Google Cloud Dataprep
        3. Microsoft Power BI Dataflows
        4. Databricks Data Engineering
    5. Workflow and Pipeline Tools
      1. Apache Airflow
        1. Workflow Orchestration
        2. Task Dependencies
        3. Monitoring and Alerting
      2. Prefect
        1. Modern Workflow Engine
        2. Dynamic Workflows
        3. Error Handling
      3. Luigi
        1. Batch Job Pipeline
        2. Dependency Resolution
        3. Failure Recovery
      4. Dagster
        1. Data Pipeline Framework
        2. Type System
        3. Testing Framework