Data Cleaning

  1. Common Types of Data Quality Issues
    1. Missing Data Problems
      1. Types of Missingness
        1. Missing Completely at Random (MCAR)
        2. Missing at Random (MAR)
        3. Missing Not at Random (MNAR)
      2. Missingness Patterns
        1. Univariate Missingness
        2. Monotone Missingness
        3. Arbitrary Missingness
      3. Missing Data Mechanisms
        1. System Failures
        2. Data Collection Issues
        3. User Input Errors
        4. Processing Errors
      4. Impact Assessment
        1. Analysis Bias
        2. Statistical Power Reduction
        3. Model Performance Degradation
    2. Inaccurate and Invalid Data
      1. Outliers and Anomalies
        1. Statistical Outliers
        2. Contextual Outliers
        3. Collective Outliers
      2. Data Entry Errors
        1. Typographical Errors
        2. Transcription Mistakes
        3. Copy-Paste Errors
      3. Measurement Errors
        1. Instrument Calibration Issues
        2. Human Measurement Errors
        3. Environmental Factors
      4. Factual Inaccuracies
        1. Outdated Information
        2. Incorrect References
        3. Misattributed Data
      5. Logical Inconsistencies
        1. Cross-Field Contradictions
        2. Temporal Inconsistencies
        3. Business Rule Violations
    3. Inconsistent and Redundant Data
      1. Duplicate Records
        1. Exact Duplicates
        2. Near Duplicates
        3. Partial Duplicates
      2. Contradictory Information
        1. Conflicting Values
        2. Version Conflicts
        3. Source Disagreements
      3. Format Inconsistencies
        1. Date Format Variations
        2. Number Format Differences
        3. Text Case Variations
      4. Categorical Label Inconsistencies
        1. Spelling Variations
        2. Abbreviation Differences
        3. Synonym Usage
      5. Encoding Issues
        1. Character Encoding Problems
        2. Special Character Handling
        3. Unicode Inconsistencies
    4. Structural and Formatting Problems
      1. Schema Issues
        1. Inconsistent Column Names
        2. Variable Data Types
        3. Missing Columns
      2. Data Type Mismatches
        1. Numeric Data as Text
        2. Date Data as Text
        3. Boolean Data Inconsistencies
      3. Text and String Issues
        1. Unstructured Text in Structured Fields
        2. Embedded Delimiters
        3. Leading and Trailing Whitespace
      4. Delimiter and Separator Problems
        1. Inconsistent Delimiters
        2. Escaped Characters
        3. Nested Separators
      5. Column Alignment Issues
        1. Shifted Columns
        2. Missing Headers
        3. Extra Columns
      6. Multi-Value Fields
        1. Lists in Single Fields
        2. Concatenated Values
        3. Nested Structures
    5. Irrelevant and Noisy Data
      1. Unnecessary Features
        1. Redundant Columns
        2. Derived Variables
        3. Constant Values
      2. Out-of-Scope Records
        1. Temporal Misalignment
        2. Geographic Misalignment
        3. Population Misalignment
      3. Noise and Artifacts
        1. Random Noise
        2. Systematic Noise
        3. Processing Artifacts
      4. Obsolete Information
        1. Deprecated Fields
        2. Historical Artifacts
        3. Legacy System Remnants