Natural Language Processing (NLP)

  1. Text Processing and Preprocessing
    1. Data Acquisition
      1. Text Corpora
        1. Corpus Types and Characteristics
        2. Corpus Annotation
        3. Corpus Licensing
      2. Web Scraping
        1. HTML Parsing
        2. API-Based Collection
        3. Ethical Considerations
      3. Data Quality Assessment
    2. Text Cleaning and Normalization
      1. Character Encoding
        1. Unicode Handling
        2. Encoding Detection
      2. HTML and Markup Removal
      3. Special Character Processing
      4. Case Normalization
      5. Whitespace Normalization
      6. Number and Symbol Handling
    3. Tokenization
      1. Word Tokenization
        1. Rule-Based Methods
        2. Statistical Methods
        3. Language-Specific Challenges
      2. Sentence Segmentation
        1. Boundary Detection
        2. Abbreviation Handling
      3. Subword Tokenization
        1. Byte-Pair Encoding
        2. WordPiece
        3. SentencePiece
        4. Unigram Language Model
    4. Lexical Processing
      1. Stop Word Removal
        1. Standard Stop Lists
        2. Domain-Specific Stop Words
        3. Impact on Tasks
      2. Stemming
        1. Porter Stemmer
        2. Snowball Stemmer
        3. Language-Specific Stemmers
      3. Lemmatization
        1. Dictionary-Based Methods
        2. Rule-Based Methods
        3. Statistical Methods
    5. Text Normalization
      1. Spelling Correction
      2. Abbreviation Expansion
      3. Slang and Informal Language
      4. Social Media Text Processing