Transformer deep learning architecture

  1. Transformer Variants and Evolution
    1. Encoder-Only Architectures
      1. BERT Family
        1. BERT (Bidirectional Encoder Representations from Transformers)
          1. Masked Language Modeling (see sketch below)
          2. Next Sentence Prediction
          3. Bidirectional Context
          4. Fine-tuning Approach
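
A minimal sketch of BERT-style masked language modeling data preparation, assuming numpy and illustrative token ids (the [MASK] id and vocabulary size shown match BERT's WordPiece vocabulary, but the input tokens are placeholders). It follows the paper's 80/10/10 replacement rule for selected positions.

    import numpy as np

    rng = np.random.default_rng(0)
    MASK_ID = 103          # [MASK] id in BERT's WordPiece vocab
    VOCAB_SIZE = 30522

    def mask_for_mlm(token_ids, mask_prob=0.15):
        """Select ~15% of positions; of those, 80% -> [MASK], 10% -> random
        token, 10% -> unchanged. Labels of -100 mark unpredicted positions."""
        inputs = np.array(token_ids)
        labels = np.full_like(inputs, -100)
        selected = rng.random(inputs.shape) < mask_prob
        labels[selected] = inputs[selected]        # model must recover these
        roll = rng.random(inputs.shape)
        inputs[selected & (roll < 0.8)] = MASK_ID              # 80%: [MASK]
        rand = selected & (roll >= 0.8) & (roll < 0.9)         # 10%: random
        inputs[rand] = rng.integers(0, VOCAB_SIZE, rand.sum())
        return inputs, labels                      # remaining 10% unchanged

    inputs, labels = mask_for_mlm([2023, 2003, 1037, 7953, 6251])
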
        2. RoBERTa
          1. Training Improvements
          2. Dynamic Masking
          3. Larger Datasets
          4. Hyperparameter Optimization
        3. ALBERT
          1. Parameter Sharing
          2. Factorized Embeddings
          3. Sentence Order Prediction
          4. Model Compression
        4. DeBERTa
          1. Disentangled Attention
          2. Enhanced Mask Decoder
          3. Relative Position Encoding
      2. Specialized Encoder Models
        1. DistilBERT
          1. Knowledge Distillation (see sketch below)
          2. Model Compression
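
Knowledge distillation trains a small student to match a large teacher's output distribution. A minimal numpy sketch of a temperature-scaled soft-target loss in the style of Hinton et al. (2015); DistilBERT combines a similar term with hard-label and cosine losses. The logits here are placeholders.

    import numpy as np

    def softmax(z, T=1.0):
        z = z / T
        z = z - z.max(axis=-1, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        """Cross-entropy between softened teacher and student distributions,
        scaled by T^2 so gradient magnitudes stay comparable as T grows."""
        p_teacher = softmax(teacher_logits, T)
        log_p_student = np.log(softmax(student_logits, T) + 1e-12)
        return -(T ** 2) * np.mean((p_teacher * log_p_student).sum(axis=-1))

    # Placeholder logits: batch of 2 examples over a 5-token vocabulary.
    teacher = np.array([[4.0, 1.0, 0.2, 0.1, 0.0], [0.1, 3.5, 0.3, 0.2, 0.1]])
    student = np.array([[3.0, 1.2, 0.4, 0.2, 0.1], [0.2, 2.8, 0.5, 0.3, 0.2]])
    print(distillation_loss(student, teacher))
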
        2. ELECTRA
          1. Replaced Token Detection (see sketch below)
          2. Generator-Discriminator Training
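
ELECTRA's replaced token detection turns pretraining into per-token binary classification: a small generator fills in masked positions, and the discriminator labels every token as original or replaced. A schematic numpy sketch of how the discriminator's labels are derived; the generator's samples are faked here with fixed ids.

    import numpy as np

    original = np.array([2023, 2003, 1037, 7953, 6251])  # input token ids
    masked_positions = np.array([1, 3])                  # filled by generator

    # Stand-in for generator sampling: it sometimes guesses the original token.
    generator_samples = np.array([2003, 9999])  # pos 1 correct, pos 3 replaced

    corrupted = original.copy()
    corrupted[masked_positions] = generator_samples

    # Discriminator target: 1 where the token differs from the original, else 0.
    # Tokens the generator guessed correctly count as "original", per the paper.
    rtd_labels = (corrupted != original).astype(np.int64)
    print(corrupted, rtd_labels)   # discriminator classifies every position
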
    2. Decoder-Only Architectures
      1. GPT Family
        1. GPT-1
          1. Unsupervised Pre-training
          2. Supervised Fine-tuning
          3. Transfer Learning
        2. GPT-2
          1. Scale Increase
          2. Zero-shot Task Performance
          3. Improved Training
        3. GPT-3
          1. Massive Scale
          2. In-context Learning
          3. Few-shot Capabilities (see sketch below)
          4. Emergent Abilities
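
GPT-3's few-shot capability needs no gradient updates: demonstrations are placed directly in the prompt and the model continues the pattern. A sketch of how such a prompt is typically assembled; the format is illustrative, not an official API.

    def few_shot_prompt(task, examples, query):
        """Build an in-context learning prompt: instruction, k demos, query."""
        lines = [task]
        for x, y in examples:
            lines.append(f"Input: {x}\nOutput: {y}")
        lines.append(f"Input: {query}\nOutput:")   # the model completes this
        return "\n\n".join(lines)

    prompt = few_shot_prompt(
        "Classify the sentiment as positive or negative.",
        [("The movie was wonderful.", "positive"),
         ("I wasted two hours of my life.", "negative")],
        "A surprisingly touching story.",
    )
    print(prompt)
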
        4. GPT-4 and Beyond
          1. Multimodal Capabilities
          2. Improved Reasoning
          3. Safety Considerations
      2. Other Decoder Models
        1. PaLM
          1. Pathways Language Model
          2. Scaling Laws
        2. LaMDA
          1. Dialogue Applications
          2. Safety Filtering
    3. Encoder-Decoder Architectures
      1. T5 (Text-to-Text Transfer Transformer)
        1. Unified Framework
        2. Text-to-Text Formulation
        3. Pre-training Tasks
        4. Span Corruption (see sketch below)
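
T5's span corruption pretraining replaces contiguous spans of the input with sentinel tokens and asks the decoder to reproduce the dropped spans. A pure-Python sketch over word tokens; T5 itself operates on SentencePiece ids and samples span positions and lengths, which is simplified away here.

    def span_corrupt(tokens, spans):
        """Replace each (start, end) span with a sentinel; the target lists
        the dropped spans. `spans` must be sorted and non-overlapping."""
        inp, tgt, cursor = [], [], 0
        for i, (start, end) in enumerate(spans):
            sentinel = f"<extra_id_{i}>"
            inp.extend(tokens[cursor:start])
            inp.append(sentinel)
            tgt.append(sentinel)
            tgt.extend(tokens[start:end])
            cursor = end
        inp.extend(tokens[cursor:])
        tgt.append(f"<extra_id_{len(spans)}>")   # closing sentinel, as in T5
        return inp, tgt

    tokens = "thank you for inviting me to your party last week".split()
    inp, tgt = span_corrupt(tokens, [(2, 4), (7, 8)])
    # inp: thank you <extra_id_0> me to your <extra_id_1> last week
    # tgt: <extra_id_0> for inviting <extra_id_1> party <extra_id_2>
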
      2. BART
        1. Denoising Autoencoder
        2. Corruption Strategies
        3. Sequence-to-Sequence Tasks
      3. Pegasus
        1. Abstractive Summarization
        2. Gap Sentence Generation
    4. Efficiency-Oriented Variants
      1. Sparse Attention Models
        1. Longformer
          1. Sliding Window Attention (see sketch below)
          2. Global Attention
          3. Linear Complexity
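
Longformer's sliding window attention lets each token attend only to a fixed-radius neighborhood, so cost grows linearly in sequence length instead of quadratically. A numpy sketch of the attention mask, with a few designated global positions that attend, and are attended to, everywhere; the window size and global positions are illustrative.

    import numpy as np

    def longformer_mask(seq_len, window, global_positions=()):
        """Boolean mask: True where attention is allowed."""
        idx = np.arange(seq_len)
        # Local band: positions i, j may attend when |i - j| <= window.
        mask = np.abs(idx[:, None] - idx[None, :]) <= window
        # Global tokens attend everywhere and are attended to by everyone.
        for g in global_positions:
            mask[g, :] = True
            mask[:, g] = True
        return mask

    mask = longformer_mask(seq_len=8, window=1, global_positions=(0,))
    print(mask.astype(int))  # width-3 band plus fully connected row/column 0
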
        2. BigBird
          1. Sparse Attention Patterns
          2. Random Attention
          3. Global Tokens
        3. Performer
          1. FAVOR+ Algorithm
          2. Kernel-based Approximation
      2. Linear Attention Models
        1. Linformer
          1. Low-rank Attention (see sketch below)
          2. Linear Complexity
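
Linformer's low-rank attention projects the length dimension of keys and values from n down to a fixed k, so the attention matrix is n x k rather than n x n. A numpy sketch with random projection matrices standing in for the learned ones; dimensions are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 512, 64, 32              # sequence length, head dim, projected length

    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    E, F = rng.standard_normal((k, n)), rng.standard_normal((k, n))  # learned in practice

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    K_proj, V_proj = E @ K, F @ V                # (k, d): length dim compressed
    attn = softmax(Q @ K_proj.T / np.sqrt(d))    # (n, k) instead of (n, n)
    out = attn @ V_proj                          # (n, d)
    print(out.shape)
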
        2. Linear Transformer
          1. Kernel Trick Application (see sketch below)
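
The Linear Transformer replaces softmax with a kernel feature map phi, so attention can be computed as phi(Q) (phi(K)^T V) and the n x n matrix never materializes. A numpy sketch using the paper's phi(x) = elu(x) + 1 feature map; dimensions are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1024, 64

    def phi(x):
        """elu(x) + 1: a positive feature map, as in Katharopoulos et al. (2020)."""
        return np.where(x > 0, x + 1.0, np.exp(x))

    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    Qf, Kf = phi(Q), phi(K)

    # Associativity: (Qf Kf^T) V == Qf (Kf^T V); the right-hand grouping
    # costs O(n d^2) instead of O(n^2 d).
    kv = Kf.T @ V                          # (d, d)
    normalizer = Qf @ Kf.sum(axis=0)       # (n,) row-wise attention normalization
    out = (Qf @ kv) / normalizer[:, None]  # (n, d)
    print(out.shape)
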
      3. Memory-Efficient Models
        1. Reformer
          1. Locality-Sensitive Hashing
          2. Reversible Layers (see sketch below)
          3. Memory Optimization
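
Reformer's reversible layers avoid storing activations: given the outputs, the inputs can be recomputed exactly during the backward pass. A numpy sketch of the forward and inverse of one reversible block, with simple stand-in functions for the attention (F) and feed-forward (G) sublayers.

    import numpy as np

    rng = np.random.default_rng(0)
    W_f, W_g = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
    F = lambda x: np.tanh(x @ W_f)   # stand-in for the attention sublayer
    G = lambda x: np.tanh(x @ W_g)   # stand-in for the feed-forward sublayer

    def rev_forward(x1, x2):
        y1 = x1 + F(x2)
        y2 = x2 + G(y1)
        return y1, y2

    def rev_inverse(y1, y2):
        x2 = y2 - G(y1)              # recompute inputs from outputs...
        x1 = y1 - F(x2)              # ...so activations need not be stored
        return x1, x2

    x1, x2 = rng.standard_normal((4, 16)), rng.standard_normal((4, 16))
    y1, y2 = rev_forward(x1, x2)
    r1, r2 = rev_inverse(y1, y2)
    assert np.allclose(x1, r1) and np.allclose(x2, r2)
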
        2. Synthesizer
          1. Learned Attention Patterns
          2. Reduced Computation