Transformer deep learning architecture
10. Transformer Variants and Evolution
Encoder-Only Architectures
  BERT Family
    BERT (Bidirectional Encoder Representations from Transformers)
      Masked Language Modeling (see sketch after this list)
      Next Sentence Prediction
      Bidirectional Context
      Fine-tuning Approach
    RoBERTa
      Training Improvements
      Dynamic Masking
      Larger Datasets
      Hyperparameter Optimization
    ALBERT
      Parameter Sharing
      Factorized Embeddings
      Sentence Order Prediction
      Model Compression
    DeBERTa
      Disentangled Attention
      Enhanced Mask Decoder
      Relative Position Encoding
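
A minimal sketch of the masked language modeling objective shared by the BERT family: roughly 15% of input positions are selected, most are replaced by a [MASK] token, and the encoder is trained to recover the original tokens at those positions. The 80/10/10 replacement split follows the original BERT recipe; the mask_token_id and vocab_size values below are placeholders, not tied to any particular tokenizer.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: select ~15% of positions; of those, 80% become
    [MASK], 10% become a random token, 10% keep the original token.
    Labels are -100 (ignored by cross-entropy) at unselected positions."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100

    corrupted = input_ids.clone()
    split = torch.rand(input_ids.shape)
    corrupted[selected & (split < 0.8)] = mask_token_id          # 80% -> [MASK]
    random_ids = torch.randint(vocab_size, input_ids.shape)
    use_random = selected & (split >= 0.8) & (split < 0.9)       # 10% -> random
    corrupted[use_random] = random_ids[use_random]
    # The remaining 10% of selected positions are left unchanged.
    return corrupted, labels

# Toy usage: a batch of 2 sequences of 8 token ids (values are arbitrary).
ids = torch.randint(5, 100, (2, 8))
inputs, labels = mask_tokens(ids, mask_token_id=103, vocab_size=30522)
```
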
Specialized Encoder Models
DistilBERT
Knowledge Distillation
Model Compression
ELECTRA
Replaced Token Detection
Generator-Discriminator Training
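
DistilBERT's knowledge distillation trains a smaller student encoder to match a larger teacher's output distribution. Below is a minimal sketch of the soft-target part of that objective (KL divergence between temperature-softened distributions, scaled by T^2 as in Hinton et al.); the full DistilBERT loss also includes the standard MLM term and a cosine loss on hidden states, which are omitted here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation: KL(teacher || student) computed on
    temperature-softened distributions over the shared vocabulary."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # 'batchmean' averages the KL divergence over the batch dimension.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t ** 2)

# Toy usage with random logits over a 1000-entry vocabulary.
loss = distillation_loss(torch.randn(4, 1000), torch.randn(4, 1000))
```
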
Decoder-Only Architectures
  GPT Family
    GPT-1
      Unsupervised Pre-training (see sketch after this list)
      Supervised Fine-tuning
      Transfer Learning
    GPT-2
      Scale Increase
      Zero-shot Task Performance
      Improved Training
    GPT-3
      Massive Scale
      In-context Learning
      Few-shot Capabilities
      Emergent Abilities
    GPT-4 and Beyond
      Multimodal Capabilities
      Improved Reasoning
      Safety Considerations
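
Every GPT-family model shares the same decoder-only pre-training objective: predict the next token given only the tokens before it, enforced by a causal (lower-triangular) attention mask. A minimal sketch of that mask and the shifted next-token loss, where `model` is a placeholder for any decoder-only transformer that maps token ids to per-position logits.

```python
import torch
import torch.nn.functional as F

def causal_mask(seq_len):
    """Lower-triangular boolean mask: position i may attend only to j <= i.
    Inside the model, disallowed scores are set to -inf before the softmax."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def next_token_loss(model, input_ids):
    """Autoregressive LM loss: logits at position i are scored against the
    token at position i + 1; the final position has no target."""
    logits = model(input_ids)            # (batch, seq, vocab)
    targets = input_ids[:, 1:]           # shift targets left by one
    logits = logits[:, :-1, :]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```
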
  Other Decoder Models
    PaLM
      Pathways Language Model
      Scaling Laws (see formula below)
    LaMDA
      Dialogue Applications
      Safety Filtering
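
The scaling-laws item refers to the empirical finding (Kaplan et al., 2020, and later refinements) that language model loss falls off roughly as a power law in parameter count N, dataset size D, and training compute C, which motivated training models at PaLM's scale. A commonly cited form, stated without the fitted constants (N_c, D_c, C_c and the exponents are determined empirically):

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```
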
Encoder-Decoder Architectures
  T5 (Text-to-Text Transfer Transformer)
    Unified Framework
    Text-to-Text Formulation
    Pre-training Tasks
    Span Corruption
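
T5's span corruption replaces contiguous spans of the input with sentinel tokens; the target is each sentinel followed by the tokens it replaced, ending with a final sentinel. A minimal sketch operating on word strings rather than subword ids, with fixed spans instead of the random span sampling used in practice (the <extra_id_N> sentinel naming follows T5's convention).

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel token; the target lists
    each sentinel followed by the words it replaced, plus a closing sentinel."""
    source, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        source += tokens[cursor:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        cursor = end
    source += tokens[cursor:]
    target += [f"<extra_id_{len(spans)}>"]
    return source, target

tokens = "Thank you for inviting me to your party last week".split()
source, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
# source: Thank you <extra_id_0> me to your party <extra_id_1> week
# target: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```
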
  BART
    Denoising Autoencoder
    Corruption Strategies
    Sequence-to-Sequence Tasks
  Pegasus
    Abstractive Summarization
    Gap Sentence Generation
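
Pegasus pre-trains specifically for abstractive summarization with gap sentence generation: whole sentences judged important are removed from the document and the decoder must regenerate them. In the paper the importance score is ROUGE between a sentence and the rest of the document; the sketch below substitutes a crude word-overlap proxy and a hypothetical [MASK_SENT] placeholder, so it illustrates the shape of the task rather than the exact selection rule.

```python
def gap_sentence_generation(sentences, num_gaps=1):
    """Mask the sentences that overlap most with the rest of the document
    (a simplified stand-in for Pegasus's ROUGE-based selection) and use the
    masked sentences, in document order, as the generation target."""
    def overlap(i):
        rest = {w for j, s in enumerate(sentences) if j != i for w in s.split()}
        words = sentences[i].split()
        return sum(w in rest for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=overlap, reverse=True)
    gaps = set(ranked[:num_gaps])
    source = " ".join("[MASK_SENT]" if i in gaps else s
                      for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(gaps))
    return source, target

doc = ["The rocket launched at dawn.",
       "Crowds gathered to watch the rocket launch.",
       "Weather conditions were clear."]
source, target = gap_sentence_generation(doc)
```
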
Efficiency-Oriented Variants
  Sparse Attention Models
    Longformer
      Sliding Window Attention (see sketch after this list)
      Global Attention
      Linear Complexity
    BigBird
      Sparse Attention Patterns
      Random Attention
      Global Tokens
    Performer
      FAVOR+ Algorithm
      Kernel-based Approximation
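
Longformer and BigBird replace the dense O(n^2) attention pattern with sparse masks built from local windows plus a few global positions (BigBird adds random connections on top). A minimal sketch of a sliding-window mask with global tokens; the window size and global positions are illustrative, and the mask would be applied by setting disallowed attention scores to -inf before the softmax.

```python
import torch

def sliding_window_mask(seq_len, window, global_positions=()):
    """Boolean (seq_len, seq_len) mask, True where attention is allowed.
    Each token attends to neighbours within +/- window; global positions
    attend to, and are attended by, every token."""
    idx = torch.arange(seq_len)
    allowed = (idx[None, :] - idx[:, None]).abs() <= window
    for g in global_positions:
        allowed[g, :] = True      # the global token attends everywhere
        allowed[:, g] = True      # every token attends to the global token
    return allowed

mask = sliding_window_mask(seq_len=16, window=2, global_positions=(0,))
# In attention: scores.masked_fill(~mask, float("-inf")) before softmax.
```
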
  Linear Attention Models
    Linformer
      Low-rank Attention
      Linear Complexity
    Linear Transformer
      Kernel Trick Application
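
Linformer reaches linear complexity by projecting keys and values to a fixed low rank, while the Linear Transformer (and Performer, via random features) replaces the softmax with a kernel feature map so attention can be computed as phi(Q)(phi(K)^T V) in O(n) time and memory. A minimal non-causal sketch of the kernel trick using the elu(x) + 1 feature map from the Linear Transformer paper; masking and the numerical-stability details of the full methods are omitted.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V) / (phi(Q) sum_n phi(k_n)), with phi(x) = elu(x) + 1.
    Cost scales linearly in sequence length instead of quadratically."""
    phi_q = F.elu(q) + 1                                      # (batch, n, d)
    phi_k = F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", phi_k, v)               # phi(K)^T V
    z = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1))   # normalizer
    return torch.einsum("bnd,bde->bne", phi_q, kv) / (z.unsqueeze(-1) + eps)

q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
out = linear_attention(q, k, v)                               # (2, 128, 64)
```
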
  Memory-Efficient Models
    Reformer
      Locality-Sensitive Hashing (see sketch after this list)
      Reversible Layers
      Memory Optimization
    Synthesizer
      Learned Attention Patterns
      Reduced Computation
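
Reformer reduces attention cost with locality-sensitive hashing: queries and keys that hash into the same bucket are likely to have large dot products, so attention is restricted to within-bucket pairs. A simplified sketch of the angular LSH step (random projection followed by an argmax over the projection and its negation, roughly as described in the Reformer paper); bucket sorting, chunked attention, and the reversible layers that provide the memory savings are omitted.

```python
import torch

def lsh_buckets(x, n_buckets, generator=None):
    """Angular LSH: project vectors with a shared random matrix and take the
    argmax over [proj, -proj]; similar vectors tend to share a bucket id."""
    assert n_buckets % 2 == 0
    d = x.size(-1)
    # The same random rotations must be reused for queries and keys.
    rotations = torch.randn(d, n_buckets // 2, generator=generator)
    proj = x @ rotations                       # (..., n_buckets // 2)
    return torch.cat([proj, -proj], dim=-1).argmax(dim=-1)

keys = torch.randn(1, 128, 64)
buckets = lsh_buckets(keys, n_buckets=8)       # (1, 128) bucket ids
# Tokens with equal bucket ids are the candidates that attend to each other.
```
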