Transformer deep learning architecture
1. Foundational Concepts and Predecessors
2. The Original Transformer Architecture
3. Transformer Encoder
4. Transformer Decoder
5. Output Generation and Decoding
6. Training Methodology
7. Mathematical Foundations
8. Architectural Analysis
9. Interpretability and Analysis
10. Transformer Variants and Evolution
11. Advanced Attention Mechanisms
12. Applications and Adaptations
13. Implementation Considerations
5. Output Generation and Decoding
Final Linear Transformation
Vocabulary Projection
Hidden Dimension to Vocabulary Size
Logit Computation
Parameter Sharing
Embedding Matrix Reuse
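A minimal PyTorch sketch of this final projection step (class and variable names are illustrative, not taken from any particular library): the decoder's final hidden states of size d_model are mapped to vocabulary-size logits, and the input embedding matrix can optionally be reused as the projection weight ("weight tying"), as in the original Transformer.

```python
import torch
import torch.nn as nn

class OutputProjection(nn.Module):
    """Map final decoder hidden states (d_model) to vocabulary-size logits."""

    def __init__(self, d_model: int, vocab_size: int, embedding: nn.Embedding = None):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        if embedding is not None:
            # Parameter sharing: reuse the (vocab_size x d_model) input embedding
            # matrix as the output projection weight.
            self.proj.weight = embedding.weight

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) -> logits: (batch, seq_len, vocab_size)
        return self.proj(hidden)

# Usage: one unnormalized score (logit) per vocabulary entry at every position.
embedding = nn.Embedding(num_embeddings=32000, embedding_dim=512)
head = OutputProjection(d_model=512, vocab_size=32000, embedding=embedding)
logits = head(torch.randn(2, 10, 512))   # shape (2, 10, 32000)
```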
Probability Distribution Generation
Softmax Application
Logit to Probability Conversion
Temperature Scaling
Next Token Prediction
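A sketch of turning the last position's logits into a next-token distribution with temperature scaling (the function name and tensor shapes are assumptions for illustration). The softmax p_i = exp(z_i / T) / sum_j exp(z_j / T) sharpens the distribution for T < 1 and flattens it for T > 1.

```python
import torch

def next_token_distribution(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Convert (batch, seq_len, vocab_size) logits into a (batch, vocab_size)
    probability distribution over the next token."""
    last = logits[:, -1, :]                   # only the final position predicts the next token
    scaled = last / max(temperature, 1e-6)    # T < 1 sharpens, T > 1 flattens the distribution
    return torch.softmax(scaled, dim=-1)      # non-negative, each row sums to 1
```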
Decoding Strategies
Greedy Decoding
Highest Probability Selection
Deterministic Output
Local Optimality
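A greedy-decoding sketch, assuming a `model` callable that returns vocabulary logits and hypothetical `bos_id`/`eos_id` token ids: the single highest-probability token is selected at every step, so the output is deterministic but only locally optimal.

```python
import torch

@torch.no_grad()
def greedy_decode(model, bos_id: int, eos_id: int, max_len: int = 50) -> list:
    """Greedy decoding: always append the argmax token."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = model(torch.tensor([tokens]))     # assumed to return (1, len, vocab_size)
        next_id = int(logits[0, -1].argmax())      # highest-probability token at this step
        tokens.append(next_id)
        if next_id == eos_id:                      # stop once the end-of-sequence token appears
            break
    return tokens
```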
Sampling-based Decoding
Random Sampling
Temperature-controlled Sampling
Top-k Sampling
Top-p (Nucleus) Sampling
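A sketch of sampling-based decoding that combines temperature with optional top-k and top-p (nucleus) truncation; the function and parameter names are illustrative rather than any library's API. Tokens outside the chosen set are masked to negative infinity before the distribution is renormalized and sampled.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    """Sample one token id from last-position logits of shape (vocab_size,)."""
    logits = logits / max(temperature, 1e-6)       # temperature-controlled sampling

    if top_k > 0:
        # Top-k: keep only the k highest-scoring tokens.
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    if top_p < 1.0:
        # Top-p (nucleus): keep the smallest set of tokens whose cumulative
        # probability exceeds top_p, always retaining at least one token.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cumulative > top_p
        remove[1:] = remove[:-1].clone()
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)          # renormalize over the surviving tokens
    return int(torch.multinomial(probs, num_samples=1))
```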
Beam Search Decoding
Multiple Hypothesis Tracking
Beam Width Parameter
Length Normalization
Diverse Beam Search
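A simplified beam-search sketch under the same assumptions as above (`model`, `bos_id`, `eos_id`); `beam_width` and the length-normalization exponent `alpha` correspond to the parameters named in this outline. Every hypothesis is expanded at each step, and only the best `beam_width` candidates by length-normalized log-probability are kept. Diverse beam search (not shown) additionally penalizes candidates that are too similar across beam groups.

```python
import torch

@torch.no_grad()
def beam_search(model, bos_id: int, eos_id: int,
                beam_width: int = 4, max_len: int = 50, alpha: float = 0.6):
    """Track beam_width hypotheses; rank by log-prob / length**alpha (length normalization)."""
    beams = [([bos_id], 0.0)]                         # (token ids, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:                  # finished hypotheses are carried over
                candidates.append((tokens, score))
                continue
            logits = model(torch.tensor([tokens]))    # assumed to return (1, len, vocab_size)
            log_probs = torch.log_softmax(logits[0, -1], dim=-1)
            top = torch.topk(log_probs, beam_width)   # expand each hypothesis by its best tokens
            for lp, idx in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((tokens + [idx], score + lp))
        # Length normalization keeps longer hypotheses from being unfairly penalized.
        candidates.sort(key=lambda c: c[1] / (len(c[0]) ** alpha), reverse=True)
        beams = candidates[:beam_width]
        if all(t[-1] == eos_id for t, _ in beams):
            break
    return beams[0][0]                                # best hypothesis as a list of token ids
```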
Decoding Strategy Comparison