Transformer deep learning architecture
1. Foundational Concepts and Predecessors
2. The Original Transformer Architecture
3. Transformer Encoder
4. Transformer Decoder
5. Output Generation and Decoding
6. Training Methodology
7. Mathematical Foundations
8. Architectural Analysis
9. Interpretability and Analysis
10. Transformer Variants and Evolution
11. Advanced Attention Mechanisms
12. Applications and Adaptations
13. Implementation Considerations
4. Transformer Decoder
4.1. Decoder Layer Architecture
4.1.1. Sub-layer Organization
4.1.1.1. Masked Multi-Head Self-Attention
4.1.1.2. Encoder-Decoder Attention
4.1.1.3. Position-wise Feed-Forward Network
4.1.1.4. Residual Connections and Normalization
4.1.2. Autoregressive Property
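A minimal sketch of the sub-layer organization listed in 4.1, assuming PyTorch; the class name DecoderLayer and the hyperparameters (d_model=512, n_heads=8, d_ff=2048) are illustrative defaults taken from the original paper, not part of this outline.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, encoder-decoder attention,
    and a position-wise feed-forward network, each wrapped in a residual
    connection followed by LayerNorm (post-norm, as in the original paper)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, tgt_mask=None):
        # Sub-layer 1: masked self-attention over the target sequence.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: cross-attention; queries from the decoder, keys/values from the encoder output.
        attn_out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + self.dropout(attn_out))
        # Sub-layer 3: position-wise feed-forward network.
        x = self.norm3(x + self.dropout(self.ffn(x)))
        return x
```

The Add & Norm ordering shown here (normalization after each residual addition) follows the original Transformer; many later variants move the LayerNorm before each sub-layer instead.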
4.2. Masked Multi-Head Self-Attention
4.2.1. Masking Necessity
4.2.1.1. Preventing Future Information Access
4.2.1.2. Autoregressive Generation
4.2.1.3. Training vs Inference Consistency
4.2.2. Look-Ahead Mask Implementation
4.2.2.1. Mask Matrix Construction
4.2.2.2. Upper Triangular Masking
4.2.2.3. Attention Score Modification
4.2.2.4. Softmax Application
4.2.3. Causal Attention Pattern
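A sketch of the look-ahead mask steps listed in 4.2.2, assuming PyTorch; the function name look_ahead_mask and the toy sequence length are illustrative. The upper-triangular mask is added to the raw attention scores before the softmax, so future positions receive zero attention weight.

```python
import torch

def look_ahead_mask(size: int) -> torch.Tensor:
    """Additive causal mask: 0 on and below the diagonal, -inf strictly above it,
    so position i can attend only to positions <= i."""
    mask = torch.full((size, size), float("-inf"))
    return torch.triu(mask, diagonal=1)

# Toy example: raw query-key scores for a target sequence of length 4.
scores = torch.randn(4, 4)
masked = scores + look_ahead_mask(4)        # attention-score modification: future positions become -inf
weights = torch.softmax(masked, dim=-1)     # softmax maps -inf to 0
print(weights)  # upper triangle is all zeros: the causal attention pattern
```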
4.3. Encoder-Decoder Attention (Cross-Attention)
4.3.1. Cross-Attention Mechanism
4.3.1.1. Query Source (Decoder)
4.3.1.2. Key and Value Source (Encoder)
4.3.1.3. Information Integration
4.3.2. Encoder-Decoder Alignment
4.3.2.1. Source-Target Correspondence
4.3.2.2. Dynamic Context Selection
4.3.3. Attention Pattern Analysis
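A sketch of the cross-attention sub-layer outlined in 4.3, assuming PyTorch's nn.MultiheadAttention; the batch size, sequence lengths, and d_model are arbitrary example values. Queries come from the decoder, keys and values come from the encoder output, and the returned weights give the source-target alignment.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

encoder_output = torch.randn(2, 10, d_model)   # memory: (batch, src_len, d_model)
decoder_states = torch.randn(2, 7, d_model)    # (batch, tgt_len, d_model)

# Queries from the decoder; keys and values from the encoder output.
# No causal mask is needed here: every target position may look at the whole source.
out, attn_weights = cross_attn(decoder_states, encoder_output, encoder_output)
print(out.shape)           # (2, 7, 512)  -- one source-informed context vector per target position
print(attn_weights.shape)  # (2, 7, 10)   -- source-target alignment for each target position
```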
4.4. Position-wise Feed-Forward Network
4.4.1. Identical Structure to Encoder
4.4.2. Independent Processing
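A sketch of the position-wise feed-forward network from 4.4, assuming PyTorch; the dimensions (512, 2048) follow the original paper's defaults and are illustrative. The same two linear layers are applied to every position independently, exactly as in the encoder.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048   # illustrative; values from the original paper
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 7, d_model)   # (batch, tgt_len, d_model)
y = ffn(x)                       # applied position by position; no interaction across positions
print(y.shape)                   # torch.Size([2, 7, 512])
```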
4.5. Residual Connections and Normalization
4.5.1. Consistent Application
4.5.2. Gradient Flow Maintenance
4.6. Decoder Stack
4.6.1. Layer Stacking
4.6.2. Autoregressive Processing
4.6.3. Output Generation Preparation
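A sketch of a full decoder stack as outlined in 4.6, assuming PyTorch's built-in nn.TransformerDecoderLayer and nn.TransformerDecoder; the layer count (6) and dimensions follow the original paper and are illustrative. The final hidden states are what the output projection and softmax in Section 5 consume.

```python
import torch
import torch.nn as nn

d_model, n_heads, num_layers = 512, 8, 6   # illustrative; original-paper defaults
layer = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=2048, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

memory = torch.randn(2, 10, d_model)   # encoder output: (batch, src_len, d_model)
tgt = torch.randn(2, 7, d_model)       # embedded + position-encoded target prefix
causal = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)   # look-ahead mask

out = decoder(tgt, memory, tgt_mask=causal)
print(out.shape)   # torch.Size([2, 7, 512]) -- input to the linear + softmax output layer
```

Every layer in the stack reuses the same encoder memory and the same causal mask, so the autoregressive property is preserved through all six layers.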