Transformer Decoder
Decoder Layer Architecture
Sub-layer Organization
Masked Multi-Head Self-Attention
Encoder-Decoder Attention
Position-wise Feed-Forward Network
Residual Connections and Normalization
Autoregressive Property
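The sub-layer organization listed above can be made concrete with a short sketch. The following is a minimal PyTorch illustration of one decoder layer in the original post-norm arrangement: masked self-attention, then encoder-decoder attention, then the position-wise feed-forward network, each wrapped in a residual connection and layer normalization. The class and argument names (DecoderLayer, memory, causal_mask) are illustrative assumptions, not names from any particular reference implementation.

```python
# Minimal sketch of one Transformer decoder layer (post-norm, as in the original
# architecture): masked self-attention, encoder-decoder attention, and a
# position-wise feed-forward network, each followed by residual + LayerNorm.
import torch
import torch.nn as nn


class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads,
                                                dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, causal_mask):
        # 1) Masked self-attention over the target sequence (queries, keys,
        #    and values all come from the decoder input x).
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # 2) Encoder-decoder (cross) attention: queries from the decoder,
        #    keys and values from the encoder output ("memory").
        attn_out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + self.dropout(attn_out))
        # 3) Position-wise feed-forward network, same form as in the encoder.
        x = self.norm3(x + self.dropout(self.ffn(x)))
        return x


layer = DecoderLayer()
tgt = torch.randn(2, 7, 512)        # (batch, tgt_len, d_model) decoder input
memory = torch.randn(2, 11, 512)    # (batch, src_len, d_model) encoder output
mask = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)
print(layer(tgt, memory, mask).shape)   # torch.Size([2, 7, 512])
```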
Masked Multi-Head Self-Attention
Masking Necessity
Preventing Future Information Access
Autoregressive Generation
Training vs Inference Consistency
Look-Ahead Mask Implementation
Mask Matrix Construction
Upper Triangular Masking
Attention Score Modification
Softmax Application
Causal Attention Pattern
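The look-ahead mask steps listed above can be illustrated directly: build an upper-triangular matrix of -inf, add it to the scaled dot-product scores so every future position is blocked, and let the softmax turn those entries into exactly zero attention weights. A minimal sketch in plain PyTorch follows; the function names are assumptions made for the example.

```python
# Sketch of look-ahead (causal) masking for decoder self-attention.
# Future positions (column index > row index) receive -inf before the softmax,
# so their attention weights become exactly zero.
import math
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    # Upper-triangular matrix: -inf above the main diagonal, 0 elsewhere.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)


def masked_self_attention(q, k, v):
    # q, k, v: (seq_len, d_k) for a single head / single example.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # scaled dot products
    scores = scores + causal_mask(q.size(-2))          # block future positions
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v, weights


if __name__ == "__main__":
    torch.manual_seed(0)
    q = k = v = torch.randn(5, 8)
    out, w = masked_self_attention(q, k, v)
    # The weight matrix is lower-triangular: position i ignores positions > i.
    print(torch.allclose(w.triu(1), torch.zeros_like(w)))  # True
```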
Encoder-Decoder Attention (Cross-Attention)
Cross-Attention Mechanism
Query Source (Decoder)
Key and Value Source (Encoder)
Information Integration
Encoder-Decoder Alignment
Source-Target Correspondence
Dynamic Context Selection
Attention Pattern Analysis
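For the cross-attention sub-layer, the key point in the items above is the sourcing: queries are projections of the decoder states, while keys and values are projections of the encoder output, so each row of the attention matrix is a soft alignment over source positions. A single-head sketch, with illustrative names (decoder_states, encoder_output):

```python
# Sketch of encoder-decoder (cross) attention for a single head.
# Queries come from the decoder; keys and values come from the encoder output,
# so each attention row is a soft alignment over source positions.
import math
import torch
import torch.nn as nn

d_model = 512
w_q = nn.Linear(d_model, d_model)  # applied to decoder states
w_k = nn.Linear(d_model, d_model)  # applied to encoder output
w_v = nn.Linear(d_model, d_model)  # applied to encoder output


def cross_attention(decoder_states, encoder_output):
    # decoder_states: (tgt_len, d_model); encoder_output: (src_len, d_model)
    q = w_q(decoder_states)
    k = w_k(encoder_output)
    v = w_v(encoder_output)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)  # (tgt_len, src_len)
    weights = torch.softmax(scores, dim=-1)  # alignment over source positions
    return weights @ v, weights              # context vectors + alignment matrix


if __name__ == "__main__":
    ctx, align = cross_attention(torch.randn(3, d_model), torch.randn(7, d_model))
    print(ctx.shape, align.shape)  # torch.Size([3, 512]) torch.Size([3, 7])
```

Note that no causal mask is applied here: every decoder position may attend to all source positions, since the full source sequence is available.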
Position-wise Feed-Forward Network
Identical Structure to Encoder
Independent Processing
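As in the encoder, the feed-forward sub-layer applies the same two linear transformations with a ReLU in between to every position independently: FFN(x) = max(0, xW1 + b1)W2 + b2. A minimal sketch with the original paper's sizes (d_model = 512, d_ff = 2048):

```python
# Position-wise feed-forward network: FFN(x) = max(0, x W1 + b1) W2 + b2.
# The same two linear maps are applied to every position independently.
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048   # dimensions used in the original Transformer
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(10, d_model)   # 10 positions
print(ffn(x).shape)            # torch.Size([10, 512]) -- shape preserved per position
```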
Residual Connections and Normalization
Consistent Application
Gradient Flow Maintenance
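Each sub-layer above is wrapped identically: the sub-layer output is added to its input and the sum is layer-normalized, so an identity path carries gradients past every sub-layer in the stack. A small sketch of that wrapper, assuming the post-norm ordering of the original architecture:

```python
# Residual connection followed by layer normalization, i.e.
# output = LayerNorm(x + Sublayer(x)), applied identically around every sub-layer.
import torch
import torch.nn as nn


class ResidualNorm(nn.Module):
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # The identity term `x` gives gradients a direct path past the sub-layer.
        return self.norm(x + self.dropout(sublayer(x)))


# Example: wrapping an (illustrative) feed-forward sub-layer.
wrap = ResidualNorm()
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
x = torch.randn(4, 512)
print(wrap(x, ffn).shape)  # torch.Size([4, 512])
```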
Decoder Stack
Layer Stacking
Autoregressive Processing
Output Generation Preparation
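The decoder stack repeats the layer described above N times (N = 6 in the original paper), and its final states are projected to vocabulary logits, which the next section turns into tokens. The sketch below uses PyTorch's stock TransformerDecoderLayer/TransformerDecoder modules as a stand-in for the hand-written layer shown earlier and adds a simple greedy autoregressive loop; positional encoding is omitted for brevity, and names such as Decoder, lm_head, greedy_generate, bos_id, and eos_id are illustrative assumptions.

```python
# Sketch of the decoder stack and its autoregressive use at inference time,
# built from PyTorch's stock modules for brevity (positional encoding omitted).
# Names (Decoder, lm_head, greedy_generate, bos_id, eos_id) are illustrative.
import torch
import torch.nn as nn


class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_layers=6, num_heads=8, d_ff=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, num_heads,
                                           dim_feedforward=d_ff, batch_first=True)
        self.stack = nn.TransformerDecoder(layer, num_layers)  # N identical stacked layers
        self.lm_head = nn.Linear(d_model, vocab_size)          # prepares output generation

    def forward(self, target_ids, memory):
        # target_ids: (batch, tgt_len) token ids; memory: (batch, src_len, d_model).
        tgt_len = target_ids.size(1)
        mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
        x = self.stack(self.embed(target_ids), memory, tgt_mask=mask)
        return self.lm_head(x)                                 # logits over the vocabulary


@torch.no_grad()
def greedy_generate(decoder, memory, bos_id, eos_id, max_len=50):
    # Autoregressive loop: feed everything generated so far, take the argmax of
    # the last position's logits, append it, and repeat until EOS or max_len.
    ids = torch.tensor([[bos_id]])
    for _ in range(max_len):
        next_id = decoder(ids, memory)[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == eos_id:
            break
    return ids
```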