Transformer deep learning architecture
Matrix Multiplication
Transpose Operations
Batch Matrix Operations
Embedding Spaces
Attention Score Spaces
Softmax Distribution
Categorical Distribution
Entropy
Cross-Entropy
KL Divergence
Gradient-based Optimization
Convexity and Non-convexity
Local vs Global Optima
Self-Attention Complexity
Feed-Forward Complexity
Memory Requirements
Attention Matrix Storage