MADL Encyclopedia
In progress ...
The most comprehensive open reference on LLM and multimodal model architecture. MADL (Model Architecture Description Language) is a formal, human-readable notation for describing any LLM architecture -- from a 270M dense transformer to a 1T hybrid MoE -- precisely enough to reconstruct its computation graph, yet readable enough to compare models at a glance. See the formal grammar (EBNF).
Interactive diagrams, formal equations, PyTorch code, and cross-model comparisons -- all in one place.
116+ model architectures
197 book chapters
4 deep dive analyses
18 topic parts
Architecture Dashboard
Browse 116+ model architectures with interactive SVG diagrams, filterable grid, side-by-side comparison, and a full MADL language reference. Every model parsed from source code.
Search, filter, compare, export -- all models in one view
Architecture Encyclopedia
197-chapter deep reference covering every component from basic tensor operations to frontier architectures. Math equations, PyTorch code, inline diagrams, and real-model cross-references.
18 parts -- from attention foundations to scaling laws
Model Deep Dives
Standalone one-pager analyses for the latest model releases. Full parameter breakdowns, benchmark comparisons, architecture SVG diagrams, and novel component deep dives -- verified against HuggingFace transformers source.
LFM2.5-VL, GLM-5.1, Gemma 4, and more

Model Deep Dives

One-pager technical analyses -- parameter breakdowns, architecture diagrams, benchmark tables, and novel component deep dives

Qwen3.6-35B-A3B Qwen Team / Alibaba
35B-total / 3B-active hybrid vision-language MoE. 40 layers in a 3:1 cycle of Gated DeltaNet (linear, 30 layers) and Gated Attention (full + output gate, 10 layers); 256 fine-grained experts per layer (top-8 + 1 shared) with ~3.15M-param SwiGLU experts; ViT-27 vision tower with 2x2 SpatialMerge. New in 3.6: a single Multi-Token Prediction head for self-speculative decoding and "thinking preservation" across turns. SWE-bench Verified 73.4, AIME 2026 92.7, MMMU 81.7.
Tags: Hybrid Linear+Attn · MoE 256/8 +1s · Vision · MTP · MADL
Params: 35B total / ~3B active · Context: 262K (YaRN → 1M) · License: apache-2.0
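
The layer and expert counts in the Qwen3.6-35B-A3B card can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the card. Where the Gated Attention layer sits inside each 4-layer cycle is not stated there, so placing it last in the cycle is an assumption made purely for illustration.

```python
from __future__ import annotations

# Sketch only: figures come from the card above; the position of the
# attention layer inside each 4-layer cycle is an assumption.
NUM_LAYERS = 40
CYCLE = 4                 # 3 Gated DeltaNet layers : 1 Gated Attention layer
NUM_EXPERTS = 256         # routed experts per layer
TOP_K = 8                 # routed experts active per token
NUM_SHARED = 1            # shared expert, always active
EXPERT_PARAMS = 3.15e6    # approximate parameters per SwiGLU expert


def layer_schedule(num_layers: int = NUM_LAYERS, cycle: int = CYCLE) -> list[str]:
    """Return the mixer type per layer, assuming attention closes each cycle."""
    return [
        "gated_attention" if (i + 1) % cycle == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


def expert_params(num_layers: int = NUM_LAYERS) -> tuple[float, float]:
    """Return (total, active-per-token) expert parameters across all layers."""
    total = num_layers * (NUM_EXPERTS + NUM_SHARED) * EXPERT_PARAMS
    active = num_layers * (TOP_K + NUM_SHARED) * EXPERT_PARAMS
    return total, active


schedule = layer_schedule()
total, active = expert_params()
print(schedule.count("gated_deltanet"), schedule.count("gated_attention"))
print(f"expert params: {total / 1e9:.2f}B total, {active / 1e9:.2f}B active per token")
```

Under these assumptions the experts account for roughly 32.4B of the 35B total and about 1.1B of the ~3B active parameters. The remainder would come from the mixers, embeddings, and vision tower, which the card does not break down.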
MiniMax-M2.7 MiniMax
229B-A11B sparse MoE with sigmoid routing + bias correction, partial RoPE (50%), upscaled attention dimension (6144 vs 3072 hidden), QK-norm, and 3-head multi-token prediction for self-speculative decoding. 256 experts with 8 active per token. State-of-the-art on coding and agentic benchmarks among open-source models.
Tags: MoE 256E/8A · GQA · Sigmoid Routing · MTP · MADL
Params: 229B total / 11B active · Context: 192K · License: modified-mit
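
The MiniMax-M2.7 card names "sigmoid routing + bias correction" without giving a formula. The sketch below shows one common form of the idea: each expert gets an independent sigmoid score, a per-expert bias is added only when choosing the top-k, and the gate weights are the unbiased scores renormalized over the chosen experts. Whether MiniMax-M2.7 does exactly this is an assumption; the function names and the bias value are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits, bias, top_k=8):
    """Pick top_k experts by (sigmoid score + bias); weight by unbiased scores.

    Assumption: the bias only influences which experts are selected, and the
    gate weights are the selected sigmoid scores renormalized to sum to 1.
    """
    scores = [sigmoid(z) for z in logits]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(scores[e] for e in chosen)
    return {e: scores[e] / norm for e in chosen}


random.seed(0)
num_experts = 256
logits = [random.gauss(0.0, 1.0) for _ in range(num_experts)]
bias = [0.0] * num_experts
bias[7] = 10.0  # hypothetical correction, large enough to force expert 7 into the top-8
weights = route(logits, bias)
print(sorted(weights))                   # indices of the 8 selected experts
print(round(sum(weights.values()), 6))   # gate weights sum to 1 after renormalization
```

Keeping the bias out of the gate weights means load balancing can steer which experts are chosen without distorting how much each chosen expert contributes.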
LFM2.5-VL-450M Liquid AI
Sub-500M hybrid vision-language model. Not a transformer -- uses gated short convolution blocks (10 of 16 layers) with grouped-query attention only where long-range retrieval is needed (6 layers). SigLIP2 NaFlex vision encoder with PixelUnshuffle projector. 242ms on Jetson Orin. Beats SmolVLM2-500M across all benchmarks with fewer parameters.
Tags: Hybrid Conv+Attn · Vision · Edge · 450M · MADL
Params: 450M (350M LM + 86M vision + 10M proj) · Context: 32K · License: lfm1.0
GLM-5.1 Zhipu AI / Z.AI
754B-A40B MoE flagship with Multi-head Latent Attention (MLA), DeepSeek Sparse Attention (DSA) indexer with top-2048 selection, 256 experts with sigmoid+bias-correction routing, and multi-token prediction. Agentic-coding optimized: state-of-the-art on SWE-Bench Pro, CyberGym, BrowseComp.
Tags: MoE 256E/8A · MLA · DSA · MADL
Params: 754B total / 40B active · Context: 198K · License: MIT
Gemma 4 Google DeepMind
Fourth-generation open model family with hybrid sliding/full attention, proportional RoPE, K=V weight sharing, per-layer embeddings (E-series), and parallel dense+MoE mixing. Four variants spanning server to on-device deployment. First generation where architecture -- not just scale -- is the primary axis of differentiation.
Tags: Dense · MoE · Vision · Audio · MADL
Variants: 31B, 26B-A4B, E4B, E2B · Context: 256K · License: Gemma

About This Project

A comprehensive, continuously evolving encyclopedia of LLM and multimodal model architecture, documenting 116+ architectures from 2022–2026. The site covers model components, attention mechanisms, feed-forward networks, normalization, MoE routing, SSM/recurrent alternatives, multimodal encoders, scaling laws, and design patterns -- all backed by a formal description language.

// A SELF-EVOLVING ENCYCLOPEDIA

The MADL Encyclopedia is an evolving artifact, continuously updated through an automated write–review–fix pipeline. Content improves through three mechanisms: analyzing new model releases from HuggingFace, autonomously generating and revising book chapters, and refining the underlying MADL specification as new architectural patterns emerge.

Each model architecture is described in MADL (Model Architecture Description Language) -- a formal, human-readable notation designed by studying 116+ real architectures. Every MADL file is verified against the actual huggingface/transformers source code.

// GENERATION PIPELINE

The 197-chapter book follows an autonomous write–review–fix loop:

1. Write: Claude Code generates the full chapter HTML with math, code, and SVG diagrams
2. Review: Codex/GPT reviews and scores each section independently across 10 dimensions
3. Fix: Only sections scoring <8 are revised -- good sections are preserved intact
4. Accept: Accept when overall ≥8 or max 5 fix rounds reached, then deploy automatically
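
The control flow of the four steps can be sketched as a small loop. This is not the project's pipeline code: `run_pipeline`, the `write`/`review`/`fix` callables, and the chapter-as-dict shape are hypothetical stand-ins, and only the threshold (8) and the round limit (5) come from the steps above. The toy reviewer takes the lowest section score as the overall score, which is also an assumption.

```python
from typing import Callable, Dict, Tuple

THRESHOLD = 8        # sections scoring below this are revised
MAX_FIX_ROUNDS = 5   # stop fixing after this many rounds

Chapter = Dict[str, str]                  # section name -> text (hypothetical shape)
Review = Tuple[float, Dict[str, float]]   # (overall score, per-section scores)


def run_pipeline(
    write: Callable[[], Chapter],
    review: Callable[[Chapter], Review],
    fix: Callable[[str, str], str],
) -> Tuple[Chapter, float, int]:
    """Write once, then review and fix weak sections until accepted."""
    chapter = write()
    overall, section_scores = review(chapter)
    rounds = 0
    while overall < THRESHOLD and rounds < MAX_FIX_ROUNDS:
        # Only weak sections are rewritten; passing sections stay untouched.
        for name, score in section_scores.items():
            if score < THRESHOLD:
                chapter[name] = fix(name, chapter[name])
        rounds += 1
        overall, section_scores = review(chapter)
    return chapter, overall, rounds


# Toy stand-ins: each fix adds a "+" and raises the "math" score by one point.
def toy_write() -> Chapter:
    return {"intro": "draft", "math": "draft"}


def toy_review(chapter: Chapter) -> Review:
    scores = {"intro": 9.0, "math": 6.0 + chapter["math"].count("+")}
    return min(scores.values()), scores


def toy_fix(name: str, text: str) -> str:
    return text + "+"


final, overall, rounds = run_pipeline(toy_write, toy_review, toy_fix)
print(final, overall, rounds)
```

In the toy run the "intro" section is never touched, while "math" is revised twice before the chapter is accepted.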

// SCORING DIMENSIONS

Ten criteria guide the automated review. The overall score is the minimum across all dimensions -- a chapter must excel everywhere to pass.

Dimension           What It Measures
A. Quality          Writing depth and accuracy
B. Correctness      Math equations, param formulas, code
C. Completeness     All chapter sections present and thorough
D. Clarity          Understandable at all 3 reader levels
E. Consistency      Terminology, notation, style
F. Style            MADL site tone: technical, precise
G. Illustrations    Diagram descriptions and inline SVGs
H. Model Diagrams   Architecture diagrams tied to real models
I. Math Equations   Derivations, correctness, KaTeX rendering
J. Code Examples    PyTorch quality and runnability
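
Taking the minimum rather than the mean changes which chapters pass. A minimal sketch, assuming scores arrive as a plain mapping from dimension name to number (the `overall_score` helper and the example scores are hypothetical):

```python
DIMENSIONS = [
    "Quality", "Correctness", "Completeness", "Clarity", "Consistency",
    "Style", "Illustrations", "Model Diagrams", "Math Equations", "Code Examples",
]


def overall_score(scores):
    """Return (overall, weakest dimension), where overall is the minimum score."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    weakest = min(DIMENSIONS, key=lambda d: scores[d])
    return scores[weakest], weakest


# Nine strong dimensions and one weak one (made-up numbers).
example = {d: 9.0 for d in DIMENSIONS}
example["Code Examples"] = 7.0

overall, weakest = overall_score(example)
mean = sum(example.values()) / len(example)
print(overall, weakest)   # the minimum is set by the single weak dimension
print(round(mean, 2))     # the mean alone would clear an 8-point bar
```

With these numbers the mean is 8.8 but the overall score is 7.0, so the chapter would go back for another fix round.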

// ARCHITECTURE COMPONENTS COVERED

Attention: MHA, GQA, MLA, MQA, Gated Attention, Gated DeltaNet, KDA
FFN: Standard FFN, SwiGLU, GeGLU, Mixture of Experts (MoE), Latent MoE, Shared Experts
State Space: Mamba-2, Mamba-3 (MIMO SSM)
Recurrent: mLSTM (Matrix Memory LSTM), RWKV v1–v8
Linear Attention: DeltaAttn, GatedDeltaNet, LIVConv, Hyena
Normalization: RMSNorm, LayerNorm, QK-Norm, pre/post/inside_post placement
Position: RoPE (with YaRN, partial, theta scaling), ALiBi, NoPE, Absolute
Multimodal: ViT, SigLIP2, CLIP, MLP/PixelShuffle/Resampler connectors

// HOW TO ADD A MODEL

# 1. Analyze a HuggingFace model (uses Claude API or Claude Code CLI)
python analyze.py deepseek-ai/DeepSeek-V3.2-Exp --claude-code

# 2. Fast config-only converter (no API key needed, 34 model types)
python hf_to_madl.py Qwen/Qwen3-8B

# 3. Rebuild the dashboard
python build_dashboard.py

# 4. Open in browser
open dashboard.html
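
The internals of `hf_to_madl.py` are not shown here. As a rough illustration of what a config-only pass can recover, the sketch below reads a few fields that commonly appear in Hugging Face `config.json` files (`num_hidden_layers`, `hidden_size`, `num_attention_heads`, `num_key_value_heads`) and classifies the attention variant. The helper names are hypothetical, the config values are illustrative rather than read from a real model, and the output is a plain dict, not MADL syntax.

```python
def attention_kind(num_heads: int, num_kv_heads: int) -> str:
    """Classify attention by how many key/value heads serve the query heads."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of key/value heads")
    if num_kv_heads == num_heads:
        return "MHA"   # every query head has its own K/V head
    if num_kv_heads == 1:
        return "MQA"   # all query heads share one K/V head
    return "GQA"       # query heads share K/V heads in groups


def summarize(config: dict) -> dict:
    """Pull a few architecture facts out of a config.json-style dict."""
    heads = config["num_attention_heads"]
    # Assumption: a missing num_key_value_heads means plain multi-head attention.
    kv_heads = config.get("num_key_value_heads", heads)
    return {
        "layers": config["num_hidden_layers"],
        "hidden_size": config["hidden_size"],
        "attention": attention_kind(heads, kv_heads),
        "queries_per_kv_head": heads // kv_heads,
    }


# Illustrative values only.
config = {
    "num_hidden_layers": 36,
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}
summary = summarize(config)
print(summary)
```

A config-only pass like this cannot see anything that lives in the modeling code itself, which is presumably why the source-analysis route in step 1 also exists.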

// CITATION & ATTRIBUTION

If this encyclopedia helped your research or work, whether it clarified a model architecture, shaped your thinking, or saved you research time, a mention or backlink is appreciated, even if you don't quote it directly.

@misc{kinas2026madl,
  author       = {Kinas, Remek},
  title        = {MADL Encyclopedia: A Comprehensive Reference
                  on LLM Architecture},
  year         = {2026},
  howpublished = {\url{https://madl.si5.pl}},
  note         = {116+ model architectures, 197 chapters,
                  formal MADL specification}
}

Plain text: Kinas, R. (2026). MADL Encyclopedia: A Comprehensive Reference on LLM Architecture. https://madl.si5.pl

// AUTHOR

Remek Kinas

Creator of the MADL specification, the Architecture Browser, and the automated chapter generation pipeline. Also author of OmniEvolve for evolutionary algorithm discovery.

// LICENSE

MIT License. Architecture details verified against huggingface/transformers source code. MADL model files and generated chapters may be shared freely with attribution.