MADL Encyclopedia

In progress ...

The most comprehensive open reference on LLM and multimodal model architecture. MADL (Model Architecture Description Language) is a formal, human-readable notation for describing any LLM architecture -- from a 270M dense transformer to a 1T hybrid MoE -- precisely enough to reconstruct its computation graph, yet readable enough to compare models at a glance. See the formal grammar (EBNF).

Interactive diagrams, formal equations, PyTorch code, and cross-model comparisons -- all in one place.

116+ model architectures
197 book chapters
Deep dive analyses
Topic parts

Architecture Dashboard

Browse 116+ model architectures with interactive SVG diagrams, filterable grid, side-by-side comparison, and a full MADL language reference. Every model parsed from source code.

Search, filter, compare, export -- all models in one view


Architecture Encyclopedia

197-chapter deep reference covering every component from basic tensor operations to frontier architectures. Math equations, PyTorch code, inline diagrams, and real-model cross-references.

18 parts -- from attention foundations to scaling laws

Model Deep Dives

Standalone one-pager analyses for the latest model releases. Full parameter breakdowns, benchmark comparisons, architecture SVG diagrams, and novel component deep dives -- verified against HuggingFace transformers source.

LFM2.5-VL, GLM-5.1, Gemma 4, and more

Model Deep Dives

One-pager technical analyses -- parameter breakdowns, architecture diagrams, benchmark tables, and novel component deep dives

Qwen3.6-35B-A3B Qwen Team / Alibaba

35B-total / 3B-active hybrid vision-language MoE. 40 layers in a 3:1 cycle of Gated DeltaNet (linear, 30 layers) and Gated Attention (full + output gate, 10 layers); 256 fine-grained experts per layer (top-8 + 1 shared) with ~3.15M-param SwiGLU experts; ViT-27 vision tower with 2x2 SpatialMerge. New in 3.6: a single Multi-Token Prediction head for self-speculative decoding and "thinking preservation" across turns. SWE-bench Verified 73.4, AIME 2026 92.7, MMMU 81.7.

Hybrid Linear+Attn MoE 256/8 +1s Vision MTP MADL

Params: 35B total / ~3B active Context: 262K (YaRN → 1M) License: apache-2.0
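The layer and expert counts quoted for Qwen3.6-35B-A3B can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not the model's actual configuration: it assumes the full-attention layer sits in the last slot of each 4-layer cycle, and that the shared expert is counted on top of the 256 routed experts and has the same ~3.15M parameters. The function and constant names are made up for illustration.

```python
# Figures quoted in the summary above.
NUM_LAYERS = 40
CYCLE = 4                      # 3 linear layers + 1 full-attention layer
NUM_ROUTED_EXPERTS = 256
TOP_K = 8
NUM_SHARED_EXPERTS = 1
PARAMS_PER_EXPERT = 3.15e6     # approximate, per the "~3.15M" figure


def layer_schedule(num_layers: int = NUM_LAYERS, cycle: int = CYCLE) -> list[str]:
    """Return one mixer name per layer.

    Assumption: the full-attention layer is the last slot of each cycle.
    """
    return [
        "gated_attention" if (i + 1) % cycle == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


def expert_params(num_layers: int = NUM_LAYERS) -> tuple[float, float]:
    """Return (total, active-per-token) expert parameters across all layers.

    Assumption: the shared expert is extra to the 256 routed experts and
    is the same size as a routed expert.
    """
    total = (NUM_ROUTED_EXPERTS + NUM_SHARED_EXPERTS) * PARAMS_PER_EXPERT * num_layers
    active = (TOP_K + NUM_SHARED_EXPERTS) * PARAMS_PER_EXPERT * num_layers
    return total, active


schedule = layer_schedule()
total, active = expert_params()
print(schedule.count("gated_deltanet"), schedule.count("gated_attention"))
print(f"expert params: {total / 1e9:.1f}B total, {active / 1e9:.2f}B active")
```

Under these assumptions the schedule contains 30 linear and 10 full-attention layers, and the experts alone account for roughly 32.4B total and 1.13B active parameters. The rest of the quoted 35B / ~3B would then come from attention, embeddings, the vision tower, and other components not modeled here.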

MiniMax-M2.7 MiniMax

229B-A11B sparse MoE with sigmoid routing + bias correction, partial RoPE (50%), upscaled attention dimension (6144 vs 3072 hidden), QK-norm, and 3-head multi-token prediction for self-speculative decoding. 256 experts with 8 active per token. State-of-the-art on coding and agentic benchmarks among open-source models.

MoE 256E/8A GQA Sigmoid Routing MTP MADL

Params: 229B total / 11B active Context: 192K License: modified-mit
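To make "sigmoid routing + bias correction" concrete, here is a minimal pure-Python sketch for a single token. It assumes one common formulation, in which a per-expert bias only influences which experts are selected while the mixing weights come from the unbiased sigmoid scores, renormalized over the selected experts. Whether MiniMax-M2.7 matches this exactly is not stated above, and the `route` function is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits: list[float], bias: list[float], top_k: int) -> dict[int, float]:
    """Pick top_k experts for one token and return {expert_index: weight}.

    Assumed scheme: the bias affects selection only; weights use the
    unbiased sigmoid scores, renormalized to sum to 1 over the chosen experts.
    """
    scores = [sigmoid(z) for z in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    norm = sum(scores[i] for i in chosen)
    return {i: scores[i] / norm for i in chosen}


# Toy example: 4 experts, 2 active (the model itself uses 256 experts, 8 active).
logits = [2.0, 1.0, 0.0, -1.0]
no_bias = route(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
with_bias = route(logits, [0.0, 0.0, 0.0, 1.0], top_k=2)
print(sorted(no_bias))    # experts 0 and 1
print(sorted(with_bias))  # the bias pulls expert 3 into the selection
```

In the biased case expert 3 is selected, but it still receives a smaller weight than expert 0 because the weights ignore the bias. That separation is the point of this style of correction: load can be rebalanced without distorting the output mixture.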

LFM2.5-VL-450M Liquid AI

Sub-500M hybrid vision-language model. Not a pure transformer -- it uses gated short convolution blocks (10 of 16 layers) with grouped-query attention only where long-range retrieval is needed (6 layers). SigLIP2 NaFlex vision encoder with PixelUnshuffle projector. 242ms on Jetson Orin. Beats SmolVLM2-500M across all benchmarks with fewer parameters.

Hybrid Conv+Attn Vision Edge 450M MADL

Params: 450M (350M LM + 86M vision + 10M proj) Context: 32K License: lfm1.0

GLM-5.1 Zhipu AI / Z.AI

754B-A40B MoE flagship with Multi-head Latent Attention (MLA), DeepSeek Sparse Attention (DSA) indexer with top-2048 selection, 256 experts with sigmoid+bias-correction routing, and multi-token prediction. Agentic-coding optimized: state-of-the-art on SWE-Bench Pro, CyberGym, BrowseComp.

MoE 256E/8A MLA DSA MADL

Params: 754B total / 40B active Context: 198K License: MIT
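The "top-2048 selection" step of a sparse-attention indexer can be illustrated in isolation. The sketch below assumes the indexer has already produced one relevance score per past token for the current query. How GLM-5.1 computes those scores is not covered here, and `select_top_k` is a hypothetical helper.

```python
import heapq


def select_top_k(index_scores: list[float], k: int) -> list[int]:
    """Return positions of the k highest-scoring past tokens, in ascending order.

    index_scores[j] is an assumed relevance score of past token j for the
    current query. If fewer than k tokens exist, all of them are kept.
    """
    k = min(k, len(index_scores))
    top = heapq.nlargest(k, range(len(index_scores)), key=index_scores.__getitem__)
    return sorted(top)


scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7]
kept = select_top_k(scores, k=3)   # the model uses k=2048
print(kept)
```

Attention would then be computed only over the kept positions, so the per-query cost depends on k rather than on the full context length.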

Gemma 4 Google DeepMind

Fourth-generation open model family with hybrid sliding/full attention, proportional RoPE, K=V weight sharing, per-layer embeddings (E-series), and parallel dense+MoE mixing. Four variants spanning server to on-device deployment. First generation where architecture -- not just scale -- is the primary axis of differentiation.

Dense MoE Vision Audio MADL

Variants: 31B, 26B-A4B, E4B, E2B Context: 256K License: Gemma

About This Project

A comprehensive, continuously evolving encyclopedia of LLM and multimodal model architecture, documenting 116+ architectures from 2022–2026. The site covers model components, attention mechanisms, feed-forward networks, normalization, MoE routing, SSM/recurrent alternatives, multimodal encoders, scaling laws, and design patterns -- all backed by a formal description language.

// A SELF-EVOLVING ENCYCLOPEDIA

The MADL Encyclopedia is an evolving artifact, continuously updated through an automated write–review–fix pipeline. Content improves through three mechanisms: analyzing new model releases from HuggingFace, autonomously generating and revising book chapters, and refining the underlying MADL specification as new architectural patterns emerge.

Each model architecture is described in MADL (Model Architecture Description Language) -- a formal, human-readable notation designed by studying 116+ real architectures. Every MADL file is verified against the actual huggingface/transformers source code.

// GENERATION PIPELINE

The 197-chapter book follows an autonomous write–review–fix loop:

Write

Claude Code generates the full chapter HTML with math, code, and SVG diagrams

Review

Codex/GPT reviews and scores each section independently across 10 dimensions

Fix

Only sections scoring <8 are revised -- good sections are preserved intact

Accept when the overall score is ≥8 or the maximum of 5 fix rounds is reached, then deploy automatically
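The control flow of the write, review, fix loop can be sketched as follows. This is an illustrative skeleton, not the project's pipeline code: the `write`, `review`, and `fix` callables stand in for the model calls, and the sketch assumes the chapter's overall score is its lowest section score.

```python
from typing import Callable

THRESHOLD = 8
MAX_FIX_ROUNDS = 5

Chapter = dict[str, str]      # section name -> section text
Scores = dict[str, float]     # section name -> review score


def run_pipeline(
    write: Callable[[], Chapter],
    review: Callable[[Chapter], Scores],
    fix: Callable[[str, str], str],
) -> tuple[Chapter, int]:
    """Run write -> review -> fix until accepted; return (chapter, fix_rounds).

    Assumption: the overall score is the lowest section score.
    """
    chapter = write()
    for rounds in range(MAX_FIX_ROUNDS + 1):
        scores = review(chapter)
        if min(scores.values()) >= THRESHOLD or rounds == MAX_FIX_ROUNDS:
            return chapter, rounds
        for name, score in scores.items():
            if score < THRESHOLD:          # good sections are left untouched
                chapter[name] = fix(name, chapter[name])
    return chapter, MAX_FIX_ROUNDS  # not reached; the loop always returns


# Toy stand-ins: a section "passes" once it has been fixed twice.
def demo_write() -> Chapter:
    return {"intro": "ok", "math": "draft"}


def demo_review(chapter: Chapter) -> Scores:
    return {
        name: 9.0 if text == "ok" or text.count("+fix") >= 2 else 6.0
        for name, text in chapter.items()
    }


def demo_fix(name: str, text: str) -> str:
    return text + "+fix"


final, rounds = run_pipeline(demo_write, demo_review, demo_fix)
print(final, rounds)
```

In the toy run the "intro" section is never rewritten, while "math" is fixed twice before the chapter is accepted. A reviewer that never returns a passing score would stop after five fix rounds.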

// SCORING DIMENSIONS

Ten criteria guide the automated review. The overall score is the minimum across all dimensions -- a chapter must excel everywhere to pass.

Dimension            What It Measures

A. Quality           Writing depth and accuracy
B. Correctness       Math equations, param formulas, code
C. Completeness      All chapter sections present and thorough
D. Clarity           Understandable at all 3 reader levels
E. Consistency       Terminology, notation, style
F. Style             MADL site tone: technical, precise
G. Illustrations     Diagram descriptions and inline SVGs
H. Model Diagrams    Architecture diagrams tied to real models
I. Math Equations    Derivations, correctness, KaTeX rendering
J. Code Examples     PyTorch quality and runnability
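Because the overall score is a minimum rather than an average, a single weak dimension fails the whole chapter. A small sketch of that rule, using the ten dimension names from the table and the threshold of 8 from the pipeline description (the function names are illustrative, not part of the project's code):

```python
DIMENSIONS = [
    "Quality", "Correctness", "Completeness", "Clarity", "Consistency",
    "Style", "Illustrations", "Model Diagrams", "Math Equations", "Code Examples",
]


def overall_score(scores: dict[str, float]) -> float:
    """Overall score = minimum across all ten dimensions."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return min(scores[d] for d in DIMENSIONS)


def weakest_dimensions(scores: dict[str, float], threshold: float = 8.0) -> list[str]:
    """Dimensions scoring below the acceptance threshold."""
    return [d for d in DIMENSIONS if scores[d] < threshold]


example = dict.fromkeys(DIMENSIONS, 9.0)
example["Code Examples"] = 7.0
print(overall_score(example))       # 7.0
print(weakest_dimensions(example))  # ['Code Examples']
```

With an average, this example would score 8.8 and pass. With the minimum, it scores 7.0 and goes back for another fix round.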

// ARCHITECTURE COMPONENTS COVERED

Attention MHA, GQA, MLA, MQA, Gated Attention, Gated DeltaNet, KDA

FFN Standard FFN, SwiGLU, GeGLU, Mixture of Experts (MoE), Latent MoE, Shared Experts

State Space Mamba-2, Mamba-3 (MIMO SSM)

Recurrent mLSTM (Matrix Memory LSTM), RWKV v1–v8

Linear Attention DeltaAttn, GatedDeltaNet, LIVConv, Hyena

Normalization RMSNorm, LayerNorm, QK-Norm, pre/post/inside_post placement

Position RoPE (with YaRN, partial, theta scaling), ALiBi, NoPE, Absolute

Multimodal ViT, SigLIP2, CLIP, MLP/PixelShuffle/Resampler connectors

// HOW TO ADD A MODEL

# 1. Analyze a HuggingFace model (uses Claude API or Claude Code CLI)
python analyze.py deepseek-ai/DeepSeek-V3.2-Exp --claude-code

# 2. Fast config-only converter (no API key needed, 34 model types)
python hf_to_madl.py Qwen/Qwen3-8B

# 3. Rebuild the dashboard
python build_dashboard.py

# 4. Open in browser (macOS; on Linux use xdg-open)
open dashboard.html

// CITATION & ATTRIBUTION

If this encyclopedia helped your research or work -- whether it clarified a model architecture, shaped your thinking, or saved you research time -- a mention or backlink is appreciated, even if you don't quote it directly.

@misc{kinas2026madl,
  author       = {Kinas, Remek},
  title        = {MADL Encyclopedia: A Comprehensive Reference
                  on LLM Architecture},
  year         = {2026},
  howpublished = {\url{https://madl.si5.pl}},
  note         = {116+ model architectures, 197 chapters,
                  formal MADL specification}
}

Plain text: Kinas, R. (2026). MADL Encyclopedia: A Comprehensive Reference on LLM Architecture. https://madl.si5.pl

// AUTHOR

Remek Kinas

Creator of the MADL specification, the Architecture Browser, and the automated chapter generation pipeline. Also author of OmniEvolve for evolutionary algorithm discovery.

@KinasRemek

// LICENSE

MIT License. Architecture details verified against huggingface/transformers source code. MADL model files and generated chapters may be shared freely with attribution.