LFM2.5-VL-450M ARCHITECTURE

Liquid AI's sub-500M hybrid vision-language model — 2026 — arXiv:2511.23404 — lfm1.0 license

LFM2.5-VL-450M is Liquid AI's second-generation vision-language model, built on a fundamentally different backbone from standard transformers. Where most VLMs stack decoder-only transformer blocks (GPT, Gemma, Qwen), LFM2.5-VL uses a minimal hybrid of gated short convolution blocks and grouped-query attention blocks as its language backbone (LFM2-350M), with a SigLIP2 NaFlex vision encoder (86M) projecting image features into the text embedding space. The result is a 450M-parameter VLM that runs inference in 242ms on Jetson Orin -- fast enough for real-time edge deployment -- while matching or exceeding the benchmark scores of similarly-sized transformer-based VLMs like SmolVLM2-500M.

The Hybrid Backbone: Not a Transformer

The LFM2-350M language model at the core of this system is not a decoder-only transformer. It uses only 6 attention layers out of 16 total. The remaining 10 layers are gated short convolution blocks -- a much cheaper token-mixing mechanism that uses depthwise convolutions (kernel size 3) with input-dependent multiplicative gating. This design was found through hardware-in-the-loop search: Liquid AI systematically tested adding linear attention, state-space models, and extra convolution operators to these stacks and found that none improved aggregate quality over the minimal conv+attention hybrid.

Gated Conv Blocks (10 of 16 layers)
Input-dependent gating: B, C, h~ = Linear(h); y = B*h~; z = Conv_k(y); o = Linear(C*z), with kernel k=3. Fast local mixing with excellent CPU cache behavior. SwiGLU FFN after each.
GQA Blocks (6 of 16 layers)
Grouped-query attention with 16 query heads, 8 KV groups, head_dim=64, QK-Norm, and RoPE positional encoding. Handles the long-range retrieval that convolutions cannot.

The architectural insight is that most token mixing in language does not require quadratic-cost attention -- local convolutions handle it with O(n) cost and better cache locality. Attention is only injected where long-range dependency retrieval is empirically necessary. This yields 2x faster prefill/decode and a smaller KV cache vs. a similarly-sized all-attention model.
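The KV-cache saving can be sketched directly from the published dimensions. A minimal estimate, assuming a BF16 cache and ignoring the conv layers' small rolling state:

```python
# Hedged sketch: KV-cache footprint of the 6-attention-layer hybrid vs a
# hypothetical all-attention 16-layer model with the same dimensions.
def kv_cache_bytes(attn_layers, kv_groups=8, head_dim=64, seq_len=32768, dtype_bytes=2):
    # K and V per attention layer: kv_groups x head_dim values per position
    return attn_layers * 2 * kv_groups * head_dim * seq_len * dtype_bytes

hybrid = kv_cache_bytes(attn_layers=6)    # LFM2: only 6 of 16 layers attend
dense = kv_cache_bytes(attn_layers=16)    # same dims, all-attention
print(hybrid / 2**20)                     # 384.0 MiB at the full 32K context
print(dense / hybrid)                     # 16/6, about 2.67x larger cache
```

The cache scales with the number of attention layers only, so replacing 10 of 16 layers with convolutions shrinks it by the same 16:6 ratio regardless of context length.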

What Changed from LFM2-VL to LFM2.5-VL

LFM2.5-VL is a post-training improvement over LFM2-VL using the same 450M architecture. The changes are entirely in training, not architecture:

  • Training data scaled from 10T to 28T tokens -- nearly 3x more pre-training data.
  • Preference optimization + RL post-training -- adds instruction-following, grounding, and reliability improvements.
  • Bounding box prediction -- new capability for object detection with spatial grounding (RefCOCO-M: 81.28).
  • Function calling support -- text-only tool-use capability (BFCLv4: 21.08).
  • Multilingual expansion to 9 languages -- English plus Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Spanish. MMMB improved from 54.29 to 68.09.
  • Instruction following dramatically improved -- MM-IFEval jumped from 32.93 to 45.00.
Model Overview
LFM2.5-VL-450M (Hybrid Conv+Attn · Vision · Edge)
  • Total Params: ~450M
  • LM Backbone: LFM2.5-350M (16 layers)
  • Vision Encoder: SigLIP2 NaFlex (~86M)
  • Context: 32,768 tokens
  • Vocab Size: 65,536 (byte-level BPE)
  • Hidden Size: 1,024
  • Layers: 16 (10 conv + 6 attn)
  • Attention Heads: 16 (8 KV groups, GQA)
  • FFN Size: 4,608 (SwiGLU)
  • Head Dim: 64
  • Max Image Res: 512x512 (tiled up to 10 tiles)
  • Training Data: 28T tokens
SmolVLM2-500M-Video-Instruct, for comparison (Transformer · Vision)
  • Total Params: ~500M
  • LM Backbone: VLlama3 (32 layers)
  • Vision Encoder: SigLIP (google/siglip-base-patch16-512)
  • Context: 8,192 tokens
  • Vocab Size: 49,280
  • Hidden Size: 960
  • Layers: 32 (all attention)
  • Attention Heads: 15 (5 KV heads, GQA)
  • FFN Size: 2,560
  • Head Dim: 64
  • Resampler: 6 layers, 16 heads, 64 latents
LFM2-VL-450M, the predecessor (Hybrid Conv+Attn · Vision · Edge)
  • Total Params: ~450M
  • LM Backbone: LFM2-350M (16 layers)
  • Vision Encoder: SigLIP2 NaFlex (~86M)
  • Hidden Size: 1,024
  • Layers: 16 (10 conv + 6 attn)
  • Training Data: 10T tokens
Parameter Comparison
Component | LFM2.5-VL-450M | SmolVLM2-500M | Notes
Architecture | Hybrid Conv + GQA | Transformer-only | LFM2 replaces 62.5% of attention layers with gated convolutions
Total Parameters | ~450M | ~500M | LFM2.5 is smaller yet outperforms on most benchmarks
LM Backbone | LFM2.5-350M | Transformer ~400M | LFM2.5 uses only 6 attention layers out of 16
Vision Encoder | SigLIP2 NaFlex 86M | SigLIP-based | Both use the SigLIP family; LFM2.5 uses the newer NaFlex variant
Hidden Size | 1,024 | -- |
Num Layers | 16 (10 conv + 6 attn) | All attention | Mixed layer types are the core architectural difference
Attention Heads | 16 (8 KV groups) | -- | GQA with 2:1 query-to-KV ratio
Head Dimension | 64 | -- |
FFN Dimension | 4,608 (SwiGLU) | -- | 4.5x hidden size, SwiGLU activation
Conv Kernel Size | 3 | N/A | Depthwise convolution in gated conv blocks
Context Length | 32,768 | -- |
Vocab Size | 65,536 | -- | Byte-level BPE tokenizer
Projector Hidden | 2,560 | -- | PixelUnshuffle + 2-layer MLP
Downsample Factor | 2 (4x token reduction) | -- | PixelUnshuffle reduces vision tokens by a factor of 4
Normalization | Pre-norm RMSNorm | -- | Throughout all layers
Positional Encoding | RoPE (attn only) | -- | Conv blocks have no explicit positional encoding
Training Data | 28T tokens | -- | Pre-training at 4K context + 1T at 32K context
Distillation | LFM1-7B teacher | -- | Top-K=32 tempered decoupled knowledge distillation
Benchmarks
Benchmark | LFM2.5-VL-450M | LFM2-VL-450M | SmolVLM2-500M | Category
MMStar | 43.00 | 40.87 | 38.20 | General VL understanding
RealWorldQA | 58.43 | 52.03 | 49.90 | Real-world visual QA
MMBench (dev en) | 60.91 | 56.27 | 52.32 | Comprehensive VL benchmark
POPE | 86.93 | 83.79 | 82.67 | Object hallucination
MMVet | 41.10 | 33.85 | 29.90 | Multi-modal conversation
OCRBench | 684 | 657 | 609 | OCR capability
RefCOCO-M | 81.28 | -- | -- | Visual grounding (bboxes)
MMMB (multilingual) | 68.09 | 54.29 | 46.79 | Multilingual VL
MM-IFEval | 45.00 | 32.93 | 11.27 | Instruction following
BFCLv4 (function call) | 21.08 | -- | -- | Tool use / function calling
LFM2.5-VL-450M leads on every benchmark. The largest improvements over the predecessor come from instruction following (MM-IFEval +12.07), multilingual understanding (MMMB +13.80), and conversational capability (MMVet +7.25). Against SmolVLM2-500M (a transformer-based model with ~10% more parameters), the advantage is consistent and often dramatic -- especially on instruction following (45.00 vs 11.27).
Edge Hardware Performance (512x512 input)
Device | Latency | Category
Jetson Orin | 242ms | Embedded GPU (robotics, IoT)
AMD Ryzen AI Max+ 395 | 944ms | Laptop NPU
Samsung S25 Ultra | 2.4s | Mobile phone
Sub-250ms on Jetson Orin enables real-time video stream processing (4+ FPS). The gated convolution layers are critical for this -- they have excellent CPU/GPU cache locality and avoid the quadratic cost of attention for local token mixing.
Per-Block Parameter Estimates (LFM2-350M Backbone)
Component | Formula | Parameters | Notes
Embedding | V × d | 67.1M | 65,536 × 1,024 (tied with lm_head)
Conv Block (×10) | | |
  Gate Linear (B, C, h~) | 3 × d × d | 3.1M | Three projections from the hidden dim
  Depthwise Conv | d × k | ~3K | 1,024 × 3 (kernel=3, depthwise)
  Output Linear | d × d | 1.0M |
  SwiGLU FFN | 3 × d × d_ff | 14.2M | gate + up (1024→4608) + down (4608→1024)
  RMSNorm ×2 | 2 × d | ~2K |
  Conv Block Total | | ~18.3M |
  × 10 blocks | | ~183M |
Attn Block (×6) | | |
  Q projection | d × (heads × head_dim) | 1.0M | 1024 × (16 × 64)
  K projection | d × (kv_groups × head_dim) | 0.5M | 1024 × (8 × 64)
  V projection | d × (kv_groups × head_dim) | 0.5M | 1024 × (8 × 64)
  Output projection | d × d | 1.0M |
  SwiGLU FFN | 3 × d × d_ff | 14.2M | Same FFN as conv blocks
  RMSNorm ×2 | 2 × d | ~2K |
  Attn Block Total | | ~17.2M |
  × 6 blocks | | ~103M |
Final RMSNorm | d | ~1K |
LM Backbone Total | | ~353M | 67M embed + 183M conv + 103M attn
SigLIP2 Vision Encoder | | ~86M | 12 ViT layers, 768 hidden, patch_size=16
Multimodal Projector | | ~10M | PixelUnshuffle + LayerNorm + 2-layer MLP (3072→2560→1024)
Grand Total | | ~449M | Matches the reported ~450M
Note: The embedding matrix is weight-tied with lm_head (tie_word_embeddings=True), so the 67M embedding parameters are not double-counted. Conv blocks are slightly larger than attention blocks per-layer because the gating mechanism requires three projections (B, C, h~) rather than Q, K, V -- but convolve at O(n) cost instead of O(n²).
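The table's arithmetic can be reproduced from the published dimensions. A back-of-envelope check (the 86M vision encoder and ~10M projector are taken from the reported figures rather than recomputed, so the total lands within ~1M of the table's grand total):

```python
# Recompute the per-block parameter estimates from the documented dimensions.
d, d_ff, V = 1024, 4608, 65536
heads, kv_groups, head_dim, k = 16, 8, 64, 3

ffn = 3 * d * d_ff                               # SwiGLU: gate + up + down
conv_block = 3*d*d + d*k + d*d + ffn + 2*d       # gates, depthwise conv, out proj, norms
attn_block = (d*heads*head_dim + 2*d*kv_groups*head_dim
              + d*d + ffn + 2*d)                 # Q, K/V, O projections, FFN, norms
embed = V * d                                    # tied with lm_head, counted once

backbone = embed + 10*conv_block + 6*attn_block + d   # + final RMSNorm
total = backbone + 86e6 + 10e6                   # reported vision encoder + projector
print(round(total / 1e6))                        # ~450M, matching the table
```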
Key Architectural Innovations
Deep Dive: Gated Short Convolution Blocks

The gated short convolution block is the most architecturally distinctive component of LFM2. It replaces self-attention as the primary token-mixing mechanism, handling 10 of 16 layers. The design draws on the observation that most token-to-token interactions in language are local (within a few positions), and that data-dependent gating provides sufficient expressive power for local mixing without the quadratic cost of attention.

Gating Mechanism
The block computes three projections from the input hidden state h:
  • B = LinearB(h) -- input gate
  • C = LinearC(h) -- output gate
  • h~ = Linearh(h) -- value projection
Then: y = B * h~ (element-wise gating), z = DepthwiseConv(y, k=3), output = Linear_out(C * z). The input-dependent gates B and C make the convolution effectively data-dependent -- different inputs activate different filter patterns.
Why Not Attention?
Liquid AI's hardware-in-the-loop search tested multiple alternatives:
  • All-attention (standard transformer) -- slower, higher memory
  • Linear attention -- no quality improvement
  • State-space models (S4, Mamba-style) -- no quality improvement
  • Additional convolution operators -- no quality improvement
The minimal conv + sparse GQA combination won on both quality and hardware efficiency at the 350M scale. The conv blocks have excellent cache behavior because depthwise convolution accesses contiguous memory with a small kernel.
Computational Cost
Per token, a conv block costs ~O(d²) for the three gate projections plus the output projection, and O(d·k) for the depthwise convolution (k=3). This is comparable to the FFN cost but avoids the O(n·d) KV-cache read and O(n²) attention score computation. At a full 32K context, the convolution mixes over k=3 positions instead of all n=32,768, a roughly n/k ≈ 10,900x reduction in token-mixing cost. The FFN (SwiGLU) that follows each conv block has the same cost as in transformer layers.
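The mixing-cost gap can be made concrete by counting per-token multiply-accumulates for the mixing step alone (projections and FFN, which both block types share, are excluded):

```python
# Per-token token-mixing MACs at full context, score computation only.
n, d, k = 32768, 1024, 3        # context length, hidden size, conv kernel
attn_scores = n * d             # dot products against all n cached key positions
conv_mix = k * d                # depthwise conv over 3 neighbours
print(attn_scores / conv_mix)   # n/k, roughly 10,923x fewer mixing MACs
```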
Receptive Field
A single conv block with kernel k=3 has a receptive field of 3 tokens. After 10 stacked conv blocks, the effective receptive field grows to 21 tokens ((k-1)·layers + 1, though gating makes this a soft bound). This is sufficient for local syntactic patterns but not for long-range dependency. That is why 6 GQA layers are interspersed -- they provide global attention to retrieve information beyond the conv receptive field.
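The 21-token figure can be verified empirically by propagating a single impulse through 10 stacked kernel-3 convolutions and counting the positions it reaches (symmetric padding is used here for illustration; a causal conv reaches the same 21 positions, all in the past):

```python
import numpy as np

def conv1d_same(x, k=3):
    # 'same'-padded all-ones kernel: marks every position a layer can reach
    pad = np.pad(x, k // 2)
    return np.array([pad[i:i + k].sum() for i in range(len(x))])

x = np.zeros(64)
x[32] = 1.0                   # single impulse mid-sequence
for _ in range(10):           # 10 stacked kernel-3 conv layers
    x = conv1d_same(x)
print(np.count_nonzero(x))    # 21 = (k-1)*layers + 1
```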
# Pseudocode: Gated Short Convolution Block (from LFM2 report)
def gated_conv_block(h, W_B, W_C, W_h, W_out, conv_weight, norm):
    x = norm(h)                          # Pre-norm RMSNorm (keep h for residual)
    B = x @ W_B                          # Input gate:  (batch, seq, d)
    C = x @ W_C                          # Output gate: (batch, seq, d)
    h_tilde = x @ W_h                    # Value proj:  (batch, seq, d)
    y = B * h_tilde                      # Gated input: element-wise
    z = depthwise_conv1d(y, conv_weight) # Local mixing: kernel_size=3
    out = (C * z) @ W_out                # Gated output
    return h + out                       # Residual adds the un-normed input
    # Followed by: h = h + swiglu_ffn(norm(h))
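The pseudocode above leaves depthwise_conv1d abstract. A self-contained NumPy sketch of a single block, with random (not released) weights, the SwiGLU FFN omitted, and the conv assumed causal to match the autoregressive setting:

```python
import numpy as np

d, k, seq = 1024, 3, 8
rng = np.random.default_rng(0)
W_B, W_C, W_h, W_out = (rng.standard_normal((d, d)) * 0.02 for _ in range(4))
conv_w = rng.standard_normal((d, k)) * 0.02   # one length-3 filter per channel

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)

def depthwise_conv1d(y, w):
    # causal depthwise conv: left-pad k-1 so position t mixes y[t-2..t]
    kk = w.shape[1]
    pad = np.pad(y, ((kk - 1, 0), (0, 0)))
    return np.stack([(pad[i:i + kk] * w.T).sum(axis=0) for i in range(y.shape[0])])

def gated_conv_block(h):
    x = rmsnorm(h)                         # pre-norm; keep h for the residual
    y = (x @ W_B) * (x @ W_h)              # input gate * value projection
    z = depthwise_conv1d(y, conv_w)        # local mixing, kernel size 3
    return h + ((x @ W_C) * z) @ W_out     # output gate, project, residual

h = rng.standard_normal((seq, d))
out = gated_conv_block(h)
print(out.shape)                           # (8, 1024): a shape-preserving mixer
```

Because the conv is causal with kernel 3, perturbing token t can only change outputs at positions t and later, which is easy to verify directly.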
Deep Dive: Vision Pipeline (SigLIP2 NaFlex + PixelUnshuffle Projector)

The vision pipeline has three stages: image preprocessing (smart resize + tiling), SigLIP2 NaFlex encoding (ViT with variable resolution support), and PixelUnshuffle + MLP projection into the LFM2 embedding space. Image features are injected into the text token sequence via placeholder-based masked scatter -- no cross-attention is used.

SigLIP2 NaFlex Encoder
  • Type: Vision Transformer (ViT)
  • Parameters: ~86M
  • Layers: 12 transformer blocks
  • Hidden size: 768
  • Attention heads: 12
  • FFN size: 3,072
  • Patch size: 16 × 16 pixels
  • Input channels: 3 (RGB)
NaFlex ("Native Flexible") preserves the original aspect ratio of input images without distortion, processing variable-resolution inputs. This is distinct from FixRes (fixed resolution) mode which resizes all images to a square.
Smart Resize + Tiling
  • Tile size: 512 × 512 pixels
  • Tiling range: 2–10 non-overlapping tiles
  • Thumbnail: Optional low-res overview for multi-tile images
  • Resize constraint: Dimensions divisible by patch_size × downsample_factor = 32
  • Token budget: min=32, max=256 per tile (tunable at inference)
  • Pixel tolerance: 2.0x max_pixels
Special tokens mark image layout: <|image_start|>, <|image_end|>, <|img_row_R_col_C|> for grid position, <|img_thumbnail|> for the global context thumbnail.
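The 256-token-per-tile ceiling follows directly from the numbers above:

```python
# Why the per-tile maximum is 256 tokens: patch grid, then 4x PixelUnshuffle.
tile, patch, downsample = 512, 16, 2
patches = (tile // patch) ** 2        # 32 x 32 = 1,024 ViT patches per tile
tokens = patches // downsample ** 2   # PixelUnshuffle folds each 2x2 patch group
print(tokens)                         # 256, the documented per-tile maximum
```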
PixelUnshuffle + MLP Projector

The multimodal projector converts SigLIP2 vision features into LFM2 text embeddings in four steps:

  1. PixelUnshuffle: Reshape (B, W, H, 768) → (B, H/2, W/2, 768×4) = (B, H/2, W/2, 3072). This expands channels while reducing spatial dimensions by downsample_factor=2 in each direction, yielding a 4x token reduction. Local spatial context is preserved because adjacent patches are folded into the channel dimension rather than discarded.
  2. LayerNorm: Normalize the 3072-dim features (projector_use_layernorm=True).
  3. Linear1 + GELU: Project 3072 → 2560 (projector_hidden_size) with GELU activation.
  4. Linear2: Project 2560 → 1024 (text hidden_size).

The projected image embeddings are then scattered into the text token sequence at positions marked by image_token_id=396. This is a direct embedding replacement (masked_scatter), not cross-attention -- the LFM2 backbone's own attention and convolution layers handle all subsequent vision-language interaction.
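The scatter step amounts to row-for-row replacement of placeholder embeddings. A toy NumPy sketch (tiny hidden size and made-up token ids for illustration; only image_token_id=396 is from the config):

```python
import numpy as np

IMAGE_TOKEN_ID = 396                      # Lfm2VlConfig default
d = 4                                     # toy hidden size for illustration
ids = np.array([7, 396, 396, 12, 396, 9])                # 3 image placeholders
text_emb = np.zeros((len(ids), d))                       # stand-in token embeddings
image_emb = np.arange(3 * d, dtype=float).reshape(3, d)  # projector output, in order

fused = text_emb.copy()
fused[ids == IMAGE_TOKEN_ID] = image_emb  # the masked_scatter step: row-for-row
print(fused[1])                           # the first projected image embedding
```

After this point the backbone sees one flat sequence of embeddings; it cannot distinguish image rows from text rows except by their content.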

# Pseudocode: Multimodal Projector (from modeling_lfm2_vl.py)
def pixel_unshuffle(features, factor=2):
    # features: (batch, width, height, channels)
    # Rearrange to fold spatial dims into channel dim
    B, W, H, C = features.shape
    features = features.reshape(B, W, H // factor, factor, C)
    features = features.reshape(B, W, H // factor, C * factor)
    features = features.transpose(1, 2)  # (B, H//factor, W, C*factor)
    features = features.reshape(B, H // factor, W // factor, factor, C * factor)
    features = features.reshape(B, H // factor, W // factor, C * factor * factor)
    return features  # (B, H/2, W/2, C*4)

class Lfm2VlMultiModalProjector:
    def forward(self, image_features):
        # image_features from SigLIP2: (B, H, W, 768)
        x = pixel_unshuffle(image_features, factor=2)  # -> (B, H/2, W/2, 3072)
        x = self.layer_norm(x)                          # LayerNorm
        x = gelu(self.linear_1(x))                      # 3072 -> 2560 + GELU
        x = self.linear_2(x)                            # 2560 -> 1024
        return x
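The pseudocode's shape bookkeeping can be checked in NumPy with the documented dimensions (the two consecutive reshapes are collapsed into one each here, which is element-equivalent):

```python
import numpy as np

B, H, W, C, f = 1, 32, 32, 768, 2   # one 512x512 tile -> 32x32 patches of dim 768
x = np.zeros((B, W, H, C))
x = x.reshape(B, W, H // f, f * C)  # fold adjacent-H pairs into channels
x = np.swapaxes(x, 1, 2)            # (B, H//f, W, f*C)
x = x.reshape(B, H // f, W // f, f * f * C)  # fold adjacent-W pairs into channels
print(x.shape)                      # (1, 16, 16, 3072): 1,024 patches -> 256 tokens
```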
Deep Dive: Training Pipeline
Pre-Training
  • Data volume: 28T tokens total (up from 10T in LFM2-VL)
  • Stage 1: 10-12T tokens at 4,096 context length
  • Stage 2: 1T additional high-quality tokens at 32,768 context (mid-training long-context extension)
  • Data mix: ~75% English, ~20% multilingual (8 languages), ~5% code
  • Distillation: Tempered, decoupled Top-K=32 from LFM1-7B teacher throughout pre-training
  • Curriculum: Ensemble of 12 models computes per-sample difficulty; training gradually introduces harder examples
Post-Training (3 Stages)
  • Stage 1 -- SFT: 5.39M samples across 67 data sources. Mix: 26.6% general-purpose, 17.1% instruction following, remainder across reasoning, grounding, multilingual, function calling
  • Stage 2 -- Preference Alignment: Length-normalized direct alignment objectives on ~700K conversations. Improves instruction following (MM-IFEval: +12 points) and reliability
  • Stage 3 -- Model Merging: Parameter-space techniques including model soup, task arithmetic, TIES, DARE, and DELLA to combine multiple post-training checkpoints without additional compute
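Of the merging techniques listed, model soup is the simplest: uniform parameter-space averaging of checkpoints. A hedged toy sketch of that baseline (TIES, DARE, and DELLA additionally sparsify and sign-resolve the per-checkpoint deltas before combining):

```python
import numpy as np

def model_soup(checkpoints):
    # Uniform average of identically-shaped checkpoint state dicts.
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

ckpt_a = {"w": np.array([1.0, 2.0])}
ckpt_b = {"w": np.array([3.0, 6.0])}
print(model_soup([ckpt_a, ckpt_b])["w"])   # [2. 4.]
```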
Code-Verified Architecture Details

Verified against huggingface/transformers/models/lfm2_vl. Key source files: configuration_lfm2_vl.py, modeling_lfm2_vl.py (auto-generated from modular_lfm2_vl.py), image_processing_lfm2_vl.py, processing_lfm2_vl.py.

Property | Value | Source / Location
Model class hierarchy | Lfm2VlForConditionalGeneration extends LlavaForConditionalGeneration + GenerationMixin | modeling_lfm2_vl.py
Core model class | Lfm2VlModel extends LlavaModel | modeling_lfm2_vl.py
Config class | Lfm2VlConfig | configuration_lfm2_vl.py
image_token_id | 396 | Lfm2VlConfig default
projector_hidden_size | 2560 | Lfm2VlConfig default
projector_hidden_act | "gelu" | Lfm2VlConfig default
projector_bias | True | Lfm2VlConfig default
projector_use_layernorm | True | Lfm2VlConfig default
downsample_factor | 2 | Lfm2VlConfig default
tie_word_embeddings | True | Lfm2VlConfig default
Vision config auto-init | CONFIG_MAPPING["siglip2_vision_model"]() | Lfm2VlConfig.__init__
Text config auto-init | CONFIG_MAPPING["lfm2"]() | Lfm2VlConfig.__init__
Vision tower type | SigLIP2 (768 hidden, 12 layers, 12 heads, patch=16) | siglip2_vision_model config
Vision-language fusion | Embedding replacement via masked_scatter | Lfm2VlModel.forward()
No cross-attention | Confirmed: image embeddings are injected into the token sequence, not attended to separately | Lfm2VlModel.forward()
Flash Attention support | _supports_flash_attn = True | Lfm2VlPreTrainedModel
SDPA support | _supports_sdpa = True | Lfm2VlPreTrainedModel
Flex Attention support | _supports_flex_attn = True | Lfm2VlPreTrainedModel
Image splitting | do_image_splitting = True, tile_size=512, min_tiles=2, max_tiles=10 | Lfm2VlImageProcessor
Thumbnail | use_thumbnail = True | Lfm2VlImageProcessor
Image token range | min=32, max=256 (user-tunable at inference) | Lfm2VlImageProcessor
Normalization | ImageNet standard (mean/std), rescale_factor=1/255 | Lfm2VlImageProcessor
KV-cache optimization | Images processed only on the first iteration; cached for subsequent autoregressive tokens | prepare_inputs_for_generation()
Spatial shape tracking | spatial_shapes tensor (batch, 2) preserves original image dimensions through the pipeline | Lfm2VlModel.forward()
Generation params | temperature=0.1, min_p=0.15, repetition_penalty=1.05 | HuggingFace model card
Deployment formats | Native (BF16), GGUF (Q4_0), ONNX, MLX (4/5/6/8-bit, bf16) | HuggingFace model card
Architecture Diagrams
Full Forward Pass: LFM2.5-VL-450M
[Diagram: full forward pass. The image input is smart-resized into 2-10 tiles of 512x512, encoded by SigLIP2 NaFlex (12 ViT layers, 768d, ~86M params), reduced 4x by PixelUnshuffle ((H, W, 768) -> (H/2, W/2, 3072)), projected by the MLP (3072 -> 2560 -> 1024), and masked-scattered over image_token_id=396 placeholders in the embedded text sequence (65,536 × 1,024, 32K context). The LFM2.5-350M backbone interleaves 10 gated conv blocks (B, C, h~ = Linear(h); z = DepthConv(B*h~, k=3); out = Linear(C*z)) with 6 GQA blocks (16 heads, 8 KV groups, head_dim=64, QK-Norm, RoPE), each followed by a SwiGLU FFN (4,608), with pre-norm RMSNorm throughout, ending in a final RMSNorm and the tied LM head (1,024 -> 65,536) producing output logits.]
Layer Stack Detail: 16-Layer Hybrid Backbone
[Diagram: LFM2.5-350M 16-layer hybrid stack. Gated conv blocks (~18.3M each) at layers 0-1, 3-4, 6-7, 9-10, and 12-13; GQA attention blocks (~17.2M each, 16 heads / 8 KV groups) at layers 2, 5, 8, 11, 14, and 15. Pattern: roughly every 2-3 conv blocks, one GQA block is inserted for long-range retrieval. The exact layer assignment was determined by hardware-in-the-loop architecture search.]

Analysis based on arXiv:2511.23404, HuggingFace model card, Liquid AI blog, and transformers source code. Layer interleaving pattern is approximate -- exact assignment is determined by Liquid AI's hardware-aware search and not published in detail.
Generated from MADL Architecture Browser
MIT License · Remek Kinas