LFM2.5-VL-450M is Liquid AI's second-generation vision-language model, built on a fundamentally different backbone from standard transformers. Where most VLMs stack decoder-only transformer blocks (GPT, Gemma, Qwen), LFM2.5-VL uses a minimal hybrid of gated short convolution blocks and grouped-query attention blocks as its language backbone (LFM2-350M), with a SigLIP2 NaFlex vision encoder (86M) projecting image features into the text embedding space. The result is a 450M-parameter VLM that runs inference in 242ms on Jetson Orin -- fast enough for real-time edge deployment -- while matching or exceeding the benchmark scores of similarly-sized transformer-based VLMs like SmolVLM2-500M.
The LFM2-350M language model at the core of this system is not a decoder-only transformer. It uses only 6 attention layers out of 16 total. The remaining 10 layers are gated short convolution blocks -- a much cheaper token-mixing mechanism that uses depthwise convolutions (kernel size 3) with input-dependent multiplicative gating. This design was found through hardware-in-the-loop search: Liquid AI systematically tested adding linear attention, state-space models, and extra convolution operators to these stacks and found that none improved aggregate quality over the minimal conv+attention hybrid.
Gated short convolution block (×10): B, C, h~ = Linear(h); y = B * h~; z = Conv_k(y); o = Linear(C * z), with kernel size k=3. Fast local mixing with excellent CPU cache behavior. Each block is followed by a SwiGLU FFN.
Grouped-query attention block (×6): 16 query heads, 8 KV groups, head_dim=64, augmented with QK-Norm and RoPE positional encoding. These layers handle the long-range retrieval that convolutions cannot.
The architectural insight is that most token mixing in language does not require quadratic-cost attention -- local convolutions handle it with O(n) cost and better cache locality. Attention is only injected where long-range dependency retrieval is empirically necessary. This yields 2x faster prefill/decode and a smaller KV cache vs. a similarly-sized all-attention model.
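The cost asymmetry is easy to see with back-of-envelope arithmetic. The sketch below counts only the token-mixing multiply-adds (projections, which both paths share, are excluded); the constant factors are illustrative, but the O(n) vs. O(n²) scaling is the point:

```python
def conv_mix_flops(n, d, k=3):
    # Depthwise-conv token mixing: each of n positions mixes k neighbors
    # across d channels -> O(n * d * k) multiply-adds.
    return n * d * k

def attention_mix_flops(n, d):
    # Dot-product attention token mixing: QK^T and AV are each
    # O(n^2 * d) multiply-adds.
    return 2 * n * n * d

d = 1024  # LFM2 hidden size
for n in (512, 4096, 32768):
    ratio = attention_mix_flops(n, d) / conv_mix_flops(n, d)
    print(f"seq_len={n:>6}: attention/conv mixing cost ~ {ratio:,.0f}x")
```

The ratio grows linearly with sequence length (2n/3 here), which is why replacing 10 of 16 mixing layers with convolutions pays off most at long context.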
LFM2.5-VL is a post-training improvement over LFM2-VL built on the same 450M architecture: the gains come entirely from training changes, not architectural ones. The table below compares it against SmolVLM2-500M:
| Component | LFM2.5-VL-450M | SmolVLM2-500M | Notes |
|---|---|---|---|
| Architecture | Hybrid Conv + GQA | Transformer-only | LFM2 replaces 62.5% of attention layers with gated convolutions |
| Total Parameters | ~450M | ~500M | LFM2.5 is smaller yet outperforms on most benchmarks |
| LM Backbone | LFM2.5-350M | Transformer ~400M | LFM2.5 uses only 6 attention layers out of 16 |
| Vision Encoder | SigLIP2 NaFlex 86M | SigLIP-based | Both use SigLIP family; LFM2.5 uses newer NaFlex variant |
| Hidden Size | 1,024 | -- | |
| Num Layers | 16 (10 conv + 6 attn) | All attention | Mixed layer types are the core architectural difference |
| Attention Heads | 16 (8 KV groups) | -- | GQA with 2:1 query-to-KV ratio |
| Head Dimension | 64 | -- | |
| FFN Dimension | 4,608 (SwiGLU) | -- | 4.5x hidden size, SwiGLU activation |
| Conv Kernel Size | 3 | N/A | Depthwise convolution in gated conv blocks |
| Context Length | 32,768 | -- | |
| Vocab Size | 65,536 | -- | Byte-level BPE tokenizer |
| Projector Hidden | 2,560 | -- | PixelUnshuffle + 2-layer MLP |
| Downsample Factor | 2 (4x token reduction) | -- | PixelUnshuffle reduces vision tokens by factor 4 |
| Normalization | Pre-norm RMSNorm | -- | Throughout all layers |
| Positional Encoding | RoPE (attn only) | -- | Conv blocks have no explicit positional encoding |
| Training Data | 28T tokens | -- | Pre-training at 4K context + 1T at 32K context |
| Distillation | LFM1-7B teacher | -- | Top-K=32 tempered decoupled knowledge distillation |
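The dimensions in the table above can be collected into a single config sketch. The field names below loosely follow the Lfm2VlConfig/Lfm2Config naming style but are illustrative, not the exact HF schema:

```python
# Architecture hyperparameters from the table above, as a plain dict.
# Field names are illustrative (loosely modeled on Lfm2VlConfig).
lfm25_vl_450m = {
    "hidden_size": 1024,
    "num_hidden_layers": 16,      # 10 gated conv + 6 GQA attention
    "num_attention_heads": 16,
    "num_key_value_heads": 8,     # GQA: 2:1 query-to-KV ratio
    "head_dim": 64,
    "intermediate_size": 4608,    # SwiGLU FFN, 4.5x hidden size
    "conv_kernel_size": 3,
    "max_position_embeddings": 32768,
    "vocab_size": 65536,
    "projector_hidden_size": 2560,
    "downsample_factor": 2,       # PixelUnshuffle: 4x vision-token reduction
    "tie_word_embeddings": True,
}

# Consistency checks implied by the table
assert lfm25_vl_450m["num_attention_heads"] * lfm25_vl_450m["head_dim"] \
    == lfm25_vl_450m["hidden_size"]
assert lfm25_vl_450m["intermediate_size"] == 4.5 * lfm25_vl_450m["hidden_size"]
```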
| Benchmark | LFM2.5-VL-450M | LFM2-VL-450M | SmolVLM2-500M | Category |
|---|---|---|---|---|
| MMStar | 43.00 | 40.87 | 38.20 | General VL understanding |
| RealWorldQA | 58.43 | 52.03 | 49.90 | Real-world visual QA |
| MMBench (dev en) | 60.91 | 56.27 | 52.32 | Comprehensive VL benchmark |
| POPE | 86.93 | 83.79 | 82.67 | Object hallucination |
| MMVet | 41.10 | 33.85 | 29.90 | Multi-modal conversation |
| OCRBench | 684 | 657 | 609 | OCR capability |
| RefCOCO-M | 81.28 | -- | -- | Visual grounding (bboxes) |
| MMMB (multilingual) | 68.09 | 54.29 | 46.79 | Multilingual VL |
| MM-IFEval | 45.00 | 32.93 | 11.27 | Instruction following |
| BFCLv4 (function call) | 21.08 | -- | -- | Tool use / function calling |
| Device | Latency | Category |
|---|---|---|
| Jetson Orin | 242ms | Embedded GPU (robotics, IoT) |
| AMD Ryzen AI Max+ 395 | 944ms | Laptop NPU |
| Samsung S25 Ultra | 2.4s | Mobile phone |
| Component | Formula | Parameters | Notes |
|---|---|---|---|
| Embedding | V × d | 67.1M | 65,536 × 1,024 (tied with lm_head) |
| Conv Block ×10 | | | |
| Gate Linear (B,C,h~) | 3 × d × d | 3.1M | Three projections from hidden dim |
| Depthwise Conv | d × k | ~3K | 1,024 × 3 (kernel=3, depthwise) |
| Output Linear | d × d | 1.0M | |
| SwiGLU FFN | 3 × d × d_ff | 14.2M | gate + up (1024→4608) + down (4608→1024) |
| RMSNorm ×2 | 2 × d | ~2K | |
| Conv Block Total | | ~18.3M | |
| × 10 blocks | | ~183M | |
| Attn Block ×6 | | | |
| Q projection | d × (heads × head_dim) | 1.0M | 1024 × (16×64) |
| K projection | d × (kv_groups × head_dim) | 0.5M | 1024 × (8×64) |
| V projection | d × (kv_groups × head_dim) | 0.5M | 1024 × (8×64) |
| Output projection | d × d | 1.0M | |
| SwiGLU FFN | 3 × d × d_ff | 14.2M | Same FFN as conv blocks |
| RMSNorm ×2 | 2 × d | ~2K | |
| Attn Block Total | | ~17.2M | |
| × 6 blocks | | ~103M | |
| Final RMSNorm | d | ~1K | |
| LM Backbone Total | | ~353M | 67M embed + 183M conv + 103M attn |
| SigLIP2 Vision Encoder | | ~86M | 12 ViT layers, 768 hidden, patch_size=16 |
| Multimodal Projector | | ~10M | PixelUnshuffle + LayerNorm + 2-layer MLP (3072→2560→1024) |
| Grand Total | | ~449M | Matches the reported ~450M |
The embedding and lm_head weights are tied (tie_word_embeddings=True), so the 67M embedding parameters are not double-counted. Conv blocks are slightly larger than attention blocks per layer because the gating mechanism requires three projections (B, C, h~) rather than Q, K, V -- but they convolve at O(n) cost instead of O(n²). The full operation, o = Linear(C * Conv_k(B * h~)) where B, C, and h~ are all linear projections of the input, provides data-dependent routing at O(n) cost. Hardware-in-the-loop search confirmed this beats SSMs, linear attention, and extra convolution variants at this model size.
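The totals in the parameter table can be reproduced from the stated dimensions alone. The check below (biases omitted, matching the table) agrees with the table's per-block and aggregate figures to within rounding:

```python
# Reproduce the parameter table from the architecture dimensions.
d, d_ff, k = 1024, 4608, 3
n_heads, n_kv, head_dim = 16, 8, 64
vocab = 65536

# Conv block: three gate projections + depthwise conv + output proj + FFN + norms
conv_block = 3*d*d + d*k + d*d + 3*d*d_ff + 2*d
# Attn block: Q + K/V (GQA) + output proj + FFN + norms
attn_block = d*n_heads*head_dim + 2*d*n_kv*head_dim + d*d + 3*d*d_ff + 2*d
# Backbone: tied embedding + 10 conv blocks + 6 attn blocks + final norm
backbone = vocab*d + 10*conv_block + 6*attn_block + d

print(f"conv block:  {conv_block/1e6:.2f}M")
print(f"attn block:  {attn_block/1e6:.2f}M")
print(f"LM backbone: {backbone/1e6:.1f}M")
print(f"grand total: {(backbone + 86e6 + 10e6)/1e6:.0f}M")
```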
The effective stride per image token is encoder_patch_size * downsample_factor = 32. Large images are tiled into non-overlapping 512x512 patches (2-10 tiles) with an optional low-resolution thumbnail for global context. The token budget is tunable at inference: min_image_tokens=32, max_image_tokens=256.
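The token arithmetic follows directly from the stride. A sketch, assuming the stride of 32 applies per tile and that the thumbnail costs one tile's worth of tokens (the thumbnail budget is an assumption, not taken from the source):

```python
def tokens_per_tile(tile_size=512, patch_size=16, downsample_factor=2):
    # Effective stride per output token is patch_size * downsample_factor = 32,
    # so a 512x512 tile yields (512 // 32) ** 2 = 256 tokens -- exactly the
    # max_image_tokens budget quoted above.
    stride = patch_size * downsample_factor
    return (tile_size // stride) ** 2

def image_tokens(num_tiles, use_thumbnail=True):
    # Hypothetical total: per-tile tokens, plus one tile's worth for the
    # global-context thumbnail (assumed, not verified against the processor).
    total = num_tiles * tokens_per_tile()
    if use_thumbnail:
        total += tokens_per_tile()
    return total

print(tokens_per_tile())  # 256
print(image_tokens(4))    # 1280 (4 tiles + thumbnail)
```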
The gated short convolution block is the most architecturally distinctive component of LFM2. It replaces self-attention as the primary token-mixing mechanism, handling 10 of 16 layers. The design draws on the observation that most token-to-token interactions in language are local (within a few positions), and that data-dependent gating provides sufficient expressive power for local mixing without the quadratic cost of attention.
y = B * h~ (element-wise gating), z = DepthwiseConv(y, k=3), output = Linear_out(C * z). The input-dependent gates B and C make the convolution effectively data-dependent -- different inputs activate different filter patterns.
# Pseudocode: Gated Short Convolution Block (from LFM2 report)
def gated_conv_block(h, W_B, W_C, W_h, W_out, conv_weight, norm):
    x = norm(h)                           # Pre-norm RMSNorm; keep h for the residual
    B = x @ W_B                           # Input gate: (batch, seq, d)
    C = x @ W_C                           # Output gate: (batch, seq, d)
    h_tilde = x @ W_h                     # Value proj: (batch, seq, d)
    y = B * h_tilde                       # Gated input: element-wise
    z = depthwise_conv1d(y, conv_weight)  # Local mixing: kernel_size=3
    out = (C * z) @ W_out                 # Gated output
    return h + out                        # Residual adds the un-normalized input
# Followed by: h = h + swiglu_ffn(norm(h))
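The pseudocode above can be made executable with NumPy and random weights as a shape check. This is a minimal sketch, not the LFM2 implementation: the depthwise convolution is assumed causal (left-padded), as is typical for autoregressive LMs, and the weight scales are arbitrary.

```python
import numpy as np

def depthwise_conv1d(y, w):
    # Causal depthwise conv: each channel gets its own length-k filter,
    # left-padded so position t only sees positions <= t (assumed causal).
    B, T, D = y.shape
    k = w.shape[1]  # w: (D, k)
    y_pad = np.pad(y, ((0, 0), (k - 1, 0), (0, 0)))
    out = np.zeros_like(y)
    for i in range(k):
        out += y_pad[:, i:i + T, :] * w[:, i]
    return out

def rmsnorm(h, eps=1e-6):
    return h / np.sqrt((h * h).mean(-1, keepdims=True) + eps)

def gated_conv_block(h, W_B, W_C, W_h, W_out, conv_w):
    x = rmsnorm(h)                     # pre-norm
    B_gate, C_gate = x @ W_B, x @ W_C  # input/output gates
    y = B_gate * (x @ W_h)             # gated value
    z = depthwise_conv1d(y, conv_w)    # local mixing, k=3
    return h + (C_gate * z) @ W_out    # residual

rng = np.random.default_rng(0)
d, k = 1024, 3
h = rng.standard_normal((2, 8, d)) * 0.02
Ws = [rng.standard_normal((d, d)) * 0.02 for _ in range(4)]
out = gated_conv_block(h, *Ws, rng.standard_normal((d, k)) * 0.02)
print(out.shape)  # (2, 8, 1024): token mixing preserves shape
```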
The vision pipeline has three stages: image preprocessing (smart resize + tiling), SigLIP2 NaFlex encoding (ViT with variable resolution support), and PixelUnshuffle + MLP projection into the LFM2 embedding space. Image features are injected into the text token sequence via placeholder-based masked scatter -- no cross-attention is used.
The effective stride per image token is patch_size × downsample_factor = 32. Special tokens delimit image content in the sequence: <|image_start|> and <|image_end|> bracket the image, <|img_row_R_col_C|> marks grid position, and <|img_thumbnail|> marks the global-context thumbnail.
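For intuition, the special-token scaffold for a tiled image might be assembled as below. This is a hypothetical sketch using the tokens documented above; the exact ordering of tile markers relative to the thumbnail is an assumption, not verified against processing_lfm2_vl.py:

```python
def image_token_layout(rows, cols, use_thumbnail=True):
    # Hypothetical assembly order: start marker, per-tile grid markers,
    # optional thumbnail marker, end marker. Ordering is an assumption.
    parts = ["<|image_start|>"]
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            parts.append(f"<|img_row_{r}_col_{c}|>")
    if use_thumbnail:
        parts.append("<|img_thumbnail|>")
    parts.append("<|image_end|>")
    return parts

print(image_token_layout(2, 2))
```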
The multimodal projector converts SigLIP2 vision features into LFM2 text embeddings in four steps:
1. PixelUnshuffle with downsample_factor=2 in each direction, yielding a 4x token reduction. Local spatial context is preserved because adjacent patches are folded into the channel dimension rather than discarded.
2. LayerNorm over the unshuffled features (projector_use_layernorm=True).
3. Linear projection up to the projector hidden dimension (projector_hidden_size) with GELU activation.
4. Linear projection down to the LFM2 embedding dimension (hidden_size).

The projected image embeddings are then scattered into the text token sequence at positions marked by image_token_id=396. This is a direct embedding replacement (masked_scatter), not cross-attention -- the LFM2 backbone's own attention and convolution layers handle all subsequent vision-language interaction.
# Pseudocode: Multimodal Projector (from modeling_lfm2_vl.py)
def pixel_unshuffle(features, factor=2):
# features: (batch, width, height, channels)
# Rearrange to fold spatial dims into channel dim
B, W, H, C = features.shape
features = features.reshape(B, W, H // factor, factor, C)
features = features.reshape(B, W, H // factor, C * factor)
features = features.transpose(1, 2) # (B, H//factor, W, C*factor)
features = features.reshape(B, H // factor, W // factor, factor, C * factor)
features = features.reshape(B, H // factor, W // factor, C * factor * factor)
return features # (B, H/2, W/2, C*4)
class Lfm2VlMultiModalProjector:
def forward(self, image_features):
# image_features from SigLIP2: (B, H, W, 768)
x = pixel_unshuffle(image_features, factor=2) # -> (B, H/2, W/2, 3072)
x = self.layer_norm(x) # LayerNorm
x = gelu(self.linear_1(x)) # 3072 -> 2560 + GELU
x = self.linear_2(x) # 2560 -> 1024
return x
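The shape bookkeeping above can be exercised with a NumPy transcription (np.swapaxes stands in for the torch-style .transpose(1, 2); the two consecutive reshapes are fused into one, which is equivalent):

```python
import numpy as np

def pixel_unshuffle_np(features, factor=2):
    # NumPy transcription of the projector's pixel_unshuffle above.
    B, W, H, C = features.shape
    x = features.reshape(B, W, H // factor, factor * C)   # fold H pairs into channels
    x = np.swapaxes(x, 1, 2)                              # (B, H//f, W, f*C)
    x = x.reshape(B, H // factor, W // factor, factor * factor * C)  # fold W pairs
    return x

# A 512x512 tile at patch_size=16 gives a 32x32 grid of 768-dim SigLIP2 features.
feats = np.zeros((1, 32, 32, 768), dtype=np.float32)
out = pixel_unshuffle_np(feats)
print(out.shape)  # (1, 16, 16, 3072): 4x fewer tokens, 4x wider channels
```

The 3072-dim output is exactly the input width of the projector's LayerNorm and first linear layer (3072 → 2560 → 1024).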
Verified against huggingface/transformers/models/lfm2_vl. Key source files: configuration_lfm2_vl.py, modeling_lfm2_vl.py (auto-generated from modular_lfm2_vl.py), image_processing_lfm2_vl.py, processing_lfm2_vl.py.
| Property | Value | Source / Location |
|---|---|---|
| Model class hierarchy | Lfm2VlForConditionalGeneration extends LlavaForConditionalGeneration + GenerationMixin | modeling_lfm2_vl.py |
| Core model class | Lfm2VlModel extends LlavaModel | modeling_lfm2_vl.py |
| Config class | Lfm2VlConfig | configuration_lfm2_vl.py |
| image_token_id | 396 | Lfm2VlConfig default |
| projector_hidden_size | 2560 | Lfm2VlConfig default |
| projector_hidden_act | "gelu" | Lfm2VlConfig default |
| projector_bias | True | Lfm2VlConfig default |
| projector_use_layernorm | True | Lfm2VlConfig default |
| downsample_factor | 2 | Lfm2VlConfig default |
| tie_word_embeddings | True | Lfm2VlConfig default |
| Vision config auto-init | CONFIG_MAPPING["siglip2_vision_model"]() | Lfm2VlConfig.__init__ |
| Text config auto-init | CONFIG_MAPPING["lfm2"]() | Lfm2VlConfig.__init__ |
| Vision tower type | SigLIP2 (768 hidden, 12 layers, 12 heads, patch=16) | siglip2_vision_model config |
| Vision-language fusion | Embedding replacement via masked_scatter | Lfm2VlModel.forward() |
| No cross-attention | Confirmed -- image embeddings are injected into token sequence, not attended to separately | Lfm2VlModel.forward() |
| Flash Attention support | _supports_flash_attn = True | Lfm2VlPreTrainedModel |
| SDPA support | _supports_sdpa = True | Lfm2VlPreTrainedModel |
| Flex Attention support | _supports_flex_attn = True | Lfm2VlPreTrainedModel |
| Image splitting | do_image_splitting = True, tile_size=512, min_tiles=2, max_tiles=10 | Lfm2VlImageProcessor |
| Thumbnail | use_thumbnail = True | Lfm2VlImageProcessor |
| Image token range | min=32, max=256 (user-tunable at inference) | Lfm2VlImageProcessor |
| Normalization | ImageNet standard (mean/std), rescale_factor=1/255 | Lfm2VlImageProcessor |
| KV-cache optimization | Images processed only on first iteration; cached for subsequent auto-regressive tokens | prepare_inputs_for_generation() |
| Spatial shape tracking | spatial_shapes tensor (batch, 2) preserves original image dimensions through pipeline | Lfm2VlModel.forward() |
| Generation params | temperature=0.1, min_p=0.15, repetition_penalty=1.05 | HuggingFace model card |
| Deployment formats | Native (BF16), GGUF (Q4_0), ONNX, MLX (4/5/6/8bit, bf16) | HuggingFace model card |
Rendered live from the MADL source below. The MADL string declares the architecture; the JavaScript parser interprets it as a vertical block stack with attention/conv substructure expanded inline. LFM2.5-VL uses a hybrid layer declaration to express the interleaved conv+attention stack.