LFM2.5-VL-450M is Liquid AI's second-generation vision-language model, built on a fundamentally different backbone from standard transformers. Where most VLMs stack decoder-only transformer blocks (GPT, Gemma, Qwen), LFM2.5-VL uses a minimal hybrid of gated short convolution blocks and grouped-query attention blocks as its language backbone (LFM2-350M), with a SigLIP2 NaFlex vision encoder (86M) projecting image features into the text embedding space. The result is a 450M-parameter VLM that runs inference in 242ms on Jetson Orin -- fast enough for real-time edge deployment -- while matching or exceeding the benchmark scores of similarly-sized transformer-based VLMs like SmolVLM2-500M.
The LFM2-350M language model at the core of this system is not a decoder-only transformer. It uses only 6 attention layers out of 16 total. The remaining 10 layers are gated short convolution blocks -- a much cheaper token-mixing mechanism that uses depthwise convolutions (kernel size 3) with input-dependent multiplicative gating. This design was found through hardware-in-the-loop search: Liquid AI systematically tested adding linear attention, state-space models, and extra convolution operators to these stacks and found that none improved aggregate quality over the minimal conv+attention hybrid.
Gated short convolution block (×10): B, C, h~ = Linear(h); y = B * h~; z = Conv_k(y); o = Linear(C * z), with kernel size k=3. Fast local mixing with excellent CPU cache behavior. Each block is followed by a SwiGLU FFN.
Grouped-query attention block (×6): 16 query heads, 8 KV groups, head_dim=64, augmented with QK-Norm and RoPE positional encoding. These layers handle the long-range retrieval that convolutions cannot.
The architectural insight is that most token mixing in language does not require quadratic-cost attention -- local convolutions handle it with O(n) cost and better cache locality. Attention is only injected where long-range dependency retrieval is empirically necessary. This yields 2x faster prefill/decode and a smaller KV cache vs. a similarly-sized all-attention model.
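The cost asymmetry is easy to see with back-of-envelope arithmetic. The sketch below counts only the token-mixing multiply-adds (projections, which both paths share, are excluded); the constant factors are illustrative, but the O(n) vs. O(n²) scaling is the point:

```python
def conv_mix_flops(n, d, k=3):
    # Depthwise-conv token mixing: each of n positions mixes k neighbors
    # across d channels -> O(n * d * k) multiply-adds.
    return n * d * k

def attention_mix_flops(n, d):
    # Dot-product attention token mixing: QK^T and AV are each
    # O(n^2 * d) multiply-adds.
    return 2 * n * n * d

d = 1024  # LFM2 hidden size
for n in (512, 4096, 32768):
    ratio = attention_mix_flops(n, d) / conv_mix_flops(n, d)
    print(f"seq_len={n:>6}: attention/conv mixing cost ~ {ratio:,.0f}x")
```

The ratio grows linearly with sequence length (2n/3 here), which is why replacing 10 of 16 mixing layers with convolutions pays off most at long context.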
LFM2.5-VL is a post-training improvement over LFM2-VL built on the same 450M architecture: the gains come entirely from training changes, not architectural ones. The table below compares it against SmolVLM2-500M:
| Component | LFM2.5-VL-450M | SmolVLM2-500M | Notes |
|---|---|---|---|
| Architecture | Hybrid Conv + GQA | Transformer-only | LFM2 replaces 62.5% of attention layers with gated convolutions |
| Total Parameters | ~450M | ~500M | LFM2.5 is smaller yet outperforms on most benchmarks |
| LM Backbone | LFM2.5-350M | Transformer ~400M | LFM2.5 uses only 6 attention layers out of 16 |
| Vision Encoder | SigLIP2 NaFlex 86M | SigLIP-based | Both use SigLIP family; LFM2.5 uses newer NaFlex variant |
| Hidden Size | 1,024 | -- | |
| Num Layers | 16 (10 conv + 6 attn) | All attention | Mixed layer types are the core architectural difference |
| Attention Heads | 16 (8 KV groups) | -- | GQA with 2:1 query-to-KV ratio |
| Head Dimension | 64 | -- | |
| FFN Dimension | 4,608 (SwiGLU) | -- | 4.5x hidden size, SwiGLU activation |
| Conv Kernel Size | 3 | N/A | Depthwise convolution in gated conv blocks |
| Context Length | 32,768 | -- | |
| Vocab Size | 65,536 | -- | Byte-level BPE tokenizer |
| Projector Hidden | 2,560 | -- | PixelUnshuffle + 2-layer MLP |
| Downsample Factor | 2 (4x token reduction) | -- | PixelUnshuffle reduces vision tokens by factor 4 |
| Normalization | Pre-norm RMSNorm | -- | Throughout all layers |
| Positional Encoding | RoPE (attn only) | -- | Conv blocks have no explicit positional encoding |
| Training Data | 28T tokens | -- | Pre-training at 4K context + 1T at 32K context |
| Distillation | LFM1-7B teacher | -- | Top-K=32 tempered decoupled knowledge distillation |
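The dimensions in the table above can be collected into a single config sketch. The field names below loosely follow the Lfm2VlConfig/Lfm2Config naming style but are illustrative, not the exact HF schema:

```python
# Architecture hyperparameters from the table above, as a plain dict.
# Field names are illustrative (loosely modeled on Lfm2VlConfig).
lfm25_vl_450m = {
    "hidden_size": 1024,
    "num_hidden_layers": 16,      # 10 gated conv + 6 GQA attention
    "num_attention_heads": 16,
    "num_key_value_heads": 8,     # GQA: 2:1 query-to-KV ratio
    "head_dim": 64,
    "intermediate_size": 4608,    # SwiGLU FFN, 4.5x hidden size
    "conv_kernel_size": 3,
    "max_position_embeddings": 32768,
    "vocab_size": 65536,
    "projector_hidden_size": 2560,
    "downsample_factor": 2,       # PixelUnshuffle: 4x vision-token reduction
    "tie_word_embeddings": True,
}

# Consistency checks implied by the table
assert lfm25_vl_450m["num_attention_heads"] * lfm25_vl_450m["head_dim"] \
    == lfm25_vl_450m["hidden_size"]
assert lfm25_vl_450m["intermediate_size"] == 4.5 * lfm25_vl_450m["hidden_size"]
```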
| Benchmark | LFM2.5-VL-450M | LFM2-VL-450M | SmolVLM2-500M | Category |
|---|---|---|---|---|
| MMStar | 43.00 | 40.87 | 38.20 | General VL understanding |
| RealWorldQA | 58.43 | 52.03 | 49.90 | Real-world visual QA |
| MMBench (dev en) | 60.91 | 56.27 | 52.32 | Comprehensive VL benchmark |
| POPE | 86.93 | 83.79 | 82.67 | Object hallucination |
| MMVet | 41.10 | 33.85 | 29.90 | Multi-modal conversation |
| OCRBench | 684 | 657 | 609 | OCR capability |
| RefCOCO-M | 81.28 | -- | -- | Visual grounding (bboxes) |
| MMMB (multilingual) | 68.09 | 54.29 | 46.79 | Multilingual VL |
| MM-IFEval | 45.00 | 32.93 | 11.27 | Instruction following |
| BFCLv4 (function call) | 21.08 | -- | -- | Tool use / function calling |
| Device | Latency | Category |
|---|---|---|
| Jetson Orin | 242ms | Embedded GPU (robotics, IoT) |
| AMD Ryzen AI Max+ 395 | 944ms | Laptop NPU |
| Samsung S25 Ultra | 2.4s | Mobile phone |
| Component | Formula | Parameters | Notes |
|---|---|---|---|
| Embedding | V × d | 67.1M | 65,536 × 1,024 (tied with lm_head) |
| Conv Block ×10 | | | |
| Gate Linear (B,C,h~) | 3 × d × d | 3.1M | Three projections from hidden dim |
| Depthwise Conv | d × k | ~3K | 1,024 × 3 (kernel=3, depthwise) |
| Output Linear | d × d | 1.0M | |
| SwiGLU FFN | 3 × d × d_ff | 14.2M | gate + up (1024→4608) + down (4608→1024) |
| RMSNorm ×2 | 2 × d | ~2K | |
| Conv Block Total | | ~18.3M | |
| × 10 blocks | | ~183M | |
| Attn Block ×6 | | | |
| Q projection | d × (heads × head_dim) | 1.0M | 1024 × (16×64) |
| K projection | d × (kv_groups × head_dim) | 0.5M | 1024 × (8×64) |
| V projection | d × (kv_groups × head_dim) | 0.5M | 1024 × (8×64) |
| Output projection | d × d | 1.0M | |
| SwiGLU FFN | 3 × d × d_ff | 14.2M | Same FFN as conv blocks |
| RMSNorm ×2 | 2 × d | ~2K | |
| Attn Block Total | | ~17.2M | |
| × 6 blocks | | ~103M | |
| Final RMSNorm | d | ~1K | |
| LM Backbone Total | | ~353M | 67M embed + 183M conv + 103M attn |
| SigLIP2 Vision Encoder | | ~86M | 12 ViT layers, 768 hidden, patch_size=16 |
| Multimodal Projector | | ~10M | PixelUnshuffle + LayerNorm + 2-layer MLP (3072→2560→1024) |
| Grand Total | | ~449M | Matches the reported ~450M |
The embedding and lm_head weights are tied (tie_word_embeddings=True), so the 67M embedding parameters are not double-counted. Conv blocks are slightly larger than attention blocks per layer because the gating mechanism requires three projections (B, C, h~) rather than Q, K, V -- but they convolve at O(n) cost instead of O(n²). The full operation, o = Linear(C * Conv_k(B * h~)) where B, C, and h~ are all linear projections of the input, provides data-dependent routing at O(n) cost. Hardware-in-the-loop search confirmed this beats SSMs, linear attention, and extra convolution variants at this model size.
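The totals in the parameter table can be reproduced from the stated dimensions alone. The check below (biases omitted, matching the table) agrees with the table's per-block and aggregate figures to within rounding:

```python
# Reproduce the parameter table from the architecture dimensions.
d, d_ff, k = 1024, 4608, 3
n_heads, n_kv, head_dim = 16, 8, 64
vocab = 65536

# Conv block: three gate projections + depthwise conv + output proj + FFN + norms
conv_block = 3*d*d + d*k + d*d + 3*d*d_ff + 2*d
# Attn block: Q + K/V (GQA) + output proj + FFN + norms
attn_block = d*n_heads*head_dim + 2*d*n_kv*head_dim + d*d + 3*d*d_ff + 2*d
# Backbone: tied embedding + 10 conv blocks + 6 attn blocks + final norm
backbone = vocab*d + 10*conv_block + 6*attn_block + d

print(f"conv block:  {conv_block/1e6:.2f}M")
print(f"attn block:  {attn_block/1e6:.2f}M")
print(f"LM backbone: {backbone/1e6:.1f}M")
print(f"grand total: {(backbone + 86e6 + 10e6)/1e6:.0f}M")
```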
The effective stride per image token is encoder_patch_size * downsample_factor = 32. Large images are tiled into non-overlapping 512x512 patches (2-10 tiles) with an optional low-resolution thumbnail for global context. The token budget is tunable at inference: min_image_tokens=32, max_image_tokens=256.
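The token arithmetic follows directly from the stride. A sketch, assuming the stride of 32 applies per tile and that the thumbnail costs one tile's worth of tokens (the thumbnail budget is an assumption, not taken from the source):

```python
def tokens_per_tile(tile_size=512, patch_size=16, downsample_factor=2):
    # Effective stride per output token is patch_size * downsample_factor = 32,
    # so a 512x512 tile yields (512 // 32) ** 2 = 256 tokens -- exactly the
    # max_image_tokens budget quoted above.
    stride = patch_size * downsample_factor
    return (tile_size // stride) ** 2

def image_tokens(num_tiles, use_thumbnail=True):
    # Hypothetical total: per-tile tokens, plus one tile's worth for the
    # global-context thumbnail (assumed, not verified against the processor).
    total = num_tiles * tokens_per_tile()
    if use_thumbnail:
        total += tokens_per_tile()
    return total

print(tokens_per_tile())  # 256
print(image_tokens(4))    # 1280 (4 tiles + thumbnail)
```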
The gated short convolution block is the most architecturally distinctive component of LFM2. It replaces self-attention as the primary token-mixing mechanism, handling 10 of 16 layers. The design draws on the observation that most token-to-token interactions in language are local (within a few positions), and that data-dependent gating provides sufficient expressive power for local mixing without the quadratic cost of attention.
y = B * h~ (element-wise gating), z = DepthwiseConv(y, k=3), output = Linear_out(C * z). The input-dependent gates B and C make the convolution effectively data-dependent -- different inputs activate different filter patterns.
# Pseudocode: Gated Short Convolution Block (from LFM2 report)
def gated_conv_block(h, W_B, W_C, W_h, W_out, conv_weight, norm):
    x = norm(h)                           # Pre-norm RMSNorm; keep h for the residual
    B = x @ W_B                           # Input gate: (batch, seq, d)
    C = x @ W_C                           # Output gate: (batch, seq, d)
    h_tilde = x @ W_h                     # Value proj: (batch, seq, d)
    y = B * h_tilde                       # Gated input: element-wise
    z = depthwise_conv1d(y, conv_weight)  # Local mixing: kernel_size=3
    out = (C * z) @ W_out                 # Gated output
    return h + out                        # Residual adds the un-normalized input
# Followed by: h = h + swiglu_ffn(norm(h))
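The pseudocode above can be made executable with NumPy and random weights as a shape check. This is a minimal sketch, not the LFM2 implementation: the depthwise convolution is assumed causal (left-padded), as is typical for autoregressive LMs, and the weight scales are arbitrary.

```python
import numpy as np

def depthwise_conv1d(y, w):
    # Causal depthwise conv: each channel gets its own length-k filter,
    # left-padded so position t only sees positions <= t (assumed causal).
    B, T, D = y.shape
    k = w.shape[1]  # w: (D, k)
    y_pad = np.pad(y, ((0, 0), (k - 1, 0), (0, 0)))
    out = np.zeros_like(y)
    for i in range(k):
        out += y_pad[:, i:i + T, :] * w[:, i]
    return out

def rmsnorm(h, eps=1e-6):
    return h / np.sqrt((h * h).mean(-1, keepdims=True) + eps)

def gated_conv_block(h, W_B, W_C, W_h, W_out, conv_w):
    x = rmsnorm(h)                     # pre-norm
    B_gate, C_gate = x @ W_B, x @ W_C  # input/output gates
    y = B_gate * (x @ W_h)             # gated value
    z = depthwise_conv1d(y, conv_w)    # local mixing, k=3
    return h + (C_gate * z) @ W_out    # residual

rng = np.random.default_rng(0)
d, k = 1024, 3
h = rng.standard_normal((2, 8, d)) * 0.02
Ws = [rng.standard_normal((d, d)) * 0.02 for _ in range(4)]
out = gated_conv_block(h, *Ws, rng.standard_normal((d, k)) * 0.02)
print(out.shape)  # (2, 8, 1024): token mixing preserves shape
```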
The vision pipeline has three stages: image preprocessing (smart resize + tiling), SigLIP2 NaFlex encoding (ViT with variable resolution support), and PixelUnshuffle + MLP projection into the LFM2 embedding space. Image features are injected into the text token sequence via placeholder-based masked scatter -- no cross-attention is used.
The effective stride per image token is patch_size × downsample_factor = 32. Special tokens delimit image content in the sequence: <|image_start|> and <|image_end|> bracket the image, <|img_row_R_col_C|> marks grid position, and <|img_thumbnail|> marks the global-context thumbnail.
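For intuition, the special-token scaffold for a tiled image might be assembled as below. This is a hypothetical sketch using the tokens documented above; the exact ordering of tile markers relative to the thumbnail is an assumption, not verified against processing_lfm2_vl.py:

```python
def image_token_layout(rows, cols, use_thumbnail=True):
    # Hypothetical assembly order: start marker, per-tile grid markers,
    # optional thumbnail marker, end marker. Ordering is an assumption.
    parts = ["<|image_start|>"]
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            parts.append(f"<|img_row_{r}_col_{c}|>")
    if use_thumbnail:
        parts.append("<|img_thumbnail|>")
    parts.append("<|image_end|>")
    return parts

print(image_token_layout(2, 2))
```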
The multimodal projector converts SigLIP2 vision features into LFM2 text embeddings in four steps:
1. PixelUnshuffle with downsample_factor=2 in each direction, yielding a 4x token reduction. Local spatial context is preserved because adjacent patches are folded into the channel dimension rather than discarded.
2. LayerNorm over the unshuffled features (projector_use_layernorm=True).
3. Linear projection up to the projector hidden dimension (projector_hidden_size) with GELU activation.
4. Linear projection down to the LFM2 embedding dimension (hidden_size).

The projected image embeddings are then scattered into the text token sequence at positions marked by image_token_id=396. This is a direct embedding replacement (masked_scatter), not cross-attention -- the LFM2 backbone's own attention and convolution layers handle all subsequent vision-language interaction.
# Pseudocode: Multimodal Projector (from modeling_lfm2_vl.py)
def pixel_unshuffle(features, factor=2):
# features: (batch, width, height, channels)
# Rearrange to fold spatial dims into channel dim
B, W, H, C = features.shape
features = features.reshape(B, W, H // factor, factor, C)
features = features.reshape(B, W, H // factor, C * factor)
features = features.transpose(1, 2) # (B, H//factor, W, C*factor)
features = features.reshape(B, H // factor, W // factor, factor, C * factor)
features = features.reshape(B, H // factor, W // factor, C * factor * factor)
return features # (B, H/2, W/2, C*4)
class Lfm2VlMultiModalProjector:
def forward(self, image_features):
# image_features from SigLIP2: (B, H, W, 768)
x = pixel_unshuffle(image_features, factor=2) # -> (B, H/2, W/2, 3072)
x = self.layer_norm(x) # LayerNorm
x = gelu(self.linear_1(x)) # 3072 -> 2560 + GELU
x = self.linear_2(x) # 2560 -> 1024
return x
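The shape bookkeeping above can be exercised with a NumPy transcription (np.swapaxes stands in for the torch-style .transpose(1, 2); the two consecutive reshapes are fused into one, which is equivalent):

```python
import numpy as np

def pixel_unshuffle_np(features, factor=2):
    # NumPy transcription of the projector's pixel_unshuffle above.
    B, W, H, C = features.shape
    x = features.reshape(B, W, H // factor, factor * C)   # fold H pairs into channels
    x = np.swapaxes(x, 1, 2)                              # (B, H//f, W, f*C)
    x = x.reshape(B, H // factor, W // factor, factor * factor * C)  # fold W pairs
    return x

# A 512x512 tile at patch_size=16 gives a 32x32 grid of 768-dim SigLIP2 features.
feats = np.zeros((1, 32, 32, 768), dtype=np.float32)
out = pixel_unshuffle_np(feats)
print(out.shape)  # (1, 16, 16, 3072): 4x fewer tokens, 4x wider channels
```

The 3072-dim output is exactly the input width of the projector's LayerNorm and first linear layer (3072 → 2560 → 1024).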
Verified against huggingface/transformers/models/lfm2_vl. Key source files: configuration_lfm2_vl.py, modeling_lfm2_vl.py (auto-generated from modular_lfm2_vl.py), image_processing_lfm2_vl.py, processing_lfm2_vl.py.
| Property | Value | Source / Location |
|---|---|---|
| Model class hierarchy | Lfm2VlForConditionalGeneration extends LlavaForConditionalGeneration + GenerationMixin | modeling_lfm2_vl.py |
| Core model class | Lfm2VlModel extends LlavaModel | modeling_lfm2_vl.py |
| Config class | Lfm2VlConfig | configuration_lfm2_vl.py |
| image_token_id | 396 | Lfm2VlConfig default |
| projector_hidden_size | 2560 | Lfm2VlConfig default |
| projector_hidden_act | "gelu" | Lfm2VlConfig default |
| projector_bias | True | Lfm2VlConfig default |
| projector_use_layernorm | True | Lfm2VlConfig default |
| downsample_factor | 2 | Lfm2VlConfig default |
| tie_word_embeddings | True | Lfm2VlConfig default |
| Vision config auto-init | CONFIG_MAPPING["siglip2_vision_model"]() | Lfm2VlConfig.__init__ |
| Text config auto-init | CONFIG_MAPPING["lfm2"]() | Lfm2VlConfig.__init__ |
| Vision tower type | SigLIP2 (768 hidden, 12 layers, 12 heads, patch=16) | siglip2_vision_model config |
| Vision-language fusion | Embedding replacement via masked_scatter | Lfm2VlModel.forward() |
| No cross-attention | Confirmed -- image embeddings are injected into token sequence, not attended to separately | Lfm2VlModel.forward() |
| Flash Attention support | _supports_flash_attn = True | Lfm2VlPreTrainedModel |
| SDPA support | _supports_sdpa = True | Lfm2VlPreTrainedModel |
| Flex Attention support | _supports_flex_attn = True | Lfm2VlPreTrainedModel |
| Image splitting | do_image_splitting = True, tile_size=512, min_tiles=2, max_tiles=10 | Lfm2VlImageProcessor |
| Thumbnail | use_thumbnail = True | Lfm2VlImageProcessor |
| Image token range | min=32, max=256 (user-tunable at inference) | Lfm2VlImageProcessor |
| Normalization | ImageNet standard (mean/std), rescale_factor=1/255 | Lfm2VlImageProcessor |
| KV-cache optimization | Images processed only on first iteration; cached for subsequent auto-regressive tokens | prepare_inputs_for_generation() |
| Spatial shape tracking | spatial_shapes tensor (batch, 2) preserves original image dimensions through pipeline | Lfm2VlModel.forward() |
| Generation params | temperature=0.1, min_p=0.15, repetition_penalty=1.05 | HuggingFace model card |
| Deployment formats | Native (BF16), GGUF (Q4_0), ONNX, MLX (4/5/6/8bit, bf16) | HuggingFace model card |
Rendered live from the MADL source below. The MADL string declares the architecture; the JavaScript parser interprets it as a vertical block stack with attention/conv substructure expanded inline. LFM2.5-VL uses a hybrid layer declaration to express the interleaved conv+attention stack.