MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models
Submitted to CVPR 2026, 2025
Abstract
Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, in which a multimodal large language model (MLLM) serves as the verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence.
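As a rough illustration of the Evaluation-by-Checkpoint idea described in the abstract, the sketch below scores a generated image against a list of yes/no checkpoints. It is not the paper's implementation: the `Checkpoint` class, the `score_by_checkpoints` function, the aspect names, and the `verifier` callable (a stand-in for an MLLM query) are all hypothetical names introduced here, and the pass-rate aggregation is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Checkpoint:
    """One verifiable yes/no question about a generated image (hypothetical)."""
    aspect: str    # e.g. "instruction", "identity", "structure"
    question: str  # what the verifier is asked to confirm


# Hypothetical verifier signature: (image_id, question) -> True if satisfied.
# In practice this would wrap a call to an MLLM; here it is left abstract.
Verifier = Callable[[str, str], bool]


def score_by_checkpoints(
    image_id: str,
    checkpoints: List[Checkpoint],
    verifier: Verifier,
) -> Dict[str, float]:
    """Return the pass rate per aspect plus an overall pass rate."""
    total: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for cp in checkpoints:
        total[cp.aspect] = total.get(cp.aspect, 0) + 1
        if verifier(image_id, cp.question):
            passed[cp.aspect] = passed.get(cp.aspect, 0) + 1

    scores = {aspect: passed.get(aspect, 0) / n for aspect, n in total.items()}
    # Assumed aggregation: fraction of all checkpoints that passed.
    scores["overall"] = (
        sum(passed.values()) / len(checkpoints) if checkpoints else 0.0
    )
    return scores
```

Splitting the judgment into small, independently checkable questions is what makes the score auditable: each failed checkpoint points at a specific aspect of the output rather than a single opaque rating.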
Highlights
- 🔍 Six multi-image tasks covering object composition, spatial reasoning, attribute disentanglement, foreground/background swapping, component transfer, and story generation.
- 🤖 Evaluation-by-Checkpoint uses MLLMs as verifiers to automatically score instruction following, identity preservation, and structural coherence.
- 🔁 Dynamic Attention Rebalancing (DAR) offers a training-free plug-in that dynamically modulates attention weights during inference to reduce hallucination and improve cross-image consistency.
- 📊 Rigorous benchmarking across state-of-the-art UMMs reveals key limitations in multi-image context reasoning and demonstrates the effectiveness of DAR.
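To make the notion of rebalancing attention at inference time more concrete, here is a toy sketch. This page does not specify DAR's actual formulation, so everything below is an assumption made for illustration: the `rebalance_attention` function, the `min_share` threshold, and the rule itself (lift the attention mass of any under-attended source group, then renormalize) are hypothetical and should not be read as the paper's method.

```python
import numpy as np


def rebalance_attention(attn, group_ids, min_share=0.2):
    """Toy rebalancing of attention across source groups (illustrative only).

    attn:      array of shape (queries, keys); each row is assumed to sum to 1.
    group_ids: array of shape (keys,); integer label of the source each key
               token comes from (e.g. 0 = reference image A, 1 = image B).
    min_share: hypothetical threshold; groups receiving less than this share
               of a query's attention are scaled up before renormalizing.
    """
    attn = np.asarray(attn, dtype=float)
    group_ids = np.asarray(group_ids)
    out = attn.copy()

    for g in np.unique(group_ids):
        mask = group_ids == g
        # Total attention each query gives to this group, shape (queries, 1).
        share = attn[:, mask].sum(axis=1, keepdims=True)
        # Scale under-attended groups toward min_share; leave others unchanged.
        scale = np.where(
            (share > 0) & (share < min_share),
            min_share / np.maximum(share, 1e-12),
            1.0,
        )
        out[:, mask] = attn[:, mask] * scale

    # Renormalize so each query's weights again sum to 1.
    out /= out.sum(axis=1, keepdims=True)
    return out
```

Because the function only rescales existing weights, it needs no training and could in principle be applied to any attention map, which matches the training-free, plug-in framing above. The relative weights of tokens within a group are left unchanged.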
Status
- Submitted to CVPR 2026
Role: Co-first author
Recommended citation: **Hang Liu** (co-first author), et al. "MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models." Submitted to CVPR 2026.