Qibin Hou

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

arXiv 2026

Mixture of Style Experts for Diverse Image Stylization

arXiv 2026

Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

arXiv 2026

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

arXiv 2026

GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

arXiv 2026

Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

arXiv 2026

DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation

CVPR 2025

K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs

CVPR 2025

Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning

arXiv 2025

Depth Anything at Any Condition

arXiv 2025

A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models

arXiv 2025

LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

arXiv 2025

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

arXiv 2025

The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

arXiv 2025

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

arXiv 2025

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

arXiv 2024

DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction

ICCV 2025

SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

arXiv 2024

Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

arXiv 2024

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

arXiv 2024

YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection

arXiv 2023

Large Selective Kernel Network for Remote Sensing Object Detection

ICCV 2023

ChatAnything: Facetime Chat with LLM-Enhanced Personas

arXiv 2023

SRFormerV2: Taking a Closer Look at Permuted Self-Attention for Image Super-Resolution

ICCV 2023

AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation

CVPR 2023

MCANet: Medical Image Segmentation with Multi-Scale Cross-Axis Attention

arXiv 2023

CrossKD: Cross-Head Knowledge Distillation for Object Detection

CVPR 2024

CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation

CVPR 2024

Referring Camouflaged Object Detection

arXiv 2023

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

arXiv 2023