Mohamed Elhoseiny

Time Blindness: Why Video-Language Models Can't See What Humans Can?

arXiv 2025

WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation

ICCV 2025

Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

arXiv 2025

4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

arXiv 2025

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

arXiv 2024

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

arXiv 2024

MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis

arXiv 2024

Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding

arXiv 2024

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

arXiv 2024

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

arXiv 2024

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

arXiv 2024

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

arXiv 2023

Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

arXiv 2023

CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

arXiv 2023

Overcoming Generic Knowledge Loss with Selective Parameter Update

CVPR 2024

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

arXiv 2023

StoryGPT-V: Large Language Models as Consistent Story Visualizers

CVPR 2025

Continual Zero-Shot Learning through Semantically Guided Generative Random Walks

ICCV 2023