Salman Khan
- Papers: 81
Authored papers (81)
MediX-R1: Open Ended Medical Reinforcement Learning
arXiv 2026
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
arXiv 2026
WorldCache: Content-Aware Caching for Accelerated Video World Models
arXiv 2026
CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare
arXiv 2026
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
arXiv 2026
Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework
arXiv 2026
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
arXiv 2026
From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
arXiv 2026
LLM Post-Training: A Deep Dive into Reasoning Large Language Models
arXiv 2025
GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing
arXiv 2025
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
arXiv 2025
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
arXiv 2025
StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models
arXiv 2025
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
arXiv 2025
ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
arXiv 2025
Dr.LLM: Dynamic Layer Routing in LLMs
arXiv 2025
CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
arXiv 2025
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
arXiv 2025
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
arXiv 2025
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
arXiv 2025
AIN: The Arabic INclusive Large Multimodal Model
arXiv 2025
AirCast: Improving Air Pollution Forecasting Through Multi-Variable Data Alignment
arXiv 2025
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
arXiv 2025
C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation
arXiv 2025
Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
arXiv 2025
A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
arXiv 2025
Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
arXiv 2025
Video-CoM: Interactive Video Reasoning via Chain of Manipulations
arXiv 2025
Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
arXiv 2025
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits
arXiv 2025
Diversity Has Always Been There in Your Visual Autoregressive Models
arXiv 2025
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
arXiv 2025
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation
arXiv 2024
Frontiers in Intelligent Colonoscopy
arXiv 2024
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
arXiv 2024
UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities
arXiv 2024
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
CVPR 2025 1
Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation
arXiv 2024
GroupMamba: Efficient Group-Based Visual State Space Model
CVPR 2025 1
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
CVPR 2024 1
PALO: A Polyglot Large Multimodal Model for 5B People
arXiv 2024
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
arXiv 2024
How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?
arXiv 2024
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
arXiv 2024
Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning
arXiv 2024
Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning
arXiv 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
arXiv 2024
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
arXiv 2024
BiMediX: Bilingual Medical Mixture of Experts LLM
arXiv 2024
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
arXiv 2024
FANet: Feature Amplification Network for Semantic Segmentation in Cluttered Background
arXiv 2024
COSNet: A Novel Semantic Segmentation Network using Enhanced Boundaries in Cluttered Scenes
arXiv 2024
Multi-modal Generation via Cross-Modal In-Context Learning
arXiv 2024
GeoChat: Grounded Large Vision-Language Model for Remote Sensing
CVPR 2024 1
SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications
ICCV 2023 1
GLaMM: Pixel Grounding Large Multimodal Model
CVPR 2024 1
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
arXiv 2023
Burstormer: Burst Image Restoration and Enhancement Transformer
CVPR 2023 1
Sentence-level Prompts Benefit Composed Image Retrieval
arXiv 2023
XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models
arXiv 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
arXiv 2023
PromptIR: Prompting for All-in-One Blind Image Restoration
arXiv 2023
Enhancing Novel Object Detection via Cooperative Foundational Models
arXiv 2023
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation
arXiv 2023
Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment
arXiv 2023
Generative Multiplane Neural Radiance for 3D-Aware Image Generation
ICCV 2023 1
Towards Instance-adaptive Inference for Federated Learning
ICCV 2023 1
Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation
ICCV 2023 1
Self-regulating Prompts: Foundational Model Adaptation without Forgetting
ICCV 2023 1
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
arXiv 2023
LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts
arXiv 2023
Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning
ICCV 2023 1
How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges
arXiv 2023
Modulate Your Spectrum in Self-Supervised Learning
arXiv 2023
EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications
arXiv 2022
Fine-tuned CLIP Models are Efficient Video Learners
CVPR 2023 1
MaPLe: Multi-modal Prompt Learning
CVPR 2023
CLIP model is an Efficient Continual Learner
arXiv 2022
Handwriting Transformers
ICCV 2021 10
Restormer: Efficient Transformer for High-Resolution Image Restoration
CVPR 2022 1
Learning Enriched Features for Real Image Restoration and Enhancement
ECCV 2020 8
Frequent co-authors
10 from 81 papers
Fahad Shahbaz Khan
Hisham Cholakkal
Rao Muhammad Anwer
Abdelrahman Shaker
Fahad Khan
Muhammad Maaz
Ming-Hsuan Yang
Muzammal Naseer
Omkar Thawakar
Ahmed Heakl