Salman Khan

Papers
81

Attribution

Affiliations & profile
Semantic Scholar

Authored papers

81

MediX-R1: Open Ended Medical Reinforcement Learning

arXiv 2026

2026

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

arXiv 2026

2026

WorldCache: Content-Aware Caching for Accelerated Video World Models

arXiv 2026

2026

CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

arXiv 2026

2026

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

arXiv 2026

2026

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

arXiv 2026

2026

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

arXiv 2026

2026

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

arXiv 2026

2026

LLM Post-Training: A Deep Dive into Reasoning Large Language Models

arXiv 2025

2025

GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

arXiv 2025

2025

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

arXiv 2025

2025

KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

arXiv 2025

2025

StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models

arXiv 2025

2025

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

arXiv 2025

2025

ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark

arXiv 2025

2025

Dr.LLM: Dynamic Layer Routing in LLMs

arXiv 2025

2025

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

arXiv 2025

2025

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

arXiv 2025

2025

Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model

arXiv 2025

2025

DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

arXiv 2025

2025

AIN: The Arabic INclusive Large Multimodal Model

arXiv 2025

2025

AirCast: Improving Air Pollution Forecasting Through Multi-Variable Data Alignment

arXiv 2025

2025

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

arXiv 2025

2025

C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation

arXiv 2025

2025

Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs

arXiv 2025

2025

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv 2025

2025

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

arXiv 2025

2025

Video-CoM: Interactive Video Reasoning via Chain of Manipulations

arXiv 2025

2025

Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

arXiv 2025

2025

PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

arXiv 2025

2025

Diversity Has Always Been There in Your Visual Autoregressive Models

arXiv 2025

2025

VideoMolmo: Spatio-Temporal Grounding Meets Pointing

arXiv 2025

2025

AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation

arXiv 2024

2024

Frontiers in Intelligent Colonoscopy

arXiv 2024

2024

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

arXiv 2024

2024

UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

arXiv 2024

2024

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

CVPR 2025 1

2024

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

arXiv 2024

2024

GroupMamba: Efficient Group-Based Visual State Space Model

CVPR 2025 1

2024

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

CVPR 2024 1

2024

PALO: A Polyglot Large Multimodal Model for 5B People

arXiv 2024

2024

BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities

arXiv 2024

2024

How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?

arXiv 2024

2024

VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs

arXiv 2024

2024

Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning

arXiv 2024

2024

Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning

arXiv 2024

2024

VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding

arXiv 2024

2024

MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

arXiv 2024

2024

BiMediX: Bilingual Medical Mixture of Experts LLM

arXiv 2024

2024

CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

arXiv 2024

2024

FANet: Feature Amplification Network for Semantic Segmentation in Cluttered Background

arXiv 2024

2024

COSNet: A Novel Semantic Segmentation Network using Enhanced Boundaries in Cluttered Scenes

arXiv 2024

2024

Multi-modal Generation via Cross-Modal In-Context Learning

arXiv 2024

2024

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

CVPR 2024 1

2023

SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

ICCV 2023 1

2023

GLaMM: Pixel Grounding Large Multimodal Model

CVPR 2024 1

2023

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv 2023

2023

Burstormer: Burst Image Restoration and Enhancement Transformer

CVPR 2023 1

2023

Sentence-level Prompts Benefit Composed Image Retrieval

arXiv 2023

2023

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

arXiv 2023

2023

Foundational Models Defining a New Era in Vision: A Survey and Outlook

arXiv 2023

2023

PromptIR: Prompting for All-in-One Blind Image Restoration

arXiv 2023

2023

Enhancing Novel Object Detection via Cooperative Foundational Models

arXiv 2023

2023

How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation

arXiv 2023

2023

Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment

arXiv 2023

2023

Generative Multiplane Neural Radiance for 3D-Aware Image Generation

ICCV 2023 1

2023

Towards Instance-adaptive Inference for Federated Learning

ICCV 2023 1

2023

Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation

ICCV 2023 1

2023

Self-regulating Prompts: Foundational Model Adaptation without Forgetting

ICCV 2023 1

2023

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

arXiv 2023

2023

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

arXiv 2023

2023

Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning

ICCV 2023 1

2023

How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges

arXiv 2023

2023

Modulate Your Spectrum in Self-Supervised Learning

arXiv 2023

2023

EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications

arXiv 2022

2022

Fine-tuned CLIP Models are Efficient Video Learners

CVPR 2023 1

2022

MaPLe: Multi-modal Prompt Learning

CVPR 2023

2022

CLIP model is an Efficient Continual Learner

arXiv 2022

2022

Handwriting Transformers

ICCV 2021 10

2021

Restormer: Efficient Transformer for High-Resolution Image Restoration

CVPR 2022 1

2021

Learning Enriched Features for Real Image Restoration and Enhancement

ECCV 2020 8

2020

Affiliations

No known affiliations.

Frequent co-authors

10

from 81 papers