Mohit Bansal

Papers
113

Authored papers

113

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

arXiv 2025

2026

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

arXiv 2026

2026

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

arXiv 2026

2026

V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

arXiv 2026

2026

MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

arXiv 2026

2026

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

arXiv 2026

2026

Stabilizing Efficient Reasoning with Step-Level Advantage Selection

arXiv 2026

2026

Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills

arXiv 2025

2026

AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

arXiv 2026

2026

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

arXiv 2026

2026

Multimodal Fact-Level Attribution for Verifiable Reasoning

arXiv 2026

2026

Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems

arXiv 2026

2026

VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting

arXiv 2026

2026

OpenThoughts: Data Recipes for Reasoning Models

arXiv 2025

2025

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

arXiv 2025

2025

Retrieval-Augmented Generation with Conflicting Evidence

arXiv 2025

2025

PRInTS: Reward Modeling for Long-Horizon Information Seeking

arXiv 2025

2025

StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

arXiv 2025

2025

SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

arXiv 2025

2025

4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time

arXiv 2025

2025

SiLVR: A Simple Language-based Video Reasoning Framework

arXiv 2025

2025

Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

arXiv 2025

2025

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

arXiv 2025

2025

RSQ: Learning from Important Tokens Leads to Better Quantized LLMs

arXiv 2025

2025

Learning to Generate Unit Tests for Automated Debugging

arXiv 2025

2025

Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

arXiv 2025

2025

Planning with Sketch-Guided Verification for Physics-Aware Video Generation

arXiv 2025

2025

MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

arXiv 2025

2025

Error-Driven Scene Editing for 3D Grounding in Large Language Models

arXiv 2025

2025

CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

ICCV 2025

2025

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

arXiv 2025

2025

GenerationPrograms: Fine-grained Attribution with Executable Programs

arXiv 2025

2025

Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

arXiv 2025

2025

UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning

arXiv 2025

2025

The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration

arXiv 2025

2025

One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

arXiv 2025

2025

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

arXiv 2025

2025

TrustLLM: Trustworthiness in Large Language Models

arXiv 2024

2024

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

arXiv 2024

2024

Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

arXiv 2024

2024

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

arXiv 2024

2024

DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback

arXiv 2024

2024

ReGAL: Refactoring Programs to Discover Generalizable Abstractions

arXiv 2024

2024

Soft Self-Consistency Improves Language Model Agents

arXiv 2024

2024

MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning

arXiv 2024

2024

LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models

arXiv 2024

2024

AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge

arXiv 2024

2024

Adapt-∞: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection

arXiv 2024

2024

Inducing Systematicity in Transformers by Attending to Structurally Quantized Embeddings

arXiv 2024

2024

Glider: Global and Local Instruction-Driven Expert Router

arXiv 2024

2024

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

arXiv 2024

2024

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

arXiv 2024

2024

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

ICCV 2025

2024

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

arXiv 2024

2024

RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives

arXiv 2024

2024

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

arXiv 2024

2024

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

arXiv 2024

2024

See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

arXiv 2024

2024

Teaching Models to Balance Resisting and Accepting Persuasion

arXiv 2024

2024

TIES-Merging: Resolving Interference When Merging Models

NeurIPS 2023 11

2023

A Simple LLM Framework for Long-Range Video Question-Answering

arXiv 2023

2023

ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs

arXiv 2023

2023

Any-to-Any Generation via Composable Diffusion

NeurIPS 2023 11

2023

ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization

arXiv 2023

2023

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

arXiv 2023

2023

Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

arXiv 2023

2023

MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies

arXiv 2023

2023

Generating Summaries with Controllable Readability Levels

arXiv 2023

2023

Non-Sequential Graph Script Induction via Multimedia Grounding

arXiv 2023

2023

Debiasing Multimodal Models via Causal Information Minimization

arXiv 2023

2023

Scaling Data Generation in Vision-and-Language Navigation

ICCV 2023 1

2023

Self-Chained Image-Language Model for Video Localization and Question Answering

2023

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

arXiv 2023

2023

Hierarchical Video-Moment Retrieval and Step-Captioning

CVPR 2023 1

2023

Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy

arXiv 2023

2023

Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization

arXiv 2023

2023

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

2023

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks

arXiv 2023

2023

Merging by Matching Models in Task Parameter Subspaces

arXiv 2023

2023

Data Factors for Better Compositional Generalization

arXiv 2023

2023

Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models

arXiv 2023

2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

TMLR

2022

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

arXiv 2022

2022

Unifying Vision, Text, and Layout for Universal Document Processing

CVPR 2023 1

2022

StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation

arXiv 2022

2022

Fine-grained Image Captioning with CLIP Reward

Findings (NAACL) 2022 7

2022

LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

arXiv 2022

2022

Evaluating the Factual Consistency of Large Language Models Through News Summarization

arXiv 2022

2022

Explanation Graph Generation via Pre-trained Language Models: An Empirical Study with Contrastive Learning

ACL 2022 5

2022

CAISE: Conversational Agent for Image Search and Editing

arXiv 2022

2022

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models

ICCV 2023 1

2022

TVLT: Textless Vision-Language Transformer

arXiv 2022

2022

Exclusive Supermask Subnetwork Training for Continual Learning

arXiv 2022

2022

Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations

arXiv 2022

2022

WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

arXiv 2022

2022

QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

arXiv 2021

2021

Robustness Gym: Unifying the NLP Evaluation Landscape

NAACL 2021 4

2021

How Much Can CLIP Benefit Vision-and-Language Tasks?

arXiv 2021

2021

Unifying Vision-and-Language Tasks via Text Generation

arXiv 2021

2021

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

CVPR 2022 1

2021

When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data

LNLS (ACL) 2022 5

2021

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

NeurIPS 2021 12

2021

Summary-Source Proposition-level Alignment: Task, Datasets and Supervised Baseline

CoNLL (EMNLP) 2021 11

2020

ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization

EMNLP 2020 11

2020

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

ECCV 2020 8

2020

What is More Likely to Happen Next? Video-and-Language Future Event Prediction

EMNLP 2020 11

2020

Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?

2020

TVQA+: Spatio-Temporal Grounding for Video Question Answering

2019

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

2019

Adversarial NLI: A New Benchmark for Natural Language Understanding

2019

Expressing Visual Relationships via Language

2019

PaperRobot: Incremental Draft Generation of Scientific Ideas

2019

Combining Fact Extraction and Verification with Neural Semantic Matching Networks

arXiv 2018

2018

Affiliations

No known affiliations.

Frequent co-authors: 10 (from 113 papers)