Mohit Bansal
- Papers: 113
Authored papers
EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
arXiv 2025
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
arXiv 2026
When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
arXiv 2026
V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising
arXiv 2026
MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments
arXiv 2026
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
arXiv 2026
Stabilizing Efficient Reasoning with Step-Level Advantage Selection
arXiv 2026
Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills
arXiv 2025
AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
arXiv 2026
Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind
arXiv 2026
Multimodal Fact-Level Attribution for Verifiable Reasoning
arXiv 2026
Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems
arXiv 2026
VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
arXiv 2026
OpenThoughts: Data Recipes for Reasoning Models
arXiv 2025
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
arXiv 2025
Retrieval-Augmented Generation with Conflicting Evidence
arXiv 2025
PRInTS: Reward Modeling for Long-Horizon Information Seeking
arXiv 2025
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
arXiv 2025
SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
arXiv 2025
4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time
arXiv 2025
SiLVR: A Simple Language-based Video Reasoning Framework
arXiv 2025
Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails
arXiv 2025
Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
arXiv 2025
RSQ: Learning from Important Tokens Leads to Better Quantized LLMs
arXiv 2025
Learning to Generate Unit Tests for Automated Debugging
arXiv 2025
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
arXiv 2025
Planning with Sketch-Guided Verification for Physics-Aware Video Generation
arXiv 2025
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
arXiv 2025
Error-Driven Scene Editing for 3D Grounding in Large Language Models
arXiv 2025
CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
ICCV 2025
RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
arXiv 2025
GenerationPrograms: Fine-grained Attribution with Executable Programs
arXiv 2025
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
arXiv 2025
UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning
arXiv 2025
The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration
arXiv 2025
One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration
arXiv 2025
A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
arXiv 2025
TrustLLM: Trustworthiness in Large Language Models
arXiv 2024
SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation
arXiv 2024
Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
arXiv 2024
Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models
arXiv 2024
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
arXiv 2024
ReGAL: Refactoring Programs to Discover Generalizable Abstractions
arXiv 2024
Soft Self-Consistency Improves Language Model Agents
arXiv 2024
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning
arXiv 2024
LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models
arXiv 2024
AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge
arXiv 2024
Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection
arXiv 2024
Inducing Systematicity in Transformers by Attending to Structurally Quantized Embeddings
arXiv 2024
Glider: Global and Local Instruction-Driven Expert Router
arXiv 2024
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations
arXiv 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
arXiv 2024
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
ICCV 2025
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
arXiv 2024
RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives
arXiv 2024
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
arXiv 2024
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
arXiv 2024
See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding
arXiv 2024
Teaching Models to Balance Resisting and Accepting Persuasion
arXiv 2024
TIES-Merging: Resolving Interference When Merging Models
NeurIPS 2023 11
A Simple LLM Framework for Long-Range Video Question-Answering
arXiv 2023
ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
arXiv 2023
Any-to-Any Generation via Composable Diffusion
NeurIPS 2023 11
ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization
arXiv 2023
Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
arXiv 2023
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
arXiv 2023
MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
arXiv 2023
Generating Summaries with Controllable Readability Levels
arXiv 2023
Non-Sequential Graph Script Induction via Multimedia Grounding
arXiv 2023
Debiasing Multimodal Models via Causal Information Minimization
arXiv 2023
Scaling Data Generation in Vision-and-Language Navigation
ICCV 2023 1
Self-Chained Image-Language Model for Video Localization and Question Answering
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
arXiv 2023
Hierarchical Video-Moment Retrieval and Step-Captioning
CVPR 2023 1
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
arXiv 2023
Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization
arXiv 2023
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
arXiv 2023
Merging by Matching Models in Task Parameter Subspaces
arXiv 2023
Data Factors for Better Compositional Generalization
arXiv 2023
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
arXiv 2023
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
TMLR
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
arXiv 2022
Unifying Vision, Text, and Layout for Universal Document Processing
CVPR 2023 1
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation
arXiv 2022
Fine-grained Image Captioning with CLIP Reward
Findings (NAACL) 2022 7
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
arXiv 2022
Evaluating the Factual Consistency of Large Language Models Through News Summarization
arXiv 2022
Explanation Graph Generation via Pre-trained Language Models: An Empirical Study with Contrastive Learning
ACL 2022 5
CAISE: Conversational Agent for Image Search and Editing
arXiv 2022
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models
ICCV 2023 1
TVLT: Textless Vision-Language Transformer
arXiv 2022
Exclusive Supermask Subnetwork Training for Continual Learning
arXiv 2022
Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations
arXiv 2022
WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models
arXiv 2022
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
arXiv 2021
Robustness Gym: Unifying the NLP Evaluation Landscape
NAACL 2021 4
How Much Can CLIP Benefit Vision-and-Language Tasks?
arXiv 2021
Unifying Vision-and-Language Tasks via Text Generation
arXiv 2021
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
CVPR 2022 1
When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data
LNLS (ACL) 2022 5
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
NeurIPS 2021 12
Summary-Source Proposition-level Alignment: Task, Datasets and Supervised Baseline
CoNLL (EMNLP) 2021 11
ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization
EMNLP 2020 11
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
ECCV 2020 8
What is More Likely to Happen Next? Video-and-Language Future Event Prediction
EMNLP 2020 11
Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?
TVQA+: Spatio-Temporal Grounding for Video Question Answering
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Adversarial NLI: A New Benchmark for Natural Language Understanding
Expressing Visual Relationships via Language
PaperRobot: Incremental Draft Generation of Scientific Ideas
Combining Fact Extraction and Verification with Neural Semantic Matching Networks
arXiv 2018
Frequent co-authors
10 from 113 papers