Jingdong Wang
- Papers: 44
Authored papers
RefAlign: Representation Alignment for Reference-to-Video Generation
arXiv 2026
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
arXiv 2026
No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
arXiv 2025
Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation
arXiv 2025
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
arXiv 2025
Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
arXiv 2025
Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer
CVPR 2025 1
LION: Linear Group RNN for 3D Object Detection in Point Clouds
arXiv 2024
LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction
arXiv 2024
LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection
arXiv 2024
Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
arXiv 2024
Dense Connector for MLLMs
arXiv 2024
MS-DETR: Efficient DETR Training with Mixed Supervision
CVPR 2024 1
MonoFormer: One Transformer for Both Diffusion and Autoregression
arXiv 2024
OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection
arXiv 2024
TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On
arXiv 2024
Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression
arXiv 2024
Training-Free Unsupervised Prompt for Vision-Language Models
arXiv 2024
Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation
arXiv 2024
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
arXiv 2024
A Survey of Reasoning with Foundation Models
arXiv 2023
HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception
Delicate Textured Mesh Recovery from NeRF via Adaptive Surface Refinement
ICCV 2023 1
StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training
arXiv 2023
Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
PLIP: Language-Image Pre-training for Person Representation Learning
arXiv 2023
What Can Simple Arithmetic Operations Do for Temporal Modeling?
ICCV 2023 1
UATVR: Uncertainty-Adaptive Text-Video Retrieval
ICCV 2023 1
Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation
ICCV 2023 1
CPCM: Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation
ICCV 2023 1
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
IRGen: Generative Modeling for Image Retrieval
arXiv 2023
Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching
ICCV 2023 1
Context Autoencoder for Self-Supervised Representation Learning
arXiv 2022
DaViT: Dual Attention Vision Transformers
arXiv 2022
Few-Shot Font Generation by Learning Fine-Grained Local Styles
CVPR 2022 1
Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment
ICCV 2023 1
Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
CVPR 2023 1
SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search
Conditional DETR for Fast Training Convergence
ICCV 2021 10
Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision
CVPR 2021 1
Lite-HRNet: A Lightweight High-Resolution Network
CVPR 2021 1
Deep High-Resolution Representation Learning for Human Pose Estimation
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation
ECCV 2020 8