PengFei Liu
- Papers: 83
Authored papers (83)
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
arXiv 2026
ASI-Evolve: AI Accelerates AI
arXiv 2026
Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
arXiv 2026
daVinci-Env: Open SWE Environment Synthesis at Scale
arXiv 2026
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
arXiv 2026
LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
arXiv 2026
Hybrid Policy Distillation for LLMs
arXiv 2026
daVinci-LLM: Towards the Science of Pretraining
arXiv 2026
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
arXiv 2026
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
arXiv 2026
One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling
arXiv 2026
daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently
arXiv 2026
Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training
arXiv 2026
daVinci-Dev: Agent-native Mid-training for Software Engineering
arXiv 2026
AcademiClaw: When Students Set Challenges for AI Agents
arXiv 2026
PRBench: End-to-end Paper Reproduction in Physics Research
arXiv 2026
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
arXiv 2025
Seed1.5-VL Technical Report
arXiv 2025
Thinking with Generated Images
arXiv 2025
LIMO: Less is More for Reasoning
arXiv 2025
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models
arXiv 2025
Generative AI Act II: Test Time Scaling Drives Cognition Engineering
arXiv 2025
Efficient Agent Training for Computer Use
arXiv 2025
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
arXiv 2025
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
arXiv 2025
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
arXiv 2025
LIMI: Less is More for Agency
arXiv 2025
LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
arXiv 2025
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
arXiv 2025
DIVE: Diversified Iterative Self-Improvement
arXiv 2025
One RL to See Them All: Visual Triple Unified Reinforcement Learning
arXiv 2025
SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling
arXiv 2025
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
arXiv 2025
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
arXiv 2025
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
arXiv 2025
Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
arXiv 2025
LIMR: Less is More for RL Scaling
arXiv 2025
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
arXiv 2025
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing
arXiv 2025
Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
arXiv 2025
Halu-J: Critique-Based Hallucination Judge
arXiv 2024
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
arXiv 2024
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
arXiv 2024
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
arXiv 2024
MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation
arXiv 2024
OpenResearcher: Unleashing AI for Accelerated Scientific Research
arXiv 2024
OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?
arXiv 2024
Extending LLMs' Context Window with 100 Samples
arXiv 2024
Weak-to-Strong Reasoning
arXiv 2024
Benchmarking Benchmark Leakage in Large Language Models
arXiv 2024
InFoBench: Evaluating Instruction Following Ability in Large Language Models
arXiv 2024
The Critique of Critique
arXiv 2024
BeHonest: Benchmarking Honesty in Large Language Models
arXiv 2024
Dissecting Human and LLM Preferences
arXiv 2024
A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions
arXiv 2024
A quantitative analysis of knowledge-learning preferences in large language models in molecular science
arXiv 2024
FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models
arXiv 2024
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
arXiv 2024
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
arXiv 2024
Evaluating Mathematical Reasoning Beyond Accuracy
arXiv 2024
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
arXiv 2024
Understanding Reference Policies in Direct Preference Optimization
arXiv 2024
ECon: On the Detection and Resolution of Evidence Conflicts
arXiv 2024
Reformatted Alignment
arXiv 2024
FELM: Benchmarking Factuality Evaluation of Large Language Models
NeurIPS 2023 11
Alignment for Honesty
arXiv 2023
MathPile: A Billion-Token-Scale Pretraining Corpus for Math
arXiv 2023
Generative Judge for Evaluating Alignment
arXiv 2023
GPTScore: Evaluate as You Desire
arXiv 2023
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization
arXiv 2023
Align on the Fly: Adapting Chatbot Behavior to Established Norms
arXiv 2023
On Learning to Summarize with Large Language Models as References
arXiv 2023
DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions
arXiv 2023
How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation
arXiv 2023
BRIO: Bringing Order to Abstractive Summarization
ACL 2022 5
Towards a Unified Multi-Dimensional Evaluator for Text Generation
arXiv 2022
reStructured Pre-training
arXiv 2022
I^2R-Net: Intra- and Inter-Human Relation Network for Multi-Person Pose Estimation
arXiv 2022
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
arXiv 2022
BARTScore: Evaluating Generated Text as Text Generation
NeurIPS 2021 12
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
arXiv 2021
XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation
EMNLP 2021 11
SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization
ACL 2021 5