Cunxiang Wang

Deep Research: A Systematic Survey

arXiv 2025

TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

arXiv 2025

LongSafety: Evaluating Long-Context Safety of Large Language Models

arXiv 2025

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

arXiv 2025

RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

arXiv 2024

SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

arXiv 2024

NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

arXiv 2024

Knowledge Conflicts for LLMs: A Survey

arXiv 2024

A Survey on Evaluation of Large Language Models

arXiv 2023

PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

arXiv 2023

TRAMS: Training-free Memory Selection for Long-range Language Modeling

arXiv 2023

Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

arXiv 2023