Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects

Large Language Models (LLMs) have shown remarkable potential for building role-playing agents (RPAs). However, current evaluation frameworks rely heavily on well-known fictional characters, which raises a critical concern: models may be drawing on memorized knowledge of these characters rather than demonstrating genuine role-playing ability. This reliance often leads to significant performance degradation when RPAs encounter unseen or out-of-distribution personas. To address this, we propose a more rigorous evaluation protocol that anonymizes characters in order to decouple role-playing proficiency from character recognition. Our experiments across multiple benchmarks show that anonymizing characters degrades performance, indicating that name exposure provides implicit cues that mask a model's true capability. To recover role fidelity in anonymous settings, we investigate personality augmentation and systematically analyze how different personality-description methods affect agent behavior and consistency. Our results show that incorporating personality information consistently improves RPA performance. This work establishes a fairer evaluation standard and validates a scalable, personality-enhanced framework for constructing robust RPAs.
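
As an illustration only, the anonymization and personality-augmentation steps described above might be sketched as follows. This is a minimal sketch under our own assumptions, not the protocol used in the paper: the function names `anonymize_profile` and `add_personality`, the placeholder "Character A", and the plain-text personality line are all hypothetical.

```python
import re


def anonymize_profile(profile: str, names: list[str],
                      placeholder: str = "Character A") -> str:
    """Replace every listed name with a neutral placeholder.

    Hypothetical helper: matching is whole-word and case-insensitive.
    """
    # Replace longer names first so a full name is handled before
    # any shorter name it contains.
    for name in sorted(names, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        profile = pattern.sub(placeholder, profile)
    return profile


def add_personality(profile: str, personality: str) -> str:
    """Append a personality description to an (anonymized) profile.

    Hypothetical format: one extra line of free text.
    """
    return f"{profile}\nPersonality: {personality}"


if __name__ == "__main__":
    raw = "Sherlock Holmes is a consulting detective. Holmes lives in London."
    anon = anonymize_profile(raw, ["Sherlock Holmes", "Holmes", "Sherlock"])
    print(add_personality(anon, "analytical, observant, aloof"))
```

In a setup like this, the same role-playing benchmark would be run once with the raw profile and once with the anonymized one, so that any gap can be attributed to name exposure rather than to the profile content.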