Can LLMs Hire Fairly? Racial Bias in Resume Screening

We audit fourteen mainstream large language models (LLMs) for hiring discrimination using the paired-resume methodology of Kline, Rose, and Walters (2022). The sole 2023-vintage model reproduces the pro-White callback gap documented in field experiments on labor market discrimination (+2.12 pp, significant at the 1% level). Every model released in 2024 or later shows either a null gap or a significant pro-Black reversal (as large as -3.01 pp). The same pattern holds on the gender axis. Drawing on 24,024 paired postings for each of the 14 models, we document a reversal in the direction of algorithmic hiring bias across model generations.
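
As an illustration of how a callback gap in percentage points and its significance might be computed from paired outcomes, the following is a minimal sketch, not the estimator used in this paper. It assumes each posting yields one binary callback decision per resume in the pair, takes the gap as the mean within-pair difference (White minus Black, so positive values are pro-White), and uses a normal approximation for a two-sided test. The function name `callback_gap`, the input layout, and the toy data are all hypothetical.

```python
import math

def callback_gap(pairs):
    """Estimate a paired callback gap.

    pairs: list of (white_callback, black_callback) tuples, each entry 0 or 1,
           one tuple per job posting (hypothetical input layout).
    Returns the gap in percentage points, its standard error, a z statistic,
    and a two-sided p-value from a normal approximation.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two paired postings")

    # Within-pair difference: +1 if only the White resume gets a callback,
    # -1 if only the Black resume does, 0 if both or neither do.
    diffs = [w - b for w, b in pairs]
    mean = sum(diffs) / n

    # Sample variance of the within-pair differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var / n)

    z = mean / se if se > 0 else float("nan")
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2)) if se > 0 else float("nan")

    return {"gap_pp": 100 * mean, "se_pp": 100 * se, "z": z, "p_value": p_value}


# Toy data only, not from the study: 100 postings.
toy_pairs = [(1, 0)] * 30 + [(0, 1)] * 10 + [(1, 1)] * 40 + [(0, 0)] * 20
toy_result = callback_gap(toy_pairs)
```

Working with within-pair differences rather than two independent proportions reflects the paired design: both resumes face the same posting, so posting-level variation cancels out of the comparison.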