Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT-Bench, n=400; LLMBar, n=200; a custom controlled dataset, n=375), and four bias types. Our headline practical finding is that a mid-tier model with the right debiasing can outperform frontier judges at a fraction of the cost: Gemini 2.5 Flash with the Combined Budget strategy reaches the highest agreement of any configuration we tested (71.0%, kappa=0.549) at $0.001 per evaluation, about 15x cheaper than the best frontier setup (Claude Sonnet 4, 69.5%, $0.015 per evaluation). Other key findings: (1) Style bias is the dominant bias (0.10-0.76 across models, favoring markdown over plain prose) and far exceeds position bias (<=0.04), yet it is rarely studied. (2) Verbosity bias is heterogeneous when measured in a length-aware way: Pro, Flash, and Llama prefer longer answers (+0.24 to +0.44), Claude prefers concise ones (-0.12), and GPT-4o is neutral (-0.04); on truncation controls, all models correctly prefer the complete response (0.88-1.00 accuracy). (3) Debiasing helps multiple models: the gains for Claude S8 (+11.5 pp), Flash S8 (+7.5 pp), and Claude S5 (+7.3 pp) survive Holm-Bonferroni correction, and Flash S1 (+4.7 pp) and Llama S8 (+4.5 pp) are also significant. We release our evaluation framework, the 375-pair controlled dataset, and per-instance cached results for all nine strategies.
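
The abstract reports chance-corrected agreement (kappa) and significance under Holm-Bonferroni correction. The following is a minimal sketch of how such quantities are commonly computed, not the released framework's code. It assumes that kappa denotes Cohen's kappa between a judge's preference labels and reference labels, and that the correction is applied to a family of p-values at alpha = 0.05; neither detail is stated in the abstract. The function names `cohen_kappa` and `holm_bonferroni` and all data in the usage lines are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(judge: Sequence[Hashable], reference: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two equally long label sequences."""
    if len(judge) != len(reference) or len(judge) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(judge)
    # Observed agreement: fraction of items on which the two raters match.
    observed = sum(j == r for j, r in zip(judge, reference)) / n
    # Expected agreement if both raters labeled independently
    # with their own marginal label frequencies.
    judge_freq, ref_freq = Counter(judge), Counter(reference)
    expected = sum(judge_freq[k] * ref_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> list[bool]:
    """Return, per hypothesis, whether it is rejected under Holm's step-down rule."""
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is tested at alpha / (m - k + 1).
        if p_values[idx] > alpha / (m - rank):
            break  # first failure: this and all larger p-values are retained
        reject[idx] = True
    return reject


# Hypothetical usage with made-up labels and p-values (not results from the study).
kappa = cohen_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"])
survivors = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

Holm's procedure is a step-down method: it stops at the first p-value that misses its threshold, so a comparison can be significant on its own yet fail to survive correction once the whole family of tests is taken into account.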