TallyTrain: Communication-Efficient Federated Distillation

Federated learning is bandwidth-bound on two orthogonal axes: model size, which limits how often parameter-averaging methods can afford to merge, and class count, which makes per-probe soft-label distillation prohibitive at large vocabularies. Both ceilings tighten as modern systems scale. We collapse the class-count axis to $\lceil \log_2 C \rceil$ bits per probe by transmitting only each peer's $\arg\max$ class index, where $C$ is the number of output classes. The resulting protocol, TallyTrain, is not merely compressed: under non-IID training it can be preferable to soft-label distillation, because under-trained peers are confidently wrong and majority voting filters this noise where soft-label averaging amplifies it. Across standard benchmarks, TallyTrain matches or beats soft-label distillation at up to three orders of magnitude less communication. We also relax the model-size axis: we compose the cheap hard-label consensus with sparse parameter merges to obtain a bandwidth-bridge variant, which Pareto-dominates every tested operating point of the standard FedAvg, FedProx and FedDF baselines.
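
To make the per-probe cost and the voting argument concrete, the following is a minimal sketch, not the paper's implementation. It assumes each peer reports one integer class index per probe, that soft labels would otherwise be sent as 32-bit floats, and that ties are broken toward the lowest class index; the function names and the toy probabilities are hypothetical and chosen only for illustration.

```python
import numpy as np


def hard_label_bits(num_classes: int) -> int:
    """Bits to encode one class index: ceil(log2(C)), computed exactly on integers."""
    return (num_classes - 1).bit_length()


def soft_label_bits(num_classes: int, bits_per_float: int = 32) -> int:
    """Bits to send a full probability vector (assumes 32-bit floats by default)."""
    return num_classes * bits_per_float


def majority_vote(peer_labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Consensus label per probe from a (num_peers, num_probes) array of class indices.

    Ties go to the lowest class index (an assumption of this sketch).
    """
    num_probes = peer_labels.shape[1]
    counts = np.zeros((num_probes, num_classes), dtype=np.int64)
    for labels in peer_labels:
        counts[np.arange(num_probes), labels] += 1
    return counts.argmax(axis=1)


# Toy single-probe case with 4 classes: two peers weakly favour class 0,
# one under-trained peer is confidently wrong about class 3.
peer_probs = np.array([
    [0.40, 0.200, 0.100, 0.30],
    [0.40, 0.200, 0.100, 0.30],
    [0.00, 0.005, 0.005, 0.99],
])

# Soft-label averaging: the confident outlier dominates the mean.
soft_consensus = int(peer_probs.mean(axis=0).argmax())

# Hard-label voting: each peer sends only its argmax index.
hard_labels = peer_probs.argmax(axis=1).reshape(-1, 1)  # shape (num_peers, 1)
hard_consensus = int(majority_vote(hard_labels, num_classes=4)[0])

# Per-probe cost for a hypothetical 1000-class task.
compression_ratio = soft_label_bits(1000) / hard_label_bits(1000)
```

In this toy case the averaged soft label selects class 3 while the vote selects class 0, and for a hypothetical 1000-class task the per-probe cost drops from 32,000 bits to 10 bits under the stated float32 assumption.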