Drs. Jiancong Xiao, Weijie Su, and Qi Long have developed a novel, robust strategy for mitigating algorithmic bias in large language models (LLMs) in their paper published in the Special Issue on Statistical Science in Artificial Intelligence of the Journal of the American Statistical Association.

Their work reveals and addresses a critical challenge in AI alignment: when LLMs are trained with human feedback, the process often favors majority opinions and marginalizes minority viewpoints, a phenomenon the authors term preference collapse. To overcome this issue, the team developed preference matching RLHF, a novel approach that aligns LLMs with human preferences while preserving the diversity of that feedback. The framework reveals how bias can emerge during alignment and proposes a statistically grounded regularization strategy to counter it. This research lays a new foundation for developing trustworthy and socially responsible AI systems, offering broad implications for building models that better reflect the values of all users and setting a higher standard for fairness and transparency in AI alignment.
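To give a rough sense of what preference collapse looks like, the toy sketch below is a minimal illustration and is not taken from the paper: it assumes two candidate responses with hypothetical 70%/30% human preferences, a Bradley–Terry-style reward, and the closed-form optimum of KL-regularized reward maximization. As the KL penalty weakens, the aligned policy piles nearly all probability on the majority-preferred response, whereas a preference-matched policy would reproduce the 70/30 split.

```python
import numpy as np

# Hypothetical toy setup: two candidate responses to one prompt, with
# human preference probabilities 70% vs. 30% (illustrative numbers only,
# not taken from the paper).
human_pref = np.array([0.7, 0.3])   # preference-matched target distribution
reward = np.log(human_pref)         # a reward consistent with these preferences
ref_policy = np.array([0.5, 0.5])   # uniform reference policy

def kl_regularized_policy(reward, ref_policy, beta):
    """Closed-form optimum of reward maximization with a KL penalty of
    strength beta: pi(y) proportional to ref(y) * exp(reward(y) / beta)."""
    logits = np.log(ref_policy) + reward / beta
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# As beta shrinks, the policy concentrates on the majority response --
# an illustration of preference collapse.
for beta in [1.0, 0.3, 0.1]:
    print(f"beta={beta}:", kl_regularized_policy(reward, ref_policy, beta).round(3))

# A preference-matched policy would instead reproduce the 0.7 / 0.3 split.
print("preference-matched target:", human_pref)
```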

This work was led by Dr. Jiancong Xiao, a postdoctoral researcher working under the joint supervision of Drs. Weijie Su and Qi Long.