Preference-based reinforcement learning (PbRL) is emerging as a promising approach to teaching robots through human comparative feedback, sidestepping the need for complex reward engineering. However, the substantial volume of feedback required by existing PbRL methods often leads to reliance on synthetic feedback generated by scripted teachers. While efficient, this approach reintroduces the need for intricate reward engineering and struggles to adapt to the nuanced preferences particular to human-robot interaction (HRI) scenarios. To address these challenges, we introduce PrefCLM, a novel framework that utilizes crowdsourced large language models (LLMs) as simulated teachers in PbRL. We utilize Dempster-Shafer Theory (DST) to fuse individual preferences from multiple LLM agents at the score level, efficiently leveraging their diversity and collective intelligence. We also introduce a human-in-the-loop pipeline that facilitates collective refinements based on user interactive feedback. Experimental results across various general RL tasks show that PrefCLM achieves competitive performance compared to traditional scripted teachers and excels in facilitating more natural and efficient behaviors. A real-world user study of PrefCLM (N=10) further demonstrates its capability to tailor robot behaviors to individual user preferences, significantly enhancing user satisfaction in HRI scenarios.
PrefCLM operates by leveraging the collective intelligence of multiple LLM agents to evaluate robot behaviors: each agent generates a trajectory-level evaluation function, and the resulting preference scores are fused at the score level using Dempster-Shafer Theory.
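As a concrete illustration of the score-level fusion step, the sketch below shows one way Dempster-Shafer combination could aggregate per-agent preference scores into a single preference label. The score-to-mass mapping, the confidence parameter, and the function names are illustrative assumptions, not the exact implementation used in PrefCLM.

```python
import numpy as np

def scores_to_mass(s0, s1, confidence=0.8):
    """Map one agent's scores for two trajectories to a DST mass function.

    Mass is assigned to {traj0 preferred}, {traj1 preferred}, and the full
    frame (uncertainty). The mapping and confidence value are assumptions.
    """
    p = 1.0 / (1.0 + np.exp(-(s0 - s1)))           # soft preference for traj0
    return {"t0": confidence * p,
            "t1": confidence * (1.0 - p),
            "either": 1.0 - confidence}             # residual uncertainty

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule of combination."""
    conflict = m1["t0"] * m2["t1"] + m1["t1"] * m2["t0"]
    norm = 1.0 - conflict
    return {
        "t0": (m1["t0"] * m2["t0"] + m1["t0"] * m2["either"]
               + m1["either"] * m2["t0"]) / norm,
        "t1": (m1["t1"] * m2["t1"] + m1["t1"] * m2["either"]
               + m1["either"] * m2["t1"]) / norm,
        "either": (m1["either"] * m2["either"]) / norm,
    }

def crowd_preference(agent_scores):
    """Fuse (score_traj0, score_traj1) pairs from multiple LLM agents."""
    masses = [scores_to_mass(s0, s1) for s0, s1 in agent_scores]
    fused = masses[0]
    for m in masses[1:]:
        fused = dempster_combine(fused, m)
    return 0 if fused["t0"] > fused["t1"] else 1    # preferred trajectory index

# Example: three agents score the same pair of trajectories.
print(crowd_preference([(7.5, 6.0), (6.8, 7.1), (8.0, 5.5)]))  # -> 0
```

Keeping explicit mass on the full frame lets an unsure agent abstain softly rather than cast a hard vote, which is what allows conflicting agents to be reconciled rather than averaged away.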
We tested PrefCLM on a range of standard RL benchmarks, including locomotion tasks (Walker, Cheetah, Quadruped) from the DeepMind Control Suite and manipulation tasks (Button Press, Door Unlock, Drawer Open) from Meta-World. We compared our method against two baselines: expert-tuned Scripted Teachers and PrefEVO, a single-LLM approach adapted from recent work on reward design.
Our analysis yielded the following key findings:
The visual comparison below highlights how PrefCLM leads to more natural and efficient robot behaviors compared to traditional methods: locomotion behaviors learned by Scripted Teachers (left) and PrefCLM (right) on the Cheetah Run task.
We conducted ablation studies to investigate the impact of crowdsourcing and DST fusion mechanisms within our framework. The ablation results below demonstrate the benefits of our crowdsourcing approach and the effectiveness of DST fusion in managing complexities and conflicts among LLM agents.
To assess PrefCLM's ability to personalize robot behaviors and enhance user satisfaction in realistic human-robot interaction scenarios, we conducted a user study with 10 participants on a robotic feeding task. We compared PrefCLM (equipped with our Human-In-The-Loop module) against PrefEVO and a pre-trained baseline policy.
We focused on a robotic feeding task using a Kinova Jaco assistive arm, equipped with a RealSense D435 camera for face tracking. The setup closely mirrored our simulation environment to minimize the sim-to-real gap, with an emergency switch in place for safety. We recruited 10 participants (3 female, 7 male) with an average age of 24.5 years. The study began with participants expressing their initial expectations for the task. We then fine-tuned policies in simulation, incorporating periodic real-world rollouts. Participants provided feedback during this process, which we used to refine the evaluation functions. For the final evaluation, participants interacted with three different policies in a randomized order: PrefCLM with our Human-In-The-Loop module, PrefEVO, and a pre-trained baseline. After each interaction, participants rated their overall satisfaction and perceived personalization on a 1-5 Likert scale. We also conducted semi-structured interviews to gather qualitative feedback.
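The loop that folds participant feedback back into the evaluation functions could look roughly like the sketch below. The prompt wording, the `query_llm` helper, and the idea of re-prompting each crowdsourced agent with the user's comments are assumptions made for illustration; they are not the exact pipeline.

```python
def refine_evaluation_functions(agents, eval_fn_sources, user_feedback, query_llm):
    """Hypothetical human-in-the-loop refinement: ask each LLM agent to revise
    its trajectory evaluation function in light of the user's interactive feedback.

    `query_llm(agent, prompt)` is a placeholder for whatever LLM API is in use.
    """
    revised = []
    for agent, source in zip(agents, eval_fn_sources):
        prompt = (
            "You previously wrote this evaluation function for a robotic "
            "feeding task:\n\n" + source + "\n\n"
            "The user gave the following feedback after watching the robot:\n"
            + user_feedback + "\n\n"
            "Rewrite the function so that trajectories matching the user's "
            "preferences receive higher scores. Return only Python code."
        )
        revised.append(query_llm(agent, prompt))
    return revised
```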
Our analysis of the user study showed the following key findings:
Participants noted that the robot's behaviors under PrefCLM felt more natural and adaptive to their individual preferences. This study demonstrated PrefCLM's ability to effectively personalize robot behaviors and enhance user satisfaction in a realistic human-robot interaction scenario, highlighting its potential for practical applications in assistive robotics.
We prompt LLM agents to produce functions that regard the whole robot trajectory as the evaluative object, instead of single state-action pairs as considered in scripted teachers. These functions aim to evaluate the holistic patterns and changes across time-steps within the entire trajectory in addition to the immediate effectiveness of each state-action pair, ensuring a more nuanced evaluation akin to humans. Empirically, we observe that the evaluation functions, even those generated by homogeneous agents, exhibit diversity. This variation manifests in several ways, such as differing task-related criteria, assorted definitions for the same criteria, and varying priorities assigned to these criteria (e.g., different weighting schemes). Our PrefCLM capitalizes on this diversity, leveraging unique understanding that each LLM agent brings to the task and leading to a richer and more comprehensive evaluation process.
Example evaluation functions generated by crowdsourced LLM agents are shown below.
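Since the generated code itself is not reproduced here, the function below is a hand-written stand-in in the same style: it scores an entire Cheetah Run trajectory on forward progress, gait smoothness, and torso stability rather than scoring individual state-action pairs. The specific criteria, state indices, and weights are assumptions; a different agent might define or prioritize them differently, which is exactly the diversity PrefCLM exploits.

```python
import numpy as np

def evaluate_trajectory(states, actions):
    """Illustrative trajectory-level evaluation function (Cheetah Run style).

    Evaluates holistic patterns across the whole trajectory:
    - forward progress: mean forward velocity over all time steps
    - smoothness: penalizes large changes between consecutive actions
    - stability: penalizes variance of the torso height across the trajectory
    The criteria, state indices, and weights are assumptions for illustration.
    """
    states = np.asarray(states)     # shape (T, state_dim)
    actions = np.asarray(actions)   # shape (T, action_dim)

    forward_velocity = states[:, 0]   # assumed index of forward velocity
    torso_height = states[:, 1]       # assumed index of torso height

    progress = forward_velocity.mean()
    smoothness = -np.abs(np.diff(actions, axis=0)).mean()
    stability = -torso_height.var()

    # Weights reflect one possible prioritization; other agents may differ.
    return 1.0 * progress + 0.3 * smoothness + 0.2 * stability
```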
We thank the participants who took part in the human evaluation study.
@article{prefclm2024,
title={PrefCLM: Enhancing Preference-based Reinforcement Learning with Crowdsourced Large Language Models},
author={Wang, Ruiqi and Zhao, Dezhong and Yuan, Ziqin and Obi, Ike and Min, Byung-Cheol},
journal={arXiv preprint arXiv:2407.08213},
year={2024}
}