PrefCLM: Enhancing Preference-based Reinforcement Learning with Crowdsourced LLMs

SMART Lab, Department of Computer and Information Technology, Purdue University, West Lafayette, IN, USA

Abstract

Preference-based reinforcement learning (PbRL) is emerging as a promising approach to teaching robots through human comparative feedback, sidestepping the need for complex reward engineering. However, the substantial volume of feedback required in existing PbRL methods often leads to reliance on synthetic feedback generated by scripted teachers. While efficient, this approach reintroduces intricate reward engineering and struggles to adapt to the nuanced preferences particular to human-robot interaction (HRI) scenarios. To address these challenges, we introduce PrefCLM, a novel framework that utilizes crowdsourced large language models (LLMs) as simulated teachers in PbRL. We utilize Dempster-Shafer Theory (DST) to fuse individual preferences from multiple LLM agents at the score level, efficiently leveraging their diversity and collective intelligence. We also introduce a human-in-the-loop pipeline that facilitates collective refinements based on user interactive feedback. Experimental results across various general RL tasks show that PrefCLM achieves competitive performance compared to traditional scripted teachers and excels in facilitating more natural and efficient behaviors. A real-world user study of PrefCLM (N=10) further demonstrates its capability to tailor robot behaviors to individual user preferences, significantly enhancing user satisfaction in HRI scenarios.



How PrefCLM Works

PrefCLM operates by leveraging the collective intelligence of multiple LLM agents to evaluate robot behaviors:

  1. Given task-specific contextual information and prompts, multiple code-based evaluation functions are sampled from crowd LLM agents.
  2. A cosine similarity check module then filters the sampled evaluation functions, selecting those that align with few-shot expert preferences within a specified tolerance (see the filtering sketch after this list).
  3. Evaluative scores are continuously assigned by these selected evaluation functions to pairs of robot trajectories. These scores are aggregated through Dempster-Shafer Theory (DST) fusion to form crowdsourced preferences, which are used for reward learning in PbRL (see the fusion sketch after this list).
  4. Additionally, crowd LLM agents can also collectively refine their evaluation functions based on user interactive inputs given periodically in HRI scenarios.
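
To make the filtering step concrete, the sketch below shows one plausible reading of the cosine similarity check: each candidate evaluation function is scored on a few expert-annotated trajectory pairs, and only functions whose preference pattern is cosine-similar to the expert labels are kept. The representation of preferences as signed score differences, the `tolerance` threshold, and the helper names are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def preference_vector(eval_fn, trajectory_pairs):
    """Score each (sigma0, sigma1) pair with a candidate evaluation
    function and return the vector of score differences."""
    return np.array([eval_fn(t0) - eval_fn(t1) for t0, t1 in trajectory_pairs])

def filter_candidates(candidate_fns, trajectory_pairs, expert_labels, tolerance=0.7):
    """Keep only candidate evaluation functions whose preference pattern on
    a few expert-annotated pairs is cosine-similar to the expert's labels.
    `expert_labels` holds +1 if the expert prefers sigma0 and -1 otherwise."""
    expert = np.asarray(expert_labels, dtype=float)
    kept = []
    for fn in candidate_fns:
        v = preference_vector(fn, trajectory_pairs)
        denom = np.linalg.norm(v) * np.linalg.norm(expert)
        similarity = float(v @ expert) / denom if denom > 0 else 0.0
        if similarity >= tolerance:
            kept.append(fn)
    return kept
```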
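
The fusion step can likewise be sketched as follows: each crowd agent's scores for a trajectory pair are converted into a basic probability assignment over {sigma0 preferred, sigma1 preferred, either}, the assignments are combined with Dempster's rule, and the fused beliefs yield a crowdsourced preference label. The softmax mapping from scores to masses and the fixed uncertainty mass are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def scores_to_masses(s0, s1, uncertainty=0.1):
    """Map one agent's scores for a trajectory pair to a basic probability
    assignment over {sigma0 preferred, sigma1 preferred, either}.
    The softmax mapping and fixed uncertainty mass are illustrative."""
    p = np.exp([s0, s1])
    p = p / p.sum()
    return {
        frozenset({0}): (1 - uncertainty) * p[0],
        frozenset({1}): (1 - uncertainty) * p[1],
        frozenset({0, 1}): uncertainty,  # mass on "no clear preference"
    }

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule of combination."""
    combined, conflict = {}, 0.0
    for f1, v1 in m1.items():
        for f2, v2 in m2.items():
            inter = f1 & f2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
    norm = 1.0 - conflict
    return {f: v / norm for f, v in combined.items()}

def crowdsourced_preference(agent_scores):
    """Fuse (s0, s1) score pairs from several agents into a preference
    label: 0, 1, or 0.5 when the crowd is undecided."""
    masses = [scores_to_masses(s0, s1) for s0, s1 in agent_scores]
    fused = masses[0]
    for m in masses[1:]:
        fused = dempster_combine(fused, m)
    belief0 = fused.get(frozenset({0}), 0.0)
    belief1 = fused.get(frozenset({1}), 0.0)
    if abs(belief0 - belief1) < 1e-6:
        return 0.5
    return 0 if belief0 > belief1 else 1

# Example: three agents score a trajectory pair (sigma0, sigma1)
print(crowdsourced_preference([(0.7, 0.4), (0.6, 0.5), (0.3, 0.8)]))
```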



Experiments & Results: General RL Tasks

We tested PrefCLM on a range of standard RL benchmarks, including locomotion tasks (Walker, Cheetah, Quadruped) from the DeepMind Control Suite and manipulation tasks (Button Press, Door Unlock, Drawer Open) from Meta-World. We compared our method against two baselines: expert-tuned Scripted Teachers and PrefEVO, a single-LLM approach adapted from recent work on reward design.

Results from our analysis showed the following key findings:

  1. PrefCLM achieved comparable or superior performance to expert-tuned Scripted Teachers across most tasks
  2. PrefCLM outperformed PrefEVO, demonstrating the benefits of its crowdsourcing approach
  3. The few-shot mode of PrefCLM showed advantages over the zero-shot mode, especially on complex tasks
  4. Ablation studies revealed benefits of increasing crowd size and using DST fusion over majority voting

The visual comparison below highlights how PrefCLM leads to more natural and efficient robot behaviors than traditional methods: locomotion behaviors learned by Scripted Teachers (left) and PrefCLM (right) on the Cheetah Run task.



We conducted ablation studies to investigate the impact of crowdsourcing and DST fusion mechanisms within our framework. The ablation results below demonstrate the benefits of our crowdsourcing approach and the effectiveness of DST fusion in managing complexities and conflicts among LLM agents.


Real World User Study: Experiments & Results

To assess PrefCLM's ability to personalize robot behaviors and enhance user satisfaction in realistic human-robot interaction scenarios, we conducted a user study with 10 participants on a robotic feeding task. We compared PrefCLM (equipped with our Human-In-The-Loop module) against PrefEVO and a pre-trained baseline policy.

We focused on a robotic feeding task using a Kinova Jaco assistive arm, equipped with a RealSense D435 camera for face tracking. The setup closely mirrored our simulation environment to minimize the sim-to-real gap, with an emergency switch in place for safety. We recruited 10 participants (3 female, 7 male) with an average age of 24.5 years. The study began with participants expressing their initial expectations for the task. We then fine-tuned policies in simulation, incorporating periodic real-world rollouts. Participants provided feedback during this process, which we used to refine the evaluation functions. For the final evaluation, participants interacted with three different policies in a randomized order: PrefCLM with our Human-In-The-Loop module, PrefEVO, and a pre-trained baseline. After each interaction, participants rated their overall satisfaction and perceived personalization on a 1-5 Likert scale. We also conducted semi-structured interviews to gather qualitative feedback.




Results from our analysis showed the following key findings:

  1. PrefCLM achieved higher user satisfaction and personalization ratings compared to baselines
  2. Qualitative feedback from the participants indicated that PrefCLM produced more natural and personalized robot behaviors





Participants noted that the robot's behaviors under PrefCLM felt more natural and adaptive to their individual preferences. This study demonstrated PrefCLM's ability to effectively personalize robot behaviors and enhance user satisfaction in a realistic human-robot interaction scenario, highlighting its potential for practical applications in assistive robotics.


Example LLM-based Evaluation Functions in PrefCLM

We prompt LLM agents to produce functions that regard the whole robot trajectory as the evaluative object, instead of single state-action pairs as considered in scripted teachers. These functions evaluate holistic patterns and changes across time steps within the entire trajectory, in addition to the immediate effectiveness of each state-action pair, enabling a more nuanced evaluation akin to that of humans. Empirically, we observe that the evaluation functions, even those generated by homogeneous agents, exhibit diversity. This variation manifests in several ways, such as differing task-related criteria, assorted definitions of the same criteria, and varying priorities assigned to these criteria (e.g., different weighting schemes). PrefCLM capitalizes on this diversity, leveraging the unique understanding that each LLM agent brings to the task and leading to a richer and more comprehensive evaluation process.
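
As a hypothetical illustration of such a trajectory-level function (not one of the actual generated functions, which are linked below), a Walker-style evaluation might look like the following sketch; the trajectory field names, the target speed, and the weighting scheme are assumptions made for illustration.

```python
import numpy as np

def evaluate_walker_trajectory(trajectory):
    """Illustrative trajectory-level evaluation function for the Walker task.
    `trajectory` is assumed to be a list of dicts with `torso_height`,
    `forward_velocity`, and `action` entries; the criteria and weights are
    hypothetical, in the spirit of the LLM-generated functions."""
    heights = np.array([step["torso_height"] for step in trajectory])
    velocities = np.array([step["forward_velocity"] for step in trajectory])
    actions = np.array([step["action"] for step in trajectory])

    # Stability: penalize variance in torso height across the whole trajectory.
    stability = 1.0 / (1.0 + heights.var())

    # Efficiency: forward progress per unit of control effort.
    effort = np.abs(actions).sum() + 1e-6
    efficiency = velocities.sum() / effort

    # Goal achievement: fraction of steps near the target running speed.
    target_speed = 8.0  # assumed task target, for illustration only
    goal = np.mean(velocities > 0.8 * target_speed)

    # Weighted sum; the weighting scheme varies across crowd agents.
    return 0.3 * stability + 0.3 * efficiency + 0.4 * goal
```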

Example evaluation functions generated by crowdsourced LLM agents:

Walker Task Evaluation Functions: These functions evaluate the walker's performance based on stability, efficiency, and goal achievement (see assets/rlhf_rewards/walker_task.txt).

Button Press Task Evaluation Functions: These functions evaluate the robot's performance in pressing a button, considering factors like proximity, force applied, and task completion (see assets/rlhf_rewards/button_task.txt).

Acknowledgement

We thank the participants who took part in the human evaluation tests.

BibTeX

@article{prefclm2024,
  title={PrefCLM: Enhancing Preference-based Reinforcement Learning with Crowdsourced Large Language Models},
  author={Wang, Ruiqi and Zhao, Dezhong and Yuan, Ziqin and Obi, Ike and Min, Byung-Cheol},
  journal={arXiv preprint arXiv:2407.08213},
  year={2024}
}