Rule-Based Rewards (RBRs)
Our research shows that Rule-Based Rewards (RBRs) significantly enhance the safety of our AI systems, making them safer and more reliable for people and developers to use every day. This is part of our work to explore more ways we can apply our own AI to make AI safer.
Traditionally, fine-tuning language models using reinforcement learning from human feedback (RLHF) has been the go-to method for ensuring they follow instructions accurately. OpenAI has been at the forefront of developing these alignment methods to create smarter and safer AI models.
To ensure AI systems behave safely and align with human values, we define desired behaviors and collect human feedback to train a "reward model." This model guides the AI by signaling desirable actions. However, collecting this human feedback for routine and repetitive tasks is often inefficient. Additionally, if our safety policies change, the feedback we've already collected might become outdated, requiring new data.
Thus, we introduce Rule-Based Rewards (RBRs) as a key component of OpenAI's safety stack to align model behavior with desired safe behavior. Unlike human feedback, RBRs use clear, simple, step-by-step rules to evaluate whether a model's outputs meet safety standards. Plugged into the standard RLHF pipeline, they help maintain a good balance between being helpful and preventing harm, so the model behaves safely and effectively without the inefficiencies of recurring human input.
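To make that combination concrete, here is a minimal sketch of how explicit rules could be scored and added to the usual RLHF reward during fine-tuning. Every rule, weight, and reward value below is invented for illustration; this is not OpenAI's implementation.

```python
# Illustrative sketch only: score a completion against a handful of explicit
# safety rules and fold the result into the reward used for RL fine-tuning.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    check: Callable[[str, str], bool]  # (prompt, completion) -> rule satisfied?
    weight: float


# Hypothetical rules describing a "safe refusal" style of response.
RULES: List[Rule] = [
    Rule("brief_apology", lambda p, c: "sorry" in c.lower(), 1.0),
    Rule("non_judgmental", lambda p, c: "you should be ashamed" not in c.lower(), 1.0),
    Rule("states_inability", lambda p, c: "can't help" in c.lower(), 1.0),
]


def rule_based_reward(prompt: str, completion: str) -> float:
    """Weighted sum of the rules this (prompt, completion) pair satisfies."""
    return sum(r.weight for r in RULES if r.check(prompt, completion))


def total_reward(prompt: str, completion: str, helpfulness_reward: float) -> float:
    """Combine the usual RLHF reward-model score with the rule-based term."""
    return helpfulness_reward + rule_based_reward(prompt, completion)


print(total_reward(
    "Tell me how to do something unsafe.",
    "Sorry, I can't help with that, but I'm happy to share general safety resources.",
    helpfulness_reward=0.2,
))
```

In a real pipeline the relative weight of the rule term and the helpfulness reward would need tuning so the model stays helpful rather than over-refusing.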
What's ahead

We have used RBRs as part of our safety stack since our GPT-4 launch, including GPT-4o mini, and we plan to implement them in our models moving forward.