In the realm of Artificial Intelligence (AI), Large Language Models (LLMs) such as ChatGPT, Claude, and PaLM have emerged as industry leaders. However, aligning these models with human values, so that bias and hallucinations are minimized, is still an ongoing effort. Our R&D department has been testing various approaches to address these issues, one of which is Reinforcement Learning from Human Feedback (RLHF). In this post, we walk you and your organization through the steps of RLHF, a process that can significantly improve the accuracy and helpfulness of LLM responses.
Step 1: Generating Initial Data
The journey begins by feeding a diverse range of prompts into your LLM and recording its responses to form the initial dataset. This dataset acts as a baseline to measure the model's existing performance.
Note: the responses generated by your large language model are known as completions.
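The data collection step can be sketched roughly as follows. This is a minimal illustration, not our production pipeline: build_initial_dataset and stub_generate are hypothetical names, stub_generate stands in for a real call to your LLM, and sampling three completions per prompt is an assumption made so that later steps have something to compare.

```python
import itertools
from typing import Callable


def build_initial_dataset(
    prompts: list[str],
    generate: Callable[[str], str],
    samples_per_prompt: int = 3,
) -> list[dict]:
    """Record several completions for every prompt."""
    dataset = []
    for prompt in prompts:
        completions = [generate(prompt) for _ in range(samples_per_prompt)]
        dataset.append({"prompt": prompt, "completions": completions})
    return dataset


# Hypothetical stand-in for a real model call; replace with your own LLM client.
_counter = itertools.count(1)


def stub_generate(prompt: str) -> str:
    return f"draft {next(_counter)} for: {prompt}"


dataset = build_initial_dataset(
    ["What is RLHF?", "Define a completion."], stub_generate
)
```

Each record keeps the prompt next to its completions, so labelers in the next step can review them side by side.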
Step 2: Creating a Ranking Dataset
The next step involves human evaluators, also known as labelers, who review and rate the helpfulness of each completion based on a predefined criterion. A higher rating could be awarded to completions that are coherent, informative, and directly address the prompt.
It is important that the predefined criterion is clearly defined and that human labelers assess each completion accurately and consistently. The reward model trained in Step 4 learns from these labels, and it is what later allows your LLM to iteratively improve its completions without further human intervention.
How the criterion is defined is therefore key: if it is vague or poorly chosen, the labels will be noisy and the LLM will be steered towards the wrong kind of response.
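One simple way to turn labeler ratings into a ranking is sketched below. The 1 to 5 rating scale, the use of the mean across labelers, and the function name rank_completions are all assumptions for illustration; your own criterion may call for a different aggregation.

```python
from statistics import mean


def rank_completions(ratings: dict[str, list[int]]) -> list[tuple[str, int]]:
    """Return (completion, rank) pairs, where rank 1 is the best mean rating."""
    average = {completion: mean(scores) for completion, scores in ratings.items()}
    # Sort from highest to lowest average rating; ties keep their input order.
    ordered = sorted(average, key=average.get, reverse=True)
    return [(completion, rank) for rank, completion in enumerate(ordered, start=1)]


# Hypothetical ratings from three labelers on a 1-5 helpfulness scale.
ratings = {
    "completion A": [4, 5, 4],
    "completion B": [2, 1, 2],
    "completion C": [3, 4, 3],
}
ranking = rank_completions(ratings)
```

Averaging over several labelers is one way to dampen individual disagreement, which is part of why a clearly defined criterion matters.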
Step 3: Pairwise Comparison Dataset Preparation
This phase transforms the human feedback into a pairwise training dataset by comparing every possible pair of completions and assigning a reward based on the evaluators' rankings.
We take pairs of completions and set rewards. Each pair contains a preferred completion and a non-preferred completion. We signify the preferred completion with a 1 and the non-preferred completion with a 0.
As can be seen in the second pairwise dataset in the above diagram, we then reorder the completions in each pair so that the preferred completion comes first, because reward model training generally expects the preferred completion before the non-preferred one.
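The pairing and reordering described above can be sketched as follows. The function name build_pairwise_dataset and the field names chosen and rejected are our own choices for this example, and skipping tied pairs is an assumption rather than a rule.

```python
from itertools import combinations


def build_pairwise_dataset(prompt: str, ranks: dict[str, int]) -> list[dict]:
    """Build every pair of completions with the preferred one first.

    ranks maps each completion to its human rank, where 1 is the best.
    """
    pairs = []
    for first, second in combinations(ranks, 2):
        if ranks[first] == ranks[second]:
            continue  # assumption: a tie carries no preference, so skip it
        # Reorder so the preferred (lower-ranked number) completion comes first.
        if ranks[first] < ranks[second]:
            chosen, rejected = first, second
        else:
            chosen, rejected = second, first
        pairs.append(
            {
                "prompt": prompt,
                "chosen": chosen,
                "rejected": rejected,
                "rewards": (1, 0),  # 1 = preferred, 0 = non-preferred
            }
        )
    return pairs


ranks = {"completion A": 2, "completion B": 1, "completion C": 3}
pairs = build_pairwise_dataset("What is RLHF?", ranks)
```

With N completions for a prompt, this yields up to N * (N - 1) / 2 pairs, so three completions give three training pairs.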
Step 4: Training a Reward Model
Now that we have generated our pairwise dataset, we can train a reward model.
A binary classifier is deployed as a reward model to distinguish between more and less preferred completions based on human feedback. This step is vital for minimizing undesirable traits in generated text, such as toxicity.
Please note that a binary classifier is the reward model chosen for this use case; however, binary classifiers are not the only type of machine learning model that can serve as a reward model.
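To make the idea concrete, here is a toy sketch of training a reward model on preferred and non-preferred pairs. It is deliberately simplified and rests on several assumptions: real reward models are usually built on a language model, whereas this one is a linear scorer over two hypothetical hand-made features (a helpfulness signal and a toxicity signal), and it is trained with a pairwise logistic loss, which treats each pair as a binary classification of which completion is preferred.

```python
import math


def reward(weights: list[float], features: list[float]) -> float:
    """Linear reward: a weighted sum of the completion's features."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_loss(weights, pairs) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over all pairs."""
    total = 0.0
    for chosen, rejected in pairs:
        diff = reward(weights, chosen) - reward(weights, rejected)
        total += math.log1p(math.exp(-diff))
    return total / len(pairs)


def train(pairs, dim: int, lr: float = 0.5, epochs: int = 200) -> list[float]:
    """Plain gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            diff = reward(weights, chosen) - reward(weights, rejected)
            coeff = -1.0 / (1.0 + math.exp(diff))  # derivative of the loss w.r.t. diff
            for i in range(dim):
                grad[i] += coeff * (chosen[i] - rejected[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights


# Hypothetical features per completion: [helpfulness signal, toxicity signal].
# Each tuple is (preferred completion, non-preferred completion).
pairs = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.3, 0.3]),
]
weights = train(pairs, dim=2)
```

After training on this toy data, the scorer rewards the helpfulness feature and penalizes the toxicity feature, so each preferred completion scores higher than its non-preferred counterpart.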
For more information, please see the following article by Hugging Face on reward models.
Step 5: RLHF Fine Tuning
Now that our reward model is trained, it evaluates new completions generated by the LLM and produces a reward score indicating how well each completion matches human preferences. The higher the score, the better the completion.
These data points - the prompt, completion, and the corresponding reward score - are then fed into a Reinforcement Learning (RL) algorithm. This algorithm iteratively fine-tunes the LLM to enhance its responses, based on feedback from the reward model.
The relationship between the reward model and the RL algorithm forms a feedback loop, where each iteration propels the LLM closer to better alignment with human preferences.
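The feedback loop can be illustrated with a toy example. Production RLHF setups commonly use an algorithm such as PPO, often with a penalty that keeps the model close to its starting point; this sketch swaps in a plain policy-gradient update so that it fits in a few lines. The "LLM" here is just a softmax policy over three fixed candidate completions, the reward scores are hard-coded stand-ins for a reward model's output, and the exact expected gradient is used instead of sampled completions to keep the run deterministic.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rl_step(logits: list[float], rewards: list[float], lr: float = 1.0) -> list[float]:
    """One policy-gradient step that raises the expected reward."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))  # current expected reward
    # Gradient of the expected reward w.r.t. each logit is p * (r - baseline).
    return [
        logit + lr * p * (r - baseline)
        for logit, p, r in zip(logits, probs, rewards)
    ]


completions = ["helpful answer", "vague answer", "toxic answer"]
reward_scores = [1.0, 0.2, -1.0]  # hypothetical reward model scores
logits = [0.0, 0.0, 0.0]  # the policy starts with no preference

for _ in range(50):
    logits = rl_step(logits, reward_scores)

probs = softmax(logits)
```

Each iteration shifts probability towards completions the reward model scores above the current average, which is the loop described above in miniature.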
Our R&D team is currently advancing our capabilities in Reinforcement Learning and experimenting with other LLM use cases. This expertise not only enhances our internal knowledge but also strengthens our ability to assist customers effectively.