Eliciting latent knowledge from language reward models

Code for my thesis titled "Eliciting latent knowledge from language reward models" for the MPhil in Machine Learning and Machine Intelligence at the University of Cambridge.

Idea

Figure: the architecture of the reward model.

Use methods that discover latent knowledge (DLK), such as CCS, to build reward models that promote truthfulness. Then use these reward models for reinforcement learning (RL) fine-tuning to improve the "truthfulness" of LLMs.
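
A CCS-style probe is a small model trained on the hidden states of a frozen LLM, and its output can be read as the probability that a statement is true. As a rough, hypothetical sketch (the class name, layer choice, and sigmoid parameterisation are assumptions, not the thesis code), the reward model boils down to something like this:

# Hypothetical sketch (not the thesis code): a linear probe on the hidden
# states of a frozen LLM whose sigmoid output is read as the probability
# that a statement is true and used directly as the reward.
import torch
import torch.nn as nn


class TruthfulnessProbe(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, hidden_size), e.g. the hidden state of the
        # last token at some intermediate layer of the frozen language model.
        return torch.sigmoid(self.linear(hidden_states)).squeeze(-1)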

The core piece of code in the project is the fine-tuning training loop:

import string

import torch
from tqdm import tqdm


CHARACTERS_TO_FILTER = string.punctuation + " \n"


def is_answer_yes_no(answer):
    return answer in ["Yes", "No"]


def postprocess_response(response):
    while response and response[-1] in CHARACTERS_TO_FILTER:
        response = response[:-1]
    return response


def train(
    ppo_trainer,
    tokenizer,
    generation_kwargs,
    get_rewards,
    script_args, config,
):
    n_epochs = config.steps // len(ppo_trainer.dataloader)

    for epoch in range(1, n_epochs + 1):
        loop = tqdm(
            enumerate(ppo_trainer.dataloader, 1),
            total=len(ppo_trainer.dataloader), leave=False
        )
        for batch_idx, batch in loop:
            # Get the input tensors
            question_tensors = batch["input_ids"]

            # Get the generations
            response_tensors = ppo_trainer.generate(
                question_tensors,
                return_prompt=False,
                batch_size=script_args.generator_batch_size,
                **generation_kwargs,
            )
            responses = tokenizer.batch_decode(
                response_tensors, skip_special_tokens=True,
                spaces_between_special_tokens=False
            )

            # Postprocess the responses
            if script_args.postprocess_responses:
                responses = [postprocess_response(x) for x in responses]
            batch["response"] = responses

            # Compute the rewards (scores)
            texts = [q + " " + r for q, r in zip(batch["query"], batch["response"])]
            rewards = get_rewards(texts)

            # Replace reward for undesired answers to -1
            mask = [not is_answer_yes_no(x) for x in batch["response"]]
            mask = torch.tensor(mask, dtype=torch.bool) # cast to tensor
            rewards[mask] = -1

            # Make the rewards a list of tensors
            rewards = [x for x in rewards]

            # Run PPO step
            stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
            ppo_trainer.log_stats(stats, batch, rewards)

Note that the generation_kwargs look something like this:

generation_kwargs = {
    "top_k": 0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": 100_000,
    "pad_to_multiple_of": 8,
    "max_new_tokens": 2,
}
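
The get_rewards callable passed to train() is not shown above. A minimal hypothetical sketch of what it might look like, assuming a frozen base model and a trained probe such as the one sketched earlier (function names, padding handling, and layer choice are assumptions, not the actual code in this repository):

# Hypothetical sketch of a `get_rewards` callable compatible with the training
# loop above: embed each "question + answer" text with the frozen base model,
# then score the last-token hidden state with a trained probe.
import torch


def make_get_rewards(base_model, tokenizer, probe, layer=-1, device="cuda"):
    @torch.no_grad()
    def get_rewards(texts):
        inputs = tokenizer(
            texts, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        outputs = base_model(**inputs, output_hidden_states=True)
        # Index of the last non-padding token in each sequence.
        last_token_idx = inputs["attention_mask"].sum(dim=1) - 1
        hidden = outputs.hidden_states[layer][
            torch.arange(len(texts), device=device), last_token_idx
        ]
        return probe(hidden)  # 1-D tensor of rewards, one per input text

    return get_rewards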

For more details, see the accompanying blog post and the full pdf of the thesis.

The core libraries are EleutherAI/elk for eliciting latent knowledge, trl for RL fine-tuning, and EleutherAI/lm-evaluation-harness for evaluation.

Installation and prerequisites

  1. Clone the repository.
  2. Create a new conda environment in which all the libraries will be installed. Note that I dumped the dependencies used in my environment at the end of the project into the environment.yml file. It is not always possible to install from it directly by simply running:
    conda env create -f environment.yml
    but do try referencing it; you may find it helpful.
  3. Install the EleutherAI/elk library. The version from this commit was used (though trying their newest techniques might be worthwhile too). Installation instructions can be found in the README of the provided link. Install it into a folder adjacent to the directory containing the cloned repository.
    • Once installed, copy the custom-prompts/AugustasM/ folder into elk/elk/promptsource/templates/, i.e. you want the folder elk/elk/promptsource/templates/AugustasM to exist.
  4. Install the Language Model Evaluation Harness (EleutherAI/lm-evaluation-harness). To make sure my results match the Open LLM Leaderboard (link), this version of the harness was used. Installation instructions can be found in the README. Install it into a folder adjacent to the directory containing the cloned repository.
  5. The version of the harness used by the Open LLM Leaderboard does not support distributed inference, so the version of the harness on the big-refactor branch was also used. The version from this commit was used, but again, checking the current state of the branch might be worth it. Installation instructions can be found in the README. Install it into a folder adjacent to the directory containing the cloned repository. To avoid a clash with the original harness repository cloned above, you can wrap this version in another folder, e.g. I installed it into ~/lm_evalution_harness_refactored/lm-evalution-harness.
    • Once installed, copy the files in custom-prompts/qnli/* into ~/<wrap-directory>/lm-evalution-harness/tasks/glue/qnli/, e.g. in my case I copied the files to the ~/lm_evalution_harness_refactored/lm-evalution-harness/tasks/glue/qnli/ folder. Make sure you copy the files and not the folder, i.e. you want to extend the contents of the existing glue/qnli/ folder.

The following guide might be useful for checking out a desired commit, but the main command you want to use is this:

git checkout <commit-id>

and then you can use

git show HEAD

to see if you are using the right version of the code.

Usage

There are four main steps to run the method on new data:

  1. Split the dataset and prepare it for reward model training and RL fine-tuning.
  2. Train a reward model.
  3. Perform RL fine-tuning on a pre-trained LLM.
  4. Evaluate the fine-tuned LLM on both target and general NLP tasks.

Not all steps are fully automated, so some manual work has to be done, as explained in more detail below.

Dataset preparation

The first thing to do is prepare the dataset for reward model training and RL fine-tuning. The notebooks in the src/dataset_formation/ folder can be used for this. Make sure you create a src/dataset_formation/datasets/ folder; it is ignored in the remote repository, but will hold the temporary files created in the process before the datasets are pushed to the Hugging Face Hub.

Most commonly, the workflow is as follows:

  1. Use the form_train_ppo_datasets.ipynb notebook to split the training data of a dataset into train and ppo splits. The former is used for reward model training, and the latter for RL fine-tuning (it is called ppo because the PPO algorithm is used for fine-tuning). The notebook requires some manual work: some cells are commented out when they should not be, but reading the code carefully should make it clear what is going on; also refer to my blog post and the thesis pdf (linked above). A hypothetical code sketch of the splitting step is shown after this list. Roughly speaking:
    • Split the data.
    • Choose one or more templates for the train split and apply them.
  2. Use the format_ppo_training_dataset.ipynb notebook to apply a chosen template to the ppo split.
  3. Use the form_val_dataset.ipynb notebook to form the val split. Choosing and applying a template is required here as well.
  4. Use the push_dataset_to_hub.ipynb notebook to combine the formed temporary files and push them to the Hugging Face Hub.
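
The sketch referenced in step 1: assuming QNLI as the source dataset, the split carried out in the notebooks roughly amounts to the following (the split ratio, seed, and Hub repository id are illustrative assumptions; the real work happens in the notebooks):

# Hypothetical sketch of the dataset-splitting step performed in the notebooks:
# carve the original training data into `train` (reward model training) and
# `ppo` (RL fine-tuning) splits, then push everything to the Hugging Face Hub.
from datasets import DatasetDict, load_dataset

raw = load_dataset("glue", "qnli")

splits = raw["train"].train_test_split(test_size=0.5, seed=42)  # illustrative
dataset = DatasetDict({
    "train": splits["train"],         # used to train the probe / reward model
    "ppo": splits["test"],            # used for RL fine-tuning
    "validation": raw["validation"],  # used for evaluation
})

# dataset.push_to_hub("<your-username>/qnli-truthfulness")  # hypothetical repo id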

There are also ready-made datasets on my HF profile that can be used for the later steps of the pipeline, so you can skip this step entirely if you want. In particular, the processed QNLI dataset might be of interest (train/val and ppo).

Note that there are more notebooks under src/dataset_formation/, but they do the same thing, just tailored to a particular dataset.

Reward model training

Reward model training involves getting a few prerequisites right and then editing and running a batch script that trains a probe on a given dataset and saves the trained weights. A few things to note:

  1. Make sure you have your conda environment with all of the required dependencies activated.
  2. Create a folder called logs_elk/ adjacent to the cloned repository folder (called mlmi-thesis by default). The probe training logs will be saved there.
  3. Create a folder called elk-probes/, which will contain all of the trained probes used to build the reward models.

Note that I only provide scripts for obtaining the trained probes on a computing cluster that uses SLURM. The elk library also provides ways to do this directly from the command line; check their documentation if you need that.

With the prerequisites out of the way, open the scripts/elk.sh file and edit it to train the probe that you need. Make sure to look through the file carefully to find all of the available options.

Finally, you are ready to execute the batch script and train a probe. Run the following commands:

cd mlmi-thesis/ # Important!
scripts/launchers/run_elk.sh

RL fine-tuning

Before running the code, create a ppo_logs/ folder adjacent to the cloned repository.

Edit the scripts/ppo_vicuna.sh file to your liking. The script was tested with distributed data parallel training and 8-bit quantization. However, other configurations may work as well. Once you are done editing the script, execute the following:

cd mlmi-thesis/ # Important!
scripts/launchers/run_ppo_vicuna.sh

Note that this will use wandb logging; you can edit the project name in the src/ppo/configs.py file, in the get_ppo_config() function, under the tracker_project_name attribute.
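
For reference, the wandb-related part of that config roughly amounts to the following sketch (the model name and learning rate are illustrative, not the exact values in src/ppo/configs.py):

# Hypothetical sketch of the relevant part of get_ppo_config(); the wandb
# project name is controlled by `tracker_project_name`.
from trl import PPOConfig

config = PPOConfig(
    model_name="lmsys/vicuna-7b-v1.3",   # illustrative base model
    learning_rate=1e-5,                  # illustrative value
    log_with="wandb",                    # enable wandb logging
    tracker_project_name="mlmi-thesis",  # change this to rename the wandb project
)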

The code itself is contained in the src/ppo/ folder. For example, you might want to look there to change the quantization type (currently, 8-bit quantization is hard-coded).
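
For orientation, loading the policy with a LoRA adapter in 8-bit via trl and peft looks roughly like the sketch below (all argument values are assumptions, not the exact code in src/ppo/; switching the quantization type means changing the load_in_8bit argument):

# Hypothetical sketch of how an 8-bit policy model with a LoRA adapter might
# be constructed; the actual code lives in src/ppo/ and may differ.
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "lmsys/vicuna-7b-v1.3",  # illustrative base model
    load_in_8bit=True,       # change here to switch the quantization type
    device_map="auto",
    peft_config=lora_config,
)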

You can use src/utils/merge_lora_weights.ipynb to merge the trained LoRA matrices into the model and push it to the Hub if needed.
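
The merging itself is standard peft usage; a minimal sketch, assuming a Vicuna base model and a placeholder adapter path:

# Hypothetical sketch of merging trained LoRA matrices into the base model,
# similar in spirit to src/utils/merge_lora_weights.ipynb; paths are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base_model, "<path-or-repo-of-lora-adapter>")
model = model.merge_and_unload()  # fold the LoRA matrices into the base weights

# model.push_to_hub("<your-username>/<merged-model-name>")  # optional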

Evaluate the fine-tuned LLM

Finally, you can evaluate the fine-tuned model on the target task and on general NLP tasks. The only target task used in the thesis was the QNLI dataset, so you may have to do some extra work to implement your own custom task. The general NLP tasks are the ones from the Open LLM Leaderboard.

As a prerequisite, create a logs_eval/ folder adjacent to the cloned repository.

To evaluate on the QNLI task, edit the scripts/eval_harness_qnli_vicuna.sh file, then execute:

cd mlmi-thesis/ # Important!
scripts/launchers/run_eval_harness_qnli_vicuna.sh

This will run 8-bit inference using the new LoRA weights, or using the original model if an empty string is passed instead of the LoRA weights.

To evaluate on the Open LLM Leaderboard datasets, edit the scripts/eval_harness_qnli_vicuna_openllm.sh file. It works very similarly (mostly identically) to the script above.

After the execution finishes, the results will be available in the logs_eval/ folder.
