
./perplexity should allow multiple files, and macro-averaging #7066

Open
turian opened this issue May 3, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@turian

turian commented May 3, 2024

Summary

I'm doing new LLM benchmarks on a novel corpus of documents, using perplexity. However, the difficulty of running ./perplexity over multiple files is getting in the way of my benchmarks.

Related: #2321

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

  • The ability to specify multiple files for ./perplexity
  • Less important: The ability to macro-average the PPL and error.

Motivation

./perplexity is demonstrated on a wikitext corpus where all articles are concatenated.

However, one might prefer to compute perplexity over a corpus of documents stored as individual files, so that no window spans the boundary between two different documents.

The naive approach of invoking ./perplexity once per document has several issues, the main one being that model load time is expensive. It would be great if multiple '-f' arguments were possible.

Additionally, and less importantly, one might want to macro-average the perplexities: compute the PPL and error for each document and then average over those, instead of computing the PPL and error over all windows of all documents pooled together (see the sketch below). The former is preferable if we want all documents to contribute equally to the PPL, regardless of their length. (One difficulty is deciding precisely how to compute the error in the macro-averaged scenario; there are at least two differing approaches that immediately spring to mind.)
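
To make the distinction concrete, here is a minimal sketch of the two averaging schemes, using made-up per-token NLL values (the per-document error term is left out, per the open question above):

import numpy as np

# Hypothetical per-document, per-token negative log-likelihoods.
doc_nlls = [np.array([2.1, 1.8, 2.4, 2.0]), np.array([1.2, 1.5])]

# Pooled average (what concatenating everything gives you): average the NLL
# over all tokens of all documents, then exponentiate, so longer documents
# dominate the result.
pooled_ppl = np.exp(np.concatenate(doc_nlls).mean())

# Macro average (proposed): compute PPL per document, then average the
# per-document PPLs, so every document contributes equally.
macro_ppl = np.mean([np.exp(nlls.mean()) for nlls in doc_nlls])

print(f"pooled PPL: {pooled_ppl:.3f}, macro-averaged PPL: {macro_ppl:.3f}")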

Possible Implementation

Make multiple '-f' args possible in ./perplexity.

'--macro-average' does macro-averaging of the PPL over documents. The macro-averaged error term is NOT displayed when there is more than one '-f', until further discussion decides the appropriate way to compute it.

Lastly, a nice workaround would be to specify that '-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-' or some such unusual delimiter is treated as the document break within a single input file; a rough sketch follows.
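
A rough sketch of that single-file workaround (the delimiter constant, the corpus.txt filename, and the splitting logic here are placeholders, not anything ./perplexity supports today):

# Split one concatenated corpus file into per-document chunks on an unusual
# sentinel line, so that no perplexity window ever spans a document boundary.
DELIMITER = "-x-x-x-x-"  # placeholder sentinel; the real one would be configurable

with open("corpus.txt") as f:  # hypothetical concatenated input file
    documents = [d.strip() for d in f.read().split(DELIMITER) if d.strip()]

for i, doc in enumerate(documents):
    # each document would then be scored independently
    print(f"document {i}: {len(doc)} characters")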

@turian turian added the enhancement New feature or request label May 3, 2024
@turian
Author

turian commented May 6, 2024

EDIT: Ignore the following, I am having issues because of #7049


I tried to re-implement perplexity.cpp using llama-cpp-python, but so far the Python version gives perplexity values over 10K, whereas perplexity.cpp gives around 5-10 for each batch.

from llama_cpp import Llama
import numpy as np

# Load the GGUF/GGML model
#model_path = "../HuggingFaceModelDownloader/downloads/NousResearch_Hermes-2-Pro-Llama-3-8B-GGUF/Hermes-2-Pro-Llama-3-8B-Q5_K_M.gguf"
#llama = Llama(model_path=model_path, logits_all=True, n_gpu_layers=-1)
model_path = "../HuggingFaceModelDownloader/downloads/MaziyarPanahi_Yi-9B-200K-GGUF/Yi-9B-200K.Q5_K_M.gguf"
llama = Llama(model_path=model_path, logits_all=True, n_batch=512, n_gpu_layers=-1)
# Add flash-attn


# Compute the log-softmax value and probability of the target token
def log_softmax(logits, target_token):
    max_logit = np.max(logits)
    exp_logits = np.exp(logits - max_logit)
    sum_exp_logits = np.sum(exp_logits)
    log_sum_exp_logits = np.log(sum_exp_logits)

    log_softmax_value = logits[target_token] - max_logit - log_sum_exp_logits
    probability = np.exp(log_softmax_value)

    return log_softmax_value, probability

# Method to clear llama context
def clear_llama_context(llama):
    """Clear the llama context and reset the number of tokens."""
    llama.reset()
    llama._ctx.kv_cache_clear()
    llama.input_ids.fill(0)
    llama.scores.fill(0)


# Compute NLL and perplexity for one window of tokens
def compute_nll_and_perplexity(llama, tokens):
    # Prepend BOS so the first token has something to condition on
    input_tokens = [llama.token_bos()] + tokens

    # Split into a prompt half (context only) and a completion half (scored)
    mid_token = len(input_tokens) // 2
    prompt_tokens = input_tokens[:mid_token]
    completion_tokens = input_tokens[mid_token:]

    # Start the decoding on the first half
    llama.eval(prompt_tokens)

    # Old start token
    start_token = llama.n_tokens

    print("old", llama.n_tokens)
    print(llama.scores.shape)

    # Continue decoding on the second half
    llama.eval(completion_tokens)

    assert llama.scores.shape[0] == llama.input_ids.shape[0]

    # New number of tokens
    end_token = llama.n_tokens

    print(llama.scores.shape)
    print(llama.input_ids.shape)
    print(llama.scores[start_token:end_token].shape)
    print(llama.input_ids[start_token:end_token].shape)

    print("new", end_token - start_token)


    # Calculate log-probabilities and NLLs using log_softmax.
    # The logits in scores[i] are the prediction for the token at position
    # i + 1, so the targets are shifted by one relative to the logits.
    probabilities = []
    nlls = []
    for logit, target_token in zip(llama.scores[start_token - 1:end_token - 1], llama.input_ids[start_token:end_token]):
        print(np.array(logit)[target_token], np.max(np.array(logit)), target_token)
        log_softmax_value, probability = log_softmax(np.array(logit), target_token)
        nll = -log_softmax_value

        probabilities.append(probability)
        nlls.append(nll)

    # Print each step
    for i, (prob, nll) in enumerate(zip(probabilities, nlls)):
        print(f"Token {i + 1}: Probability = {prob:.10f}, NLL = {nll:.10f}")

    # Calculate perplexity
    perplexity = np.exp(np.mean(nlls))
    print("perplexity", perplexity)

    return perplexity

# Slide a window over the document and score each chunk separately
def batch_text(llama, title, abstract, chunk_size=512, stride=128):
    # chunk_size should match n_ctx and n_batch
    # Construct the full prompt
    prompt = f"Title: {title}\nAbstract: "

    # Encode the tokens for the prompt and the abstract
    # (BOS is added per chunk in compute_nll_and_perplexity, so skip it here)
    prompt_tokens = llama.tokenize(prompt.encode("utf-8"), add_bos=False)
    abstract_tokens = llama.tokenize(abstract.encode("utf-8"), add_bos=False)

    # For now, score only the abstract tokens; the prompt is tokenized above but not prepended
    input_tokens = abstract_tokens

    num_tokens = len(input_tokens)
    for start in range(0, num_tokens, stride):
        end = min(start + chunk_size, num_tokens)
        if end - start < 2:
            break

        print(title, start, end)
        compute_nll_and_perplexity(llama, input_tokens[start:end])
        clear_llama_context(llama)



# Example list of (title, abstract) tuples
title_abstract_pairs = [
    ("AAVDiff: Experimental Validation of Enhanced Viability and Diversity in Recombinant Adeno-Associated Virus (AAV) Capsids through Diffusion Generation", "Recombinant adeno-associated virus (rAAV) vectors have revolutionized gene therapy, but their broad tropism and suboptimal transduction efficiency limit their clinical applications. To overcome these limitations, researchers have focused on designing and screening capsid libraries to identify improved vectors. However, the large sequence space and limited resources present challenges in identifying viable capsid variants. In this study, we propose an end-to-end diffusion model to generate capsid sequences with enhanced viability. Using publicly available AAV2 data, we generated 38,000 diverse AAV2 viral protein (VP) sequences, and evaluated 8,000 for viral selection. The results attested the superiority of our model compared to traditional methods. Additionally, in the absence of AAV9 capsid data, apart from one wild-type sequence, we used the same model to directly generate a number of viable sequences with up to 9 mutations. we transferred the remaining 30,000 samples to the AAV9 domain. Furthermore, we conducted mutagenesis on AAV9 VP hypervariable regions VI and V, contributing to the continuous improvement of the AAV9 VP sequence. This research represents a significant advancement in the design and functional validation of rAAV vectors, offering innovative solutions to enhance specificity and transduction efficiency in gene therapy applications."),
    ("AAVDiff: Experimental Validation of Enhanced Viability and Diversity in Recombinant Adeno-Associated Virus (AAV) Capsids through Diffusion Generation", open("data/2024-05-01-20:45:31/papers/7bcc-2404.10573v2.txt").read()),
    ("Example Title 1", "This is the abstract of the first example."),
#    ("Example Title 2", "Here is another abstract of a different example.")
]

# Compute perplexity for each title-abstract pair
for title, abstract in title_abstract_pairs:
    #perplexity = compute_nll_and_perplexity(llama, title, abstract)
    batch_text(llama, title, abstract)
