AnglE📐

It is Angle 📐, not Angel 👼.

📘 document: https://angle.readthedocs.io/en/latest/index.html

📢 Train/Infer Powerful Sentence Embeddings with AnglE. This library is from the paper AnglE: Angle-optimized Text Embeddings. It allows training state-of-the-art BERT/LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework that supports inference with a variety of transformer-based sentence embedding models.

✨ Features

Loss:

📐 AnglE loss
⚖ Contrastive loss
📏 CoSENT loss
☕️ Espresso loss (previously known as 2DMSE, detail: README_ESE)

Backbones:

BERT-based models (BERT, RoBERTa, ELECTRA, ALBERT, etc.)
LLM-based models (LLaMA, Mistral, Qwen, etc.)
Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELM, etc.; refer to: https://github.com/WhereIsAI/BiLLM)

Training:

Single-GPU training
Multi-GPU training

More features will be added in the future.

🏆 Achievements

📅 May 16, 2024 | Paper "AnglE: Angle-optimized Text Embeddings" was accepted to the ACL 2024 Main Conference.

📅 Mar 13, 2024 | Paper "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings" was accepted to the NAACL 2024 Main Conference.

📅 Mar 8, 2024 | 🍞 mixedbread's embedding (mixedbread-ai/mxbai-embed-large-v1) achieves SOTA on the MTEB Leaderboard with an average score of 64.68! The model is trained using AnglE. Congrats mixedbread!

📅 Dec 4, 2023 | Our universal sentence embedding WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64! The model is trained using AnglE.

📅 Dec, 2023 | AnglE achieves SOTA performance on the STS Benchmark (Semantic Textual Similarity)!

🤗 Official Pretrained Models

BERT-based models:

🤗 HF	Max Tokens	Pooling Strategy	Scenario
WhereIsAI/UAE-Large-V1	512	cls	English, General-purpose
WhereIsAI/UAE-Code-Large-V1	512	cls	Code Similarity

LLM-based models:

🤗 HF (lora weight)	Backbone	Max Tokens	Prompts	Pooling Strategy	Scenario
SeanLee97/angle-llama-13b-nli	NousResearch/Llama-2-13b-hf	4096	`Prompts.A`	last token	English, Similarity Measurement
SeanLee97/angle-llama-7b-nli-v2	NousResearch/Llama-2-7b-hf	4096	`Prompts.A`	last token	English, Similarity Measurement

🚀 Quick Start

⬇️ Installation

python -m pip install -U angle-emb

⌛ Infer BERT-based Model

With Prompts: You can specify a prompt with prompt=YOUR_PROMPT in the encode method. If you set a prompt, the inputs should be a list of dicts or a single dict with the key text, where text is the placeholder in the prompt for the input text. You can use other placeholder names. We provide a set of predefined prompts in the Prompts class; you can list them via Prompts.list_prompts().

from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity


angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
# For retrieval tasks, we use `Prompts.C` as the prompt for the query when using UAE-Large-V1 (no need to specify a prompt for documents).
# When specifying a prompt, the inputs should be a dict or a list of dicts with the key 'text'.
qv = angle.encode({'text': 'what is the weather?'}, to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
    'The weather is great!',
    'it is rainy today.',
    'i am going to bed'
], to_numpy=True)

for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))
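
The prompt mechanics described above can be sketched as plain string formatting. This is only an illustration under assumptions: the template string and the helper name `apply_prompt` are hypothetical and are not the library's internals; check the real predefined prompts via Prompts.list_prompts().

```python
# Hypothetical template for illustration; real predefined prompts live in angle_emb.Prompts.
EXAMPLE_PROMPT = 'Represent this sentence for searching relevant passages: {text}'


def apply_prompt(prompt, inputs):
    """Fill `prompt` with each input dict; accepts a single dict or a list of dicts."""
    if isinstance(inputs, dict):
        inputs = [inputs]
    # Each dict key must match a placeholder name in the template.
    return [prompt.format(**item) for item in inputs]


filled = apply_prompt(EXAMPLE_PROMPT, {'text': 'what is the weather?'})
print(filled[0])

# Other placeholder names work as long as the dict keys match the template.
custom = apply_prompt('query: {q}', [{'q': 'hello'}, {'q': 'world'}])
print(custom)
```

A dict key that does not appear in the template is ignored by `str.format`, while a placeholder with no matching key raises `KeyError`, which is one reason to keep the key and placeholder names aligned.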

Without Prompts: There is no need to specify a prompt. Just pass a list of strings or a single string.

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity


angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
# for non-retrieval tasks, we don't need to specify prompt when using UAE-Large-V1.
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
])

for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
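
For readers who want to see what the pairwise comparison computes, here is a minimal pure-Python sketch of cosine similarity on toy vectors. The function name `cosine` and the vectors are made up for illustration; the snippet does not call angle_emb and is not a claim about how angle_emb.utils.cosine_similarity is implemented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
vecs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]

# Compare each vector with every later one, so each pair is scored once.
for i, v1 in enumerate(vecs):
    for v2 in vecs[i + 1:]:
        print(round(cosine(v1, v2), 4))
```

Scores close to 1 indicate vectors pointing in nearly the same direction; orthogonal vectors score 0.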

⌛ Infer LLM-based Models

If the pretrained weight is a LoRA-based model, specify the backbone via model_name_or_path and the LoRA path via pretrained_lora_path in the from_pretrained method.

import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

angle = AnglE.from_pretrained('NousResearch/Llama-2-7b-hf',
                              pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
                              pooling_strategy='last',
                              is_llm=True,
                              torch_dtype=torch.float16).cuda()
print('All predefined prompts:', Prompts.list_prompts())
doc_vecs = angle.encode([
    {'text': 'The weather is great!'},
    {'text': 'The weather is very good!'},
    {'text': 'i am going to bed'}
], prompt=Prompts.A)

for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

⌛ Infer BiLLM-based Models

Specify apply_billm and billm_model_class to load BiLLM models and run inference with them.

import os
# set an environment variable for billm start index
os.environ['BiLLM_START_INDEX'] = '31'

import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# specify `apply_billm` and `billm_model_class` to load billm models
angle = AnglE.from_pretrained('NousResearch/Llama-2-7b-hf',
                              pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
                              pooling_strategy='last',
                              is_llm=True,
                              apply_billm=True,
                              billm_model_class='LlamaForCausalLM',
                              torch_dtype=torch.float16).cuda()

doc_vecs = angle.encode([
    {'text': 'The weather is great!'},
    {'text': 'The weather is very good!'},
    {'text': 'i am going to bed'}
], prompt='The representative word for sentence {text} is:"')

for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

⌛ Infer Espresso/Matryoshka Models

Specify layer_index and embedding_size to truncate embeddings.

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity


angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()
# specify layer_index and embedding size to truncate embeddings
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], layer_index=22, embedding_size=768)

for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
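
The embedding_size truncation can be pictured with a small pure-Python sketch: keep the leading dimensions of a vector and rescale the result to unit length. This is an assumption-laden illustration of the general Matryoshka-style idea, not the library's code; the helper `truncate_and_normalize` is hypothetical, the text does not say whether encode renormalizes, and layer_index is not modeled here.

```python
import math


def truncate_and_normalize(vec, embedding_size):
    """Keep the first `embedding_size` dimensions and rescale to unit length."""
    head = list(vec[:embedding_size])
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in head] if norm > 0 else head


full = [3.0, 4.0, 12.0, 0.5]  # toy 4-dimensional embedding
small = truncate_and_normalize(full, 2)
print(small)
```

Shorter vectors reduce storage and speed up similarity search, at the cost of some accuracy; the right size depends on the model and the task.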

⌛ Infer Third-party Models

You can load any transformer-based third-party model, such as mixedbread-ai/mxbai-embed-large-v1, sentence-transformers/all-MiniLM-L6-v2, or BAAI/bge-large-en-v1.5, using angle_emb.

Here is an example:

from angle_emb import AnglE

model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()
vec = model.encode('hello world', to_numpy=True)
print(vec)

🕸️ Custom Train

🗂️ 1. Data Preparation

We currently support three dataset formats:

DatasetFormats.A: it is a pair format with three columns: text1, text2, and label (0/1).
DatasetFormats.B: it is a triple format with three columns: text, positive, and negative. positive and negative store the positive and negative samples of text.
DatasetFormats.C: it is a pair format with two columns: text and positive. positive stores the positive sample of text.

You need to prepare your data as a Hugging Face datasets.Dataset in whichever of these formats matches your supervised data.
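
To make the three layouts concrete, the sketch below shows toy rows for each format along with a small checker. The sentences, labels, and the helper `detect_format` are made up for illustration and are not part of angle_emb; only the column names come from the format descriptions above.

```python
# Toy rows for each format; sentences and labels are invented for illustration.
format_a = [
    {'text1': 'A man plays a guitar.', 'text2': 'Someone is playing guitar.', 'label': 1},
    {'text1': 'A man plays a guitar.', 'text2': 'A cat is sleeping.', 'label': 0},
]
format_b = [
    {'text': 'A man plays a guitar.',
     'positive': 'Someone is playing guitar.',
     'negative': 'A cat is sleeping.'},
]
format_c = [
    {'text': 'A man plays a guitar.', 'positive': 'Someone is playing guitar.'},
]

# Column names required by each format, as listed above.
REQUIRED_COLUMNS = {
    'A': {'text1', 'text2', 'label'},
    'B': {'text', 'positive', 'negative'},
    'C': {'text', 'positive'},
}


def detect_format(rows):
    """Return 'A', 'B', or 'C' when every row has exactly that format's columns."""
    for name, columns in REQUIRED_COLUMNS.items():
        if rows and all(set(row) == columns for row in rows):
            return name
    raise ValueError('rows do not match DatasetFormats.A, B, or C')


for rows in (format_a, format_b, format_c):
    print(detect_format(rows))
```

A list of dicts like these can then be converted with `datasets.Dataset.from_list(format_a)` when the `datasets` package is installed.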

🚂 2. Train with CLI

Use angle-trainer to train your AnglE model in CLI mode.

Single-GPU training:

Usage:

CUDA_VISIBLE_DEVICES=0 angle-trainer --help

Multi-GPU training:

Usage:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 -m angle_emb.angle_trainer --help

🚂 3. Custom Train

from datasets import load_dataset
from angle_emb import AnglE, AngleDataTokenizer


# 1. load pretrained model
angle = AnglE.from_pretrained('SeanLee97/angle-bert-base-uncased-nli-en-v1', max_length=128, pooling_strategy='cls').cuda()

# 2. load dataset
# `text1`, `text2`, and `label` are three required columns.
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {"text1": str(obj["sentence1"]), "text2": str(obj['sentence2']), "label": obj['score']})
ds = ds.select_columns(["text1", "text2", "label"])

# 3. transform data
train_ds = ds['train'].shuffle().map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)
valid_ds = ds['validation'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)
test_ds = ds['test'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)

# 4. fit
angle.fit(
    train_ds=train_ds,
    valid_ds=valid_ds,
    output_dir='ckpts/sts-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    save_steps=100,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=1,
    loss_kwargs={
        'cosine_w': 1.0,
        'ibn_w': 1.0,
        'angle_w': 1.0,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 20
    },
    fp16=True,
    logging_steps=100
)

# 5. evaluate
corrcoef, accuracy = angle.evaluate(test_ds, device=angle.device)
print('corrcoef:', corrcoef)

💡 Others

To enable LLM training, please specify --is_llm 1 and configure appropriate LoRA hyperparameters.
To enable BiLLM training, please specify --apply_billm 1 and configure an appropriate billm_model_class such as LlamaForCausalLM (refer to: https://github.com/WhereIsAI/BiLLM?tab=readme-ov-file#usage).
To enable Espresso sentence embeddings (ESE), please specify --apply_ese 1 and configure appropriate ESE hyperparameters via --ese_kl_temperature (float) and --ese_compression_size (integer).
To convert the trained AnglE models to sentence-transformers, please run python scripts/convert_to_sentence_transformers.py --help for more details.
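
As a rough sketch of how the switches above combine, the helper below assembles an argument list for angle-trainer. It is deliberately incomplete: the required model and data arguments are omitted (see angle-trainer --help), the function `build_trainer_args` is hypothetical, and passing billm_model_class as a `--billm_model_class` flag is an assumption based on the option name in the text.

```python
import shlex


def build_trainer_args(is_llm=False, apply_billm=False, billm_model_class=None,
                       apply_ese=False, ese_kl_temperature=None,
                       ese_compression_size=None):
    """Assemble the optional angle-trainer switches; required args are left out."""
    args = ['angle-trainer']
    if is_llm:
        args += ['--is_llm', '1']
    if apply_billm:
        args += ['--apply_billm', '1']
        if billm_model_class is not None:
            # Assumed flag name, mirroring the option name used in the text.
            args += ['--billm_model_class', billm_model_class]
    if apply_ese:
        args += ['--apply_ese', '1']
        if ese_kl_temperature is not None:
            args += ['--ese_kl_temperature', str(ese_kl_temperature)]
        if ese_compression_size is not None:
            args += ['--ese_compression_size', str(ese_compression_size)]
    return args


cmd = build_trainer_args(is_llm=True, apply_billm=True,
                         billm_model_class='LlamaForCausalLM')
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and value separate, so it can be passed to `subprocess.run` without shell quoting issues once the remaining required arguments are added.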

💡 4. Fine-tuning Tips

1️⃣ If your dataset format is DatasetFormats.A, it is recommended to slightly increase the weight for cosine_w or slightly decrease the weight for ibn_w.

2️⃣ If your dataset format is DatasetFormats.B, it is recommended to set cosine_w to 0 and increase ibn_w to a value such as 10 or 20. Setting angle_tau to 20.0 is recommended.

3️⃣ If your dataset format is DatasetFormats.C, only ibn_w and ibn_tau are effective. You don't need to tune other parameters.

4️⃣ To alleviate information forgetting during fine-tuning, it is better to specify teacher_name_or_path. If teacher_name_or_path equals model_name_or_path, self-distillation is performed. Note that the teacher_name_or_path model must use the same tokenizer as model_name_or_path; otherwise, it will lead to unexpected results.
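
The first three tips can be summarized as a starting point for loss_kwargs per dataset format. The sketch below is only a suggestion under assumptions: the helper `suggest_loss_kwargs` is hypothetical, the baseline mirrors the loss_kwargs from the custom training example, and the value 1.5 for "slightly increase" is an arbitrary illustrative choice rather than a recommended setting.

```python
# Baseline mirrors the `loss_kwargs` used in the custom training example.
BASE_LOSS_KWARGS = {
    'cosine_w': 1.0, 'ibn_w': 1.0, 'angle_w': 1.0,
    'cosine_tau': 20, 'ibn_tau': 20, 'angle_tau': 20,
}


def suggest_loss_kwargs(dataset_format):
    """Return a starting-point loss_kwargs dict for format 'A', 'B', or 'C'."""
    kwargs = dict(BASE_LOSS_KWARGS)
    if dataset_format == 'A':
        # Tip 1: slightly increase cosine_w (1.5 is an assumed example value).
        kwargs['cosine_w'] = 1.5
    elif dataset_format == 'B':
        # Tip 2: drop the cosine term, raise ibn_w, keep angle_tau at 20.0.
        kwargs['cosine_w'] = 0.0
        kwargs['ibn_w'] = 20.0
        kwargs['angle_tau'] = 20.0
    elif dataset_format == 'C':
        # Tip 3: only ibn_w and ibn_tau take effect, so nothing else is tuned.
        pass
    else:
        raise ValueError(f'unknown dataset format: {dataset_format!r}')
    return kwargs


for fmt in ('A', 'B', 'C'):
    print(fmt, suggest_loss_kwargs(fmt))
```

The returned dict has the same keys as the loss_kwargs argument in the training example, so it can serve as an initial configuration to tune from.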

🫡 Citation

You are welcome to use our code and pre-trained models. If you use our code and pre-trained models, please support us by citing our work as follows:

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}

📜 ChangeLogs

📅	Description
2024 May 21	support Espresso Sentence Embeddings
2024 Feb 7	support training with only positive pairs (`DatasetFormats.C`)
2023 Dec 4	Release a universal English sentence embedding model: WhereIsAI/UAE-Large-V1
2023 Nov 2	Release an English pretrained model: `SeanLee97/angle-llama-13b-nli`
2023 Oct 28	Release two Chinese pretrained models: `SeanLee97/angle-roberta-wwm-base-zhnli-v1` and `SeanLee97/angle-llama-7b-zhnli-v1`; Add Chinese README.md

📧 Contact

If you have any questions or suggestions, please feel free to contact us via email: xmlee97@gmail.com

© License

This project is licensed under the MIT License. For the pretrained models, please refer to the corresponding license of the models.
