Attiri: Dataset and an instruction-following large language model for Tamil based on LLaMA and Stanford Alpaca


Attiri, an extension of LLaMA and Stanford Alpaca, aims to build and share an instruction-following language model for the Tamil language. Recent breakthroughs in LLMs such as LLaMA, LaMDA, and GPT-4 have introduced the potential for Artificial General Intelligence (AGI) and attracted widespread attention from industry. However, the high cost of training and deployment has made it difficult to promote transparent and open academic research in the field. In response, this project takes steps to promote open research in the Tamil natural language processing (NLP) community by releasing a Tamil LLaMA model and a Tamil Alpaca model as open-source resources. These models expand the Tamil vocabulary and improve basic semantic understanding through secondary pre-training on Tamil data. Additionally, the project fine-tunes the Tamil LLaMA model on Tamil instruction data, enhancing the model's ability to understand and execute instructions. Note that these resources are intended solely for academic research purposes.

We also release minimum viable model weights to the Hugging Face model hub.

The repository contains:

  • Dataset
  • Code to generate the data
  • Code to fine-tune the LLaMA 7B model

Table of Contents

  1. Preparation
    1. Setup
    2. Dataset
  2. Usage
    1. Translate
    2. Finetuning
  3. Citation
  4. To Contribute
  5. To-Do
  6. Acknowledgments
  7. License
  8. Fun-Fact

Preparation

Setup

To use the program, you need Python 3.9+ (3.9 recommended) and the required packages. You can set them up with Conda and pip as follows:

Create a new Conda environment with Python 3.9:

conda create --name attiri python=3.9

Activate the new environment:

conda activate attiri
pip install -r requirements.txt

Dataset

| S.No | Dataset       | Description                                                                                                                                  | Count | Fields                     |
|------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------|-------|----------------------------|
| 1    | Attiri-Alpaca | Tamil version of the Stanford Alpaca dataset                                                                                                   | 52K   | Instruction, Input, Output |
| 2    | Attiri-Nomic  | Tamil version of the Nomic AI GPT4All dataset                                                                                                  | 500K  | Prompt, Response           |
| 3    | IndicCorp     | A single large text file containing one sentence per line; the publicly released version is randomly shuffled, untokenized, and deduplicated  | 31.5M | Sentences                  |

The script attiri_data.py translates the Alpaca instruction data from one language to another using the Google Translate API. It is built with Click for command-line argument parsing and tqdm for progress tracking, as sketched below.
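
For illustration, here is a minimal sketch of such a translation script. This is a hypothetical reconstruction, not the actual attiri_data.py: it assumes the unofficial googletrans package for the Translate calls and the standard Alpaca record layout (instruction, input, output), and it omits the --dataset switch shown in the usage section.

import json

import click
from googletrans import Translator  # assumption: the real script may use the official Google Cloud Translation API
from tqdm import tqdm

@click.command()
@click.option("--source", "-s", default="en", help="Source language code")
@click.option("--target", "-t", default="ta", help="Target language code")
@click.option("--input", "-i", "input_path", required=True, help="Input JSON file")
@click.option("--output", "-o", "output_path", required=True, help="Output JSON file")
def translate(source, target, input_path, output_path):
    translator = Translator()
    with open(input_path, encoding="utf-8") as f:
        # Alpaca format: a list of {"instruction": ..., "input": ..., "output": ...}
        records = json.load(f)
    for record in tqdm(records, desc=f"{source} -> {target}"):
        for field in ("instruction", "input", "output"):
            if record.get(field):  # many Alpaca records have an empty "input"
                record[field] = translator.translate(record[field], src=source, dest=target).text
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    translate()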

The Attiri-Nomic data is available on request as a CSV file containing each English prompt and response alongside their Tamil translations. To request access: Click Here

Usage

Here are some examples of how to use the program:

Translate data from English to Tamil

Translate data from alpaca_data.json in English to Tamil and save it to output.json:

python attiri_data.py \
--source en \
--target ta \
--dataset alpaca \
--input alpaca_data.json \
--output output.json

Alternatively, -s and -t can be used in place of --source and --target, and -i and -o in place of --input and --output.

Fine-tuning the model for Tamil

The parameters.json file contains the configuration parameters for running the model. Make sure to update the parameters in parameters.json according to your specific use case before running the model.
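
The README does not document the schema of parameters.json, so the snippet below is only an illustrative guess at the kind of values such a configuration usually carries; every key name is an assumption.

# Hypothetical shape of parameters.json, expressed as a Python dict.
# The real schema is undocumented here; all key names are assumptions.
example_params = {
    "base_model": "decapoda-research/llama-7b-hf",  # base checkpoint to fine-tune
    "data_path": "output.json",                     # translated instruction data
    "micro_batch_size": 4,                          # typical training hyperparameters
    "num_epochs": 3,
    "learning_rate": 3e-4,
    "lora_r": 8,                                    # typical LoRA adapter settings
    "lora_alpha": 16,
    "lora_dropout": 0.05,
}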

Now you can fine-tune the model as follows:

import attiri.finetune as ft

# Point these at the base LLaMA checkpoint and the prepared instruction data.
trainer = ft.LlamaTrainer(BASE_MODEL_PATH, DATA_PATH)
trainer.train()
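
For readers curious what LlamaTrainer presumably wraps: projects in the Alpaca-LoRA lineage typically attach low-rank adapters to the base model with Hugging Face peft and train them with the standard transformers Trainer. The sketch below is a generic reconstruction under those assumptions, not the actual attiri.finetune code; the toy dataset is a stand-in for the translated instruction data.

# Generic LoRA fine-tuning sketch in the Alpaca-LoRA style; an assumption
# about what attiri.finetune wraps, not its actual code.
import transformers
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = "decapoda-research/llama-7b-hf"
model = LlamaForCausalLM.from_pretrained(base_model, load_in_8bit=True, device_map="auto")
tokenizer = LlamaTokenizer.from_pretrained(base_model)

# Attach low-rank adapters to the attention projections.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Toy stand-in for the translated instruction data.
data = Dataset.from_list([{"text": "கட்டளை: வணக்கம் சொல்லுங்கள்.\nபதில்: வணக்கம்!"}])
train_dataset = data.map(lambda x: tokenizer(x["text"]), remove_columns=["text"])

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=3e-4,
        output_dir="attiri-lora",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()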

Minimum viable model weights have been released to the Hugging Face model hub; you can find them here. (Note: this is not a fully working model yet. Further models will be released as the project progresses.)

To view a quick demo of the model, please follow the instructions below:

Clone the alpaca-lora repository:

git clone https://github.com/tloen/alpaca-lora.git
cd alpaca-lora
git checkout a48d947

To launch the demo, run the following command:

python generate.py \
  --load_8bit \
  --base_model 'decapoda-research/llama-7b-hf' \
  --lora_weights 'adithya-balaji/attiri-lama' \
  --share_gradio
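
Alternatively, you can query the released adapter directly with transformers and peft. The following is a minimal, hedged sketch under the same base-model assumption as the demo command above, without alpaca-lora's prompt template or Gradio UI:

# Minimal inference sketch for the released LoRA adapter; a hedged example,
# not the project's official demo (that is alpaca-lora's generate.py above).
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = "decapoda-research/llama-7b-hf"
tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(base_model, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, "adithya-balaji/attiri-lama")
model.eval()

prompt = "தமிழ்நாட்டின் தலைநகரம் எது?"  # "What is the capital of Tamil Nadu?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))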

Citation

Please cite this project if you use the dataset, model, or code in this repo. (Note: you should naturally also cite the original LLaMA, Stanford Alpaca, and LoRA papers.)

@misc{Attiri,
  author = {Adithya Balaji},
  title = {Attiri: Dataset and a LLaMA-based instruction-following large language model for Tamil},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/adithyab94/Attiri}},
}

To Contribute

This project is actively looking for collaborators. If you are interested in contributing, please raise a pull request or write to me.

To-Do

  • Translate Alpaca JSON data into Tamil
  • Translate Nomic JSON data into Tamil
  • Clean training data
  • Fine-tune LLaMA using a local GPU
  • Release MVP model
  • Publish model to Hugging Face
  • Demo UI (Hugging Face) [partial: self-hosted app]
  • Fine-tune using a cloud GPU (minimum: 8x A100s, 80 GB memory)
  • Release v1.0 model
  • Fine-tune the 13B, 33B, and 65B models using a cloud GPU (minimum: 8x A100s, 80 GB memory)
  • Publish models to Hugging Face
  • Demo UI (Hugging Face / hosted app)

Future pipeline

Datasets

  • Prepare an organic dataset customized for the Tamil language
  • Prepare an organic dataset customized for Kondunthamizh and Romanized Tamil

Extending to other languages

  • Prepare datasets for other languages
  • Fine-tune to create language models

Extending to other LLMs

  • Fine-tune other LLMs such as PaLM, Flan, and GPT, and compare results

Toxicity and Abuse Detection

  • Prepare a toxicity and abuse detection dataset
  • Fine-tune to create a safe language model

Results

The LLaMA 7B model is not fine-tuned for Tamil; it is fine-tuned for languages with Latin scripts, so the Alpaca model performs poorly on Tamil prompts. [Image: LLaMA response]

ChatGPT, by comparison, performs better, but it still does not generate meaningful responses. [Image: ChatGPT response]

The Attiri model is fine-tuned for Tamil and, even as a pre-release, already performs better than Alpaca. This shows the great potential of a large language model customized for Tamil.

[Image: Attiri response]

Acknowledgments

Thanks to the open-source projects LLaMA, Stanford Alpaca, and Alpaca-LoRA, which inspired this project.

Thanks to the AI4Bharat team for the IndicCorp dataset and to Nomic AI for the GPT4All dataset.

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.

Fun-Fact

The word "Attiri" ("அத்திரி") is used by the poet Ilango in the famous Tamil epic Silappadikaram which acccording to the Tamil dictionary could be a camel, a distant relative of the Llamas and Alpacas.

வான வண்கையன் அத்திரி ஏற
மான் அமர் நோக்கியும் வையம் ஏறிக்
கோடி பல அடுக்கிய கொழிநிதிக் குப்பை..

- கடலாடு காதை, சிலப்பதிகாரம் (Kadaladu Kathai, Silappadikaram)