Bark voice cloning

Please read

This code works on Python 3.10. I have not tested it on other versions, and some older versions will have issues.

Voice cloning with bark in high quality?

It's possible now.

examples_biden_example.mov

How do I clone a voice?

For developers:

Code examples are on the Hugging Face model page.

For everyone:

My cloned voices aren't very convincing. Why are other people's cloned voices better than mine?

Make sure these things are NOT in your voice input (in no particular order):

Noise (you can run a noise remover on the audio first)
Music (there are also music remover tools), unless you want music in the background
A cut-off at the end (this causes the model to try to continue the cut-off speech in the generation)
Under 1 second of training data (I personally suggest around 10 seconds for good potential, but I've had great results with 5 seconds as well)

What makes for good prompt audio? (in no particular order)

Clearly spoken
No weird background noises
Only one speaker
Audio which ends after a sentence ends
A regular/common voice (these usually have more success; the model can still clone complex voices, but it is not as good at it)
Around 10 seconds of data
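
The length and cut-off points in the two lists above can be checked roughly before cloning. The following is a minimal sketch, not part of this repository: check_prompt_audio is a hypothetical helper, the 1 second and 5 second limits come from the advice above, and the "ends in a pause" test (comparing the loudness of the last 100 ms with the whole clip) is an assumed heuristic with an arbitrary threshold. It expects mono samples as floats and cannot detect noise, music, or extra speakers.

```python
import math


def rms(values):
    """Root mean square of a sequence of float samples (0.0 if empty)."""
    if not values:
        return 0.0
    return math.sqrt(sum(v * v for v in values) / len(values))


def check_prompt_audio(samples, sample_rate):
    """Return a list of notes about a mono prompt clip (hypothetical helper)."""
    notes = []
    duration = len(samples) / sample_rate

    # Length advice from the lists above: never under 1 s, ideally around 10 s.
    if duration < 1.0:
        notes.append("too short: under 1 second")
    elif duration < 5.0:
        notes.append("short: 5 to 10 seconds is suggested")

    # Assumed heuristic: if the last 100 ms are still more than half as loud
    # as the whole clip, the audio probably stops mid-speech.
    tail = samples[-max(1, int(0.1 * sample_rate)):]
    overall = rms(samples)
    if overall > 0 and rms(tail) > 0.5 * overall:
        notes.append("possible cut-off: the clip does not end in a pause")

    return notes
```

An empty result only means that these two checks passed, so listening to the clip is still the better test.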

Pretrained models

Official

Name	HuBERT Model	Quantizer Version	Epoch	Language	Dataset
quantifier_hubert_base_ls960.pth	HuBERT Base	0	3	ENG	GitMylo/bark-semantic-training
quantifier_hubert_base_ls960_14.pth	HuBERT Base	0	14	ENG	GitMylo/bark-semantic-training
quantifier_V1_hubert_base_ls960_23.pth	HuBERT Base	1	23	ENG	GitMylo/bark-semantic-training

Community

Author	Name	HuBERT Model	Quantizer Version	Epoch	Language	Dataset
HobisPL	polish-HuBERT-quantizer_8_epoch.pth	HuBERT Base	1	8	POL	Hobis/bark-polish-semantic-wav-training
C0untFloyd	german-HuBERT-quantizer_14_epoch.pth	HuBERT Base	1	14	GER	CountFloyd/bark-german-semantic-wav-training
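
If you pick a checkpoint in code, the two tables above can be mirrored in a small lookup. This is only a sketch: the file names, quantizer versions, epochs and language codes are copied from the tables, while the Quantizer class, the MODELS list, latest_for and the rule "prefer the highest quantizer version, then the highest epoch" are assumptions made for illustration, not something this repository provides or recommends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantizer:
    name: str      # checkpoint file name from the tables above
    version: int   # quantizer version
    epoch: int     # training epoch of the checkpoint
    language: str  # language code as used in the tables


MODELS = [
    Quantizer("quantifier_hubert_base_ls960.pth", 0, 3, "ENG"),
    Quantizer("quantifier_hubert_base_ls960_14.pth", 0, 14, "ENG"),
    Quantizer("quantifier_V1_hubert_base_ls960_23.pth", 1, 23, "ENG"),
    Quantizer("polish-HuBERT-quantizer_8_epoch.pth", 1, 8, "POL"),
    Quantizer("german-HuBERT-quantizer_14_epoch.pth", 1, 14, "GER"),
]


def latest_for(language):
    """Pick a checkpoint for a language (assumed rule: newest version, then epoch)."""
    candidates = [m for m in MODELS if m.language == language]
    if not candidates:
        raise LookupError(f"no quantizer listed for language {language!r}")
    return max(candidates, key=lambda m: (m.version, m.epoch))
```

For example, latest_for("ENG").name gives the version 1, epoch 23 checkpoint from the official table.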

For developers: Implementing voice cloning in your bark projects

Simply copy the files from this directory into your project.
The HuBERTManager contains methods to download HuBERT and the custom quantizer model.
Loading CustomHubert should be pretty straightforward.
The notebook contains code that runs on either CUDA or CPU, instead of on CPU only.

from hubert.pre_kmeans_hubert import CustomHubert
import torchaudio

# Load the HuBERT model.
# With the default config, checkpoint_path should work fine with data/models/hubert/hubert.pt
hubert_model = CustomHubert(checkpoint_path='path/to/checkpoint')

# Load your audio file (with soundfile or torchaudio, for example)
wav, sr = torchaudio.load('path/to/wav')

if wav.shape[0] == 2:  # Stereo to mono if needed
    wav = wav.mean(0, keepdim=True)

# Run the model to extract semantic features from the loaded audio
semantic_vectors = hubert_model.forward(wav, input_sample_hz=sr)

Loading and running the custom kmeans

import torch
from hubert.customtokenizer import CustomTokenizer

# Load the CustomTokenizer model from a checkpoint
# With default config, you can use the pretrained model from huggingface
# With the default setup from HuBERTManager, this will be in data/models/hubert/tokenizer.pth
tokenizer = CustomTokenizer.load_from_checkpoint('data/models/hubert/tokenizer.pth')  # Automatically uses the right layers

# Process the semantic vectors from the previous HuBERT run (This works in batches, so you can send the entire HuBERT output)
semantic_tokens = tokenizer.get_token(semantic_vectors)

# Congratulations! You now have semantic tokens which can be used inside of a speaker prompt file.
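
The last comment above mentions a speaker prompt file. As a rough sketch of that step, assuming Bark's .npz speaker prompt layout with the keys semantic_prompt, coarse_prompt and fine_prompt, the tokens could be saved as shown below. save_speaker_prompt is a hypothetical helper, not part of this repository. It also assumes you already have the EnCodec codes for the same clip as a 2-D array (fine_codes, one row per codebook), which the code above does not produce, and that the coarse prompt is simply the first two codebooks of those codes.

```python
import numpy as np


def save_speaker_prompt(path, semantic_tokens, fine_codes):
    """Write a Bark-style speaker prompt .npz (hypothetical helper).

    semantic_tokens: 1-D sequence of semantic token ids.
    fine_codes: 2-D array of EnCodec codes, shape (codebooks, frames).
    """
    semantic = np.asarray(semantic_tokens, dtype=np.int64).reshape(-1)
    fine = np.asarray(fine_codes, dtype=np.int64)
    if fine.ndim != 2 or fine.shape[0] < 2:
        raise ValueError("fine_codes must be 2-D with at least two codebooks")

    np.savez(
        path,
        semantic_prompt=semantic,
        coarse_prompt=fine[:2, :],  # assumption: coarse = first two codebooks
        fine_prompt=fine,
    )
```

If semantic_tokens is a torch tensor, as returned by the tokenizer above, convert it first, for example with semantic_tokens.cpu().numpy().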

How do I train it myself?

Simply run the training commands.

A simple way to create semantic data and wavs for training is my script, bark-data-gen. Keep in mind that creating the wavs takes about as long as creating the semantics, if not longer, so generating a dataset can take a while.

For example, say you have a dataset of zips containing audio files, one zip for the semantics and one for the wav files, inside a folder called "Literature".

Run process.py --path Literature --mode prepare to extract all the data to one directory.

Run process.py --path Literature --mode prepare2 to create the HuBERT semantic vectors, ready for training.

Run process.py --path Literature --mode train to train.

When your model has trained enough, run process.py --path Literature --mode test to test the latest model.
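
The four steps above could also be chained from one small script. This is a minimal sketch under a few assumptions: it is run from the repository root so that process.py is found, the same Python interpreter is used for every step, and each step exits on its own. build_command and run_pipeline are hypothetical helpers, and only the --path and --mode arguments shown above are used.

```python
import subprocess
import sys

# The modes, in the order described above.
MODES = ("prepare", "prepare2", "train", "test")


def build_command(dataset_path, mode, script="process.py"):
    """Build the command line for one step (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return [sys.executable, script, "--path", dataset_path, "--mode", mode]


def run_pipeline(dataset_path, modes=MODES):
    """Run the steps one after another, stopping at the first failure."""
    for mode in modes:
        subprocess.run(build_command(dataset_path, mode), check=True)


# Example: run_pipeline("Literature")
```

If you decide yourself when the model "has trained enough", run the train and test steps by hand instead, for example with run_pipeline("Literature", modes=("prepare", "prepare2")) for the preparation only.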

Disclaimer

I am not responsible for audio generated using semantics created by this model. Just don't use it for illegal purposes.

gitmylo/bark-voice-cloning-HuBERT-quantizer