
Still no 512 or 768 pre-trained model? #88

Open
FurkanGozukara opened this issue Sep 20, 2023 · 21 comments

Comments

@FurkanGozukara

256 is working, but the resolution is just too low.

relative_find_best_frame_true_square_aspect_ratio_vox.mp4
relative_find_best_frame_false_org_aspect_ratio_vox.mp4
@Qia98

Qia98 commented Sep 26, 2023

I'm trying to train at 512 or higher resolution, but I've run into some challenges getting 512 datasets.

@FurkanGozukara
Author

I'm trying to train at 512 or higher resolution, but I've run into some challenges getting 512 datasets.

Do you have a model I can test?

@EdgarMaucourant

Hey @nieweiqiang ,

You could use the Talking Head 1KH dataset; it is good for faces and has a lot of videos at 512 or 768 (along with many other resolutions, so you might want to use a script to resize the videos).

@FurkanGozukara
Author

Hey @nieweiqiang ,

You could use the Talking Head 1KH dataset; it is good for faces and has a lot of videos at 512 or 768 (along with many other resolutions, so you might want to use a script to resize the videos).

Do you have any pre-trained model I can test? Or any tutorial on how to train it ourselves? I can get GPU power.

@EdgarMaucourant

EdgarMaucourant commented Oct 2, 2023

I don't have a pre-trained model (yet) as it is still training, and I can't give you a full tutorial since I'm not the author and did not go into all the details, but I can describe what I did to get the training going.

First of all you need a dataset to train on, I used this one: https://github.com/tcwang0509/TalkingHead-1KH

The repo only includes the scripts, but it is quite easy to use. Just a few things to note:

  • You need around 2 TB of free disk space to download the full dataset. The scripts will scrape a bunch of videos from YouTube, then crop the videos to the face and extract the interesting parts into smaller clips. At the end you will have around 500K videos ranging from 10 to 800 frames.
  • The scripts in this repo are meant to be used in a Linux environment. I tried to convert the bash scripts into BAT scripts to use on Windows, but as I was short on time I abandoned that idea and ended up using WSL2 (Linux on Windows). So either write BAT scripts yourself, or install WSL2 and use your Windows partitions (automatically mounted into the Linux distro) as storage. For WSL2 see https://learn.microsoft.com/en-us/windows/wsl/install
  • The videos cropped from the originals don't all have the same dimensions (though they seem to be square), so you will have to resize them or exclude the resolutions you don't want to use. I trained for a 512x512 result, so I resized them to that size using the training script of Thin Plate (see below). You can resize them before training if you have a script/software to do it; I went for the easy path and used the training script instead (a minimal resize sketch is shown after this list).
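
If you prefer to resize the clips up front rather than letting the training script do it, a minimal sketch is below. This is not from the thread or the repo; it assumes ffmpeg is installed and the folder paths are placeholders you would adapt to the TalkingHead-1KH layout.

```python
# Hypothetical helper (not part of TalkingHead-1KH or Thin Plate): resize every
# cropped clip to 512x512 with ffmpeg before training. Assumes ffmpeg is on PATH.
import subprocess
from pathlib import Path

SRC = Path("train/cropped_clips")   # placeholder: folder with the cropped mp4 clips
DST = Path("train_512")             # placeholder: output folder for resized clips
DST.mkdir(parents=True, exist_ok=True)

for clip in SRC.glob("*.mp4"):
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip),
         "-vf", "scale=512:512",    # force 512x512; the cropped clips are roughly square
         str(DST / clip.name)],
        check=True,
    )
```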

Once you have your dataset, don't try to extract the frames from the videos. I tried that and you would need more than 10 TB of storage (over 50 million frames extracted). Even if you have the time and the storage, despite what the Thin Plate documentation says, the scripts are meant to ingest mp4 videos, not frames (some parts seem to handle frames, but the main script only looks for mp4 files).

The script expects a hierarchy of folders as input that is not the same as the Talking Head (TH) dataset. So you will have to create a new folder (call it whatever you like; this will be your source folder) with two subfolders: train and test. Copy (or move) the content of the train/cropped_clips folder from TH to the train folder, and the content of the val/cropped_clips folder from TH to the test folder.
Also, the scripts seem to generate a bunch of invalid videos that make the training fail, so I just removed all files under 20KB in size and that solved it (around 15,000 videos removed); a small cleanup sketch follows.
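
A minimal sketch of that cleanup step, not from the repo; the source folder path is a placeholder and 20KB is just the threshold mentioned above:

```python
# Hypothetical cleanup helper (not from the repo): delete the tiny/invalid clips
# that were making training fail. 20KB is the threshold used in the comment above.
from pathlib import Path

SOURCE = Path("my_dataset")   # placeholder: the source folder containing train/ and test/
MIN_BYTES = 20 * 1024

removed = 0
for clip in SOURCE.rglob("*.mp4"):
    if clip.stat().st_size < MIN_BYTES:
        clip.unlink()         # remove the invalid clip
        removed += 1
print(f"Removed {removed} clips under 20KB")
```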

The hardest part is knowing what to put in the YAML config file.
First of all, in the config folder, copy/paste one of the existing config files. I used vox-256.yaml as it was the closest to my dataset (talking faces). In the file I made the following changes (a sketch of the edited values is shown after the list):

  • In dataset_params:
    • Change root_dir to the path of the source folder you created before (make sure you use the source folder path, not train or test).
  • In train_params:
    • Change num_epochs to 2 (100 is large; you want to test on a small number of epochs first and raise the number if needed).
    • Change num_repeats to 2 (the dataset already has a very large number of videos as inputs). This setting repeats training on the same videos multiple times, in this case twice.
    • Change epoch_milestones to [1,2] because you only have 2 epochs.
    • Change batch_size to 5. This number is difficult to estimate and depends entirely on your GPU memory. If it is too large the training will fail quite quickly (within 2 to 5 minutes) and the message will clearly state that torch tried to allocate more memory than available; if that happens, lower the number until it passes. 5 works fine on my RTX 3090 with 24GB of VRAM.
    • Change dataloader_workers to 0. This should not be necessary; I lowered it while trying to solve the GPU memory issue above and forgot to set it back to 12, so feel free to keep it at 12.
    • Change checkpoint_freq to 1 (because you don't have many epochs).
    • Change bg_start to 1.
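
As an illustration only, here is what the edited sections of a copied vox-256.yaml could look like. The keys and values are just the edits listed above; the root_dir path is a placeholder, and any key not shown keeps its original value from the repo's config.

```yaml
dataset_params:
  root_dir: /path/to/my_dataset   # placeholder: source folder containing train/ and test/

train_params:
  num_epochs: 2
  num_repeats: 2
  epoch_milestones: [1, 2]
  batch_size: 5                   # lower this if torch reports an out-of-memory error
  dataloader_workers: 0           # 12 should also be fine
  checkpoint_freq: 1
  bg_start: 1
```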

The last change: you want to train at a specific size (512x512 in my case), so you have to make sure the videos are resized to that size. From what I can read in the script, you should be able to do that by setting the frame_shape setting in your YAML file (under the dataset_params section). However, I did not find the correct format for that setting; in the script it is defined as (256,256,3), but that value did not work when I put it in the YAML file.
So I went the easy way and hardcoded the value in the script directly. You can do this by replacing line 70 in frames_dataset.py from self.frame_shape = frame_shape to self.frame_shape = (512,512,3).
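
For clarity, the hardcoded workaround described above would look like this in frames_dataset.py (the exact line number may differ between versions of the repo):

```python
# frames_dataset.py, around line 70: hardcoded workaround from the comment above.
# Original line: self.frame_shape = frame_shape
self.frame_shape = (512, 512, 3)  # force frames to be loaded/resized at 512x512
```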

Then you should be good to go! Just run the run.py script in the folder, passing in your config file, and voilà!

Note that I'm not an expert, and I'm still trying to get a trained model, so I don't guarantee these are the best steps; they are only what I did to get it working so far.

@FurkanGozukara
Author

FurkanGozukara commented Oct 2, 2023

@EdgarMaucourant awesome, man, thanks for such a detailed explanation.

If you already have checkpoints, can you send me the latest one?

If you don't want to share it publicly, you can email me: furkangozukara@gmail.com

@EdgarMaucourant

Hey @FurkanGozukara,

I will share them when I have them, but for now it is still training... It will probably take several more days to train as the dataset is large.

@FurkanGozukara
Author

Hey @FurkanGozukara,

I will share them when I have them, but for now it is still training... It will probably take several more days to train as the dataset is large.

Awesome, looking forward to it. You are doing an amazing job.

@skyler14

skyler14 commented Oct 7, 2023

Anything peculiar come up while training at higher resolutions? I'm going to follow this.

@ak01user

ak01user commented Oct 7, 2023

@EdgarMaucourant How is the model training going? I trained for more than ten hours, 200 epochs, image size 384*384, but the results are not very good. I plan to continue training.

@EdgarMaucourant

Actually the training failed after 65 hours without any output :'(
I did not have time to relaunch it until now, so I started with a much smaller dataset to see how it goes.

@FurkanGozukara
Author

Actually the training failed after 65 hours without any output :'( I did not have time to relaunch it until now, so I started with a much smaller dataset to see how it goes.

sad

Looking forward to results

@Qia98

Qia98 commented Oct 8, 2023

(quoting EdgarMaucourant's training walkthrough above)

I changed the same things for 512 training. The dataset I used is VoxCeleb2. I resized the dataset to 512 and converted the mp4 videos to png frames; it takes about 11TB (and that is only a part of the dataset). If I use mp4 for training, it costs about 10 hours per epoch, but in png format it costs about 1 hour per epoch, so about 3 days in total (a sketch of the mp4-to-png step is shown below).
The config of my training is:
num_epochs: 100
num_repeats: 200 (the dataset is only a part, so I increased num_repeats)
batch_size: 8
Other parameters are the same as vox-256.
Also, in frames_dataset.py I changed the image size by hardcoding it.
But I didn't get a good checkpoint out of it.
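
For reference only, a minimal sketch of that mp4-to-png step is below. It is not from the thread or the repo; it assumes ffmpeg is installed, the paths are placeholders, and the per-frame file naming is just an example.

```python
# Hypothetical frame-extraction helper (not from the repo): dump each mp4 clip into a
# folder of png frames, already resized to 512x512. Assumes ffmpeg is on PATH.
import subprocess
from pathlib import Path

SRC = Path("train")        # placeholder: folder of mp4 clips
DST = Path("train_png")    # placeholder: output root, one subfolder per clip

for clip in SRC.glob("*.mp4"):
    out_dir = DST / clip.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip),
         "-vf", "scale=512:512",
         str(out_dir / "%07d.png")],   # one numbered png per frame
        check=True,
    )
```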

@Qia98

Qia98 commented Oct 8, 2023

Actually the training failed after 65 hours without any output :'( I did not have time to relaunch it until now, so I started with a much smaller dataset to see how it goes.

Can I see your log.txt? My training is normal; the loss is stable and converging:
from perceptual - 99.74809; equivariance_value - 0.39179; warp_loss - 5.25956; bg - 0.25512
to perceptual - 68.15993; equivariance_value - 0.15263; warp_loss - 0.67301; bg - 0.03551

@Qia98

Qia98 commented Oct 9, 2023

image-20231009 (attached screenshot): When training the 512 model, I noticed that the visualized picture appears to have been cropped.

Has anyone ever encountered this problem? I want to know whether there's something wrong with my frames_dataset.py or the dataset format.

@EdgarMaucourant

Hi @nieweiqiang ,

Probably the code that generates that visualization is hardcoded to 256x256; I did not look at the code, but I would suspect that.

On my end I'm giving up. Sorry guys, I was doing this in my spare time, and whatever I tried fails at some point because I'm short on memory or space on my computer (32 GB of RAM is not enough, I think, or maybe it is the GPU RAM). I tried reducing the number of repeats and the number of items in the dataset, but whatever I do it fails at some point, and I don't have the time to look into this any further.

I hope that what I shared above for the yaml file was insightful, and I wish you all the best in training a model!

@FurkanGozukara
Author


so sad to hear :(

@thhung

thhung commented Oct 12, 2023

@FurkanGozukara Do you plan to continue the work of @EdgarMaucourant ?

@FurkanGozukara
Author

@FurkanGozukara Do you plan to continue the work of @EdgarMaucourant ?

I have no idea right now how to prepare the dataset and start training.

@ak01user

image-20231009 (attached screenshot): When training the 512 model, I noticed that the visualized picture appears to have been cropped.
Has anyone ever encountered this problem? I want to know whether there's something wrong with my frames_dataset.py or the dataset format.

This phenomenon occurs when I interrupt the program during saving.

@huangxin168

Qia98 commented Oct 9, 2023:

Have you solved the problem? I also want to train a 512 model.
