Step-by-Step

This document describes the end-to-end workflow for text-to-image generative AI models on the Neural Engine backend.

Supported text-to-image generative AI models:

  1. CompVis/stable-diffusion-v1-4
  2. runwayml/stable-diffusion-v1-5
  3. stabilityai/stable-diffusion-2-1
  4. instruction-tuning-sd

The inference and accuracy of the above pretrained models have been verified with the default configurations.

Prerequisite

Prepare Python Environment

Create a Python environment, optionally with autoconf for jemalloc support:

conda create -n <env name> python=3.10 [autoconf]
conda activate <env name>

Note: Make sure the pip version is <= 23.2.2.

Check that the GCC version is higher than 9.0:

gcc -v

Install Intel® Extension for Transformers; please refer to the installation guide.

# Install from pypi
pip install intel-extension-for-transformers

# Or, install from source code
cd <intel_extension_for_transformers_folder>
pip install -r requirements.txt
pip install -v .
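To quickly confirm the package is importable after either install path (a simple sanity check, not an official verification step):

# Verify that the package imports without errors.
python -c "import intel_extension_for_transformers"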

Install the required dependencies for this example:

cd <intel_extension_for_transformers_folder>/examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion

pip install -r requirements.txt
pip install transformers==4.34.1
pip install diffusers==0.12.1

Note: Please use a transformers version no higher than 4.34.1.

Environment Variables (Optional)

# Preloading libjemalloc.so may improve performance for multi-instance inference.
conda install jemalloc==5.2.1 -c conda-forge -y
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libjemalloc.so

# Weight sharing can save memory and may improve performance when running multiple instances.
export WEIGHT_SHARING=1
export INST_NUM=<inst num>

Note: This step is optional.
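As a minimal sketch of how these variables might be combined for a multi-instance run (the instance count, core ranges, and use of numactl are assumptions for a 2-socket machine):

# A minimal sketch, assuming 2 instances pinned one per NUMA node (core ranges are illustrative).
export WEIGHT_SHARING=1
export INST_NUM=2
numactl -m 0 -C 0-55 python run_executor.py --ir_path=./fp32_ir --mode=latency --input_model=CompVis/stable-diffusion-v1-4 &
numactl -m 1 -C 56-111 python run_executor.py --ir_path=./fp32_ir --mode=latency --input_model=CompVis/stable-diffusion-v1-4 &
wait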

End-to-End Workflow

1. Prepare Models

Stable Diffusion mainly consists of three sub-models:

  1. Text Encoder
  2. UNet
  3. VAE Decoder

Here we take CompVis/stable-diffusion-v1-4 as an example.
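For reference, these sub-models can be inspected directly on the Hugging Face diffusers pipeline (a minimal sketch using the diffusers attribute names):

from diffusers import StableDiffusionPipeline

# Load the pipeline; the three sub-models are exposed as attributes.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
print(type(pipe.text_encoder).__name__)  # CLIPTextModel
print(type(pipe.unet).__name__)          # UNet2DConditionModel
print(type(pipe.vae).__name__)           # AutoencoderKL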

1.1 Download Models

Export FP32 ONNX models from the Hugging Face diffusers module with the following command:

python prepare_model.py --input_model=CompVis/stable-diffusion-v1-4 --output_path=./model

Set --bf16 to export both FP32 and BF16 models:

python prepare_model.py --input_model=CompVis/stable-diffusion-v1-4 --output_path=./model --bf16

For the INT8 quantized mode, we only support runwayml/stable-diffusion-v1-5 for now. You first need to obtain a quantized INT8 model through QAT; please refer to the link. Then set --qat_int8 to export the INT8 model:

python prepare_model.py --input_model=runwayml/stable-diffusion-v1-5 --output_path=./model --qat_int8
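After export, you can sanity-check that the ONNX files were written (the exact directory layout under ./model depends on prepare_model.py, so the pattern below is an assumption):

# List every exported ONNX file under the output directory.
find ./model -name "*.onnx"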

1.2 Compile Models

Export the three FP32 ONNX sub-models of Stable Diffusion to Neural Engine IR.

# Run the following bash command to get all IRs.
bash export_model.sh --input_model=model --precision=fp32

Export the three BF16 ONNX sub-models of Stable Diffusion to Neural Engine IR.

# Run the following bash command to get all IRs.
bash export_model.sh --input_model=model --precision=bf16

Export mixed FP32 & dynamically quantized INT8 IR.

bash export_model.sh --input_model=model --precision=fp32 --cast_type=dynamic_int8

Export mixed BF16 & QAT-quantized INT8 IR.

bash export_model.sh --input_model=model --precision=qat_int8
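Under the hood, export_model.sh compiles each ONNX sub-model with the Neural Engine Python API. A rough sketch of doing this for a single sub-model is shown below; the import path and save behavior are assumptions and have varied across releases:

# A minimal sketch, assuming this import path (it has moved between releases).
from intel_extension_for_transformers.backends.neural_engine.compile import compile

# Compile one exported ONNX sub-model into a Neural Engine graph (paths are illustrative).
graph = compile("./model/text_encoder/model.onnx")
graph.save("./fp32_ir/text_encoder")  # typically writes conf.yaml + model.bin (assumption)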

2. Performance

Python API commands as follows:

# FP32 IR
python run_executor.py --ir_path=./fp32_ir --mode=latency --input_model=CompVis/stable-diffusion-v1-4

# Mixed FP32 & dynamically quantized INT8 IR
python run_executor.py --ir_path=./fp32_dynamic_int8_ir --mode=latency --input_model=CompVis/stable-diffusion-v1-4

# BF16 IR
python run_executor.py --ir_path=./bf16_ir --mode=latency --input_model=CompVis/stable-diffusion-v1-4

# QAT INT8 IR
python run_executor.py --ir_path=./qat_int8_ir --mode=latency --input_model=runwayml/stable-diffusion-v1-5
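For a rough PyTorch baseline to compare against the engine latency, something like the following sketch can be used (this is not what run_executor.py does internally; the prompt matches the one used in the validated results below):

import time
from diffusers import StableDiffusionPipeline

# Build the FP32 PyTorch pipeline as a reference point.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

prompt = "a photo of an astronaut riding a horse on mars"
start = time.time()
image = pipe(prompt).images[0]
print(f"end-to-end latency: {time.time() - start:.2f} s")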

3. Accuracy

The Fréchet Inception Distance (FID) metric is used to evaluate accuracy. In this case, we check the FID score between the image generated by PyTorch and the image generated by the engine.

Set --accuracy to check the FID score. Python API commands as follows:

# FP32 IR
python run_executor.py --ir_path=./fp32_ir --mode=accuracy --input_model=CompVis/stable-diffusion-v1-4

# Mixed FP32 & dynamically quantized INT8 IR
python run_executor.py --ir_path=./fp32_dynamic_int8_ir --mode=accuracy --input_model=CompVis/stable-diffusion-v1-4

# BF16 IR
python run_executor.py --ir_path=./bf16_ir --mode=accuracy --input_model=CompVis/stable-diffusion-v1-4

# QAT INT8 IR
python run_executor.py --ir_path=./qat_int8_ir --mode=accuracy --input_model=runwayml/stable-diffusion-v1-5
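If you want to compute an FID score yourself between two folders of images (e.g., PyTorch outputs vs. engine outputs), here is a sketch using torchmetrics; torchmetrics is an extra dependency, it is not necessarily the implementation run_executor.py uses, and the folder names are assumptions:

import numpy as np
import torch
from pathlib import Path
from PIL import Image
from torchmetrics.image.fid import FrechetInceptionDistance

def load_images(folder):
    # Stack the images into a uint8 NCHW tensor, as FrechetInceptionDistance expects.
    imgs = [np.array(Image.open(p).convert("RGB")) for p in sorted(Path(folder).glob("*.png"))]
    return torch.from_numpy(np.stack(imgs)).permute(0, 3, 1, 2)

fid = FrechetInceptionDistance(feature=2048)
fid.update(load_images("./pytorch_images"), real=True)   # reference images (assumed folder)
fid.update(load_images("./engine_images"), real=False)   # engine images (assumed folder)
print(f"FID: {fid.compute().item():.2f}")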

4. Try Text to Image

4.1 Text2Img

Try using one sentence to create a picture!

# Running FP32 models or BF16 models, just load a different IR.
# FP32 models

python run_executor.py --ir_path=./fp32_ir --input_model=CompVis/stable-diffusion-v1-4

[Example generated image (FP32)]

# BF16 models
python run_executor.py --ir_path=./bf16_ir --input_model=CompVis/stable-diffusion-v1-4

[Example generated image (BF16)]
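To try your own sentence, pass it via --prompts (the same flag used by the Img2Img example below; the prompt here is just an example):

python run_executor.py --ir_path=./fp32_ir --input_model=CompVis/stable-diffusion-v1-4 --prompts="a photo of an astronaut riding a horse on mars"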

4.2 Img2Img: instruction-tuning-sd

Try using an image and a prompt to create a new picture!

# Running FP32 models or BF16 models, just load a different IR.
# Note:
# 1. Use --image to set the path of your image; here we use the default download link.
# 2. The default image is "https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png".
# 3. The default prompt is "Cartoonize the following image".
# BF16 models
python run_executor.py --ir_path=./bf16_ir --input_model=instruction-tuning-sd/cartoonizer --pipeline=instruction-tuning-sd --prompts="Cartoonize the following image" --steps=100

Original image:

[Original input image]

Cartoonized image:

[Cartoonized output image]
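To cartoonize your own picture, pass a local path via --image (per the note above; the filename is illustrative):

python run_executor.py --ir_path=./bf16_ir --input_model=instruction-tuning-sd/cartoonizer --pipeline=instruction-tuning-sd --prompts="Cartoonize the following image" --image=./my_photo.png --steps=100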

5. Validated Result

5.1 Latency (s)

Input: a photo of an astronaut riding a horse on mars

Batch Size: 1

Model                           FP32      BF16
CompVis/stable-diffusion-v1-4   10.33 s   3.02 s

Note: Performance results were tested on 06/09/2023 with Intel(R) Xeon(R) Platinum 8480+. Performance varies by use, configuration, and other factors. See the platform configuration below for configuration details. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

5.2 Platform Configuration

Manufacturer          Quanta Cloud Technology Inc
Product Name          QuantaGrid D54Q-2U
OS                    CentOS Stream 8
Kernel                5.16.0-rc1-intel-next-00543-g5867b0a2a125
Microcode             0x2b000111
IRQ Balance           Enabled
CPU Model             Intel(R) Xeon(R) Platinum 8480+
Base Frequency        2.0 GHz
Maximum Frequency     3.8 GHz
CPU(s)                224
Thread(s) per Core    2
Core(s) per Socket    56
Socket(s)             2
NUMA Node(s)          2
Turbo                 Enabled
Frequency Governor    Performance