diff --git a/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/README.md b/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..9e3861d006534736dee2b30792e3558251ad14b8
--- /dev/null
+++ b/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/README.md
@@ -0,0 +1,528 @@
# MindIE SD

## 1. Introduction
MindIE SD is the visual-generation inference suite of MindIE. Its goal is to provide an end-to-end solution for running Stable Diffusion (SD) family models on Ascend hardware and its software stack. The functional modules are integrated inside the suite and exposed through a unified programming interface.

## 2. Installing dependencies

MindIE SD depends on the driver package, the firmware package, the CANN development toolkit, and the MindIE inference engine. Install these dependencies before using MindIE SD.

| Component | Package name | Default install path | Version constraint |
| --------------- | --------------------------------------------------------------------------- | --------------------------------------- | ----------------------------------- |
| driver | Driver package for the Ascend 310P processor: Ascend-hdk-310p-npu-driver_\{version\}\_{os}\-{arch}.run | /usr/local/Ascend | 24.0.rc1 or later |
| firmware | Firmware package for the Ascend 310P processor: Ascend-hdk-310p-npu-firmware_\{version\}.run | /usr/local/Ascend | 24.0.rc1 or later |
| CANN toolkit | Ascend-cann-toolkit\_{version}_linux-{arch}.run | /usr/local/Ascend/ascend-toolkit/latest | 8.0.RC1 or later |
| MindIE | Ascend-mindie\_\{version}_linux-\{arch}.run | /usr/local/Ascend/mindie/latest | must match the mindietorch version exactly |
| torch | Python wheel: torch-{version}-cp310-cp310-{os}_{arch}.whl | - | Python 3.10.x, torch 2.1.0 |

- {version} is the package version
- {os} is the operating system, e.g. linux
- {arch} is the CPU architecture, e.g. x86_64

### 2.1 Install the driver and firmware

1. Download links:
- [800I A2](https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=32&cann=8.0.RC1.beta1&driver=1.0.RC1.alpha)
- [Duo card](https://www.hiascend.com/hardware/firmware-drivers/community?product=2&model=17&cann=8.0.RC2.alpha002&driver=1.0.22.alpha)
2. [Installation guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/softwareinst/instg/instg_0019.html)

### 2.2 Download the CANN toolkit, kernel package and MindIE package
1. Download links:
- [800I A2](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32)
- [Duo card](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=2&model=17)
2. [Environment preparation guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/softwareinst/instg/instg_0001.html)
3. Quick install:
- CANN toolkit + kernel package
```commandline
# Make the packages executable. {version} is the package version, {arch} is the CPU architecture, {soc} is the Ascend AI processor version.
chmod +x ./Ascend-cann-toolkit_{version}_linux-{arch}.run
chmod +x ./Ascend-cann-kernels-{soc}_{version}_linux.run
# Verify the consistency and integrity of the packages
./Ascend-cann-toolkit_{version}_linux-{arch}.run --check
./Ascend-cann-kernels-{soc}_{version}_linux.run --check
# Install
./Ascend-cann-toolkit_{version}_linux-{arch}.run --install
./Ascend-cann-kernels-{soc}_{version}_linux.run --install

# Set the environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
- MindIE package
```commandline
# Make the package executable. {version} is the package version, {arch} is the CPU architecture.
chmod +x ./Ascend-mindie_${version}_linux-${arch}.run
./Ascend-mindie_${version}_linux-${arch}.run --check

# Option 1: install to the default path
./Ascend-mindie_${version}_linux-${arch}.run --install
# Set the environment variables
cd /usr/local/Ascend/mindie && source set_env.sh

# Option 2: install to a custom path
./Ascend-mindie_${version}_linux-${arch}.run --install-path=${AieInstallPath}
# Set the environment variables
cd ${AieInstallPath}/mindie && source set_env.sh
```

- MindIE SD does not need to be installed separately; it is installed automatically together with MindIE.
- torch_npu installation:
download pytorch_v{pytorchversion}_py{pythonversion}.tar.gz
```commandline
tar -xzvf pytorch_v{pytorchversion}_py{pythonversion}.tar.gz
# The archive contains the wheel package
pip install torch_npu-{pytorchversion}.xxxx.{arch}.whl
```

### 2.3 PyTorch framework (supported version: 2.1.0)
[Download the wheel](https://download.pytorch.org/whl/cpu/torch/)

Install with pip:
```shell
# {version} is the package version, {arch} is the CPU architecture.
pip install torch-${version}-cp310-cp310-linux_${arch}.whl
```

### 2.4 Install dependency libraries
Install the Python dependencies of MindIE SD.
```
pip install -r requirements.txt
```
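
After the dependencies are installed, it is worth confirming that PyTorch can see the NPU and that the MindIE SD Python package imports cleanly. The snippet below is a minimal sanity-check sketch; the package name `mindiesd` is the one used by the inference scripts in this directory, and the exact printout depends on your environment.

```python
# check_env.py -- minimal environment sanity check (sketch)
import torch
import torch_npu  # registers the Ascend NPU backend with PyTorch

print("torch version:", torch.__version__)
print("NPU available:", torch.npu.is_available())
print("NPU device count:", torch.npu.device_count())

import mindiesd  # installed together with MindIE; should import without error

# run a tiny computation on the NPU
x = torch.randn(2, 3, device="npu")
print((x @ x.T).cpu())
```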
## 3. Open-Sora-Plan 1.3

### 3.1 Weights and configuration files

1. text_encoder and tokenizer:
- configuration and weight files
```shell
https://huggingface.co/google/mt5-xxl/tree/main
```
2. transformer:
- configuration and weight files
```shell
https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/any93x640x640
```
3. VAE:
- configuration and weight files
```shell
https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main
```

### 3.2 Run the inference script
```shell
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node 4 --master_port 29516 \
    inference_opensoraplan13.py \
    --model_path /path/to/transformer/ \
    --num_frames 93 \
    --height 640 \
    --width 640 \
    --text_encoder_name_1 "/path/to/text/encoder" \
    --text_prompt prompt.txt \
    --ae WFVAEModel_D8_4x8x8 \
    --ae_path "/path/to/vae" \
    --save_img_path "./video/save/path" \
    --fps 24 \
    --guidance_scale 7.5 \
    --num_sampling_steps 100 \
    --max_sequence_length 512 \
    --seed 1234 \
    --num_samples_per_prompt 1 \
    --rescale_betas_zero_snr \
    --prediction_type "v_prediction" \
    --save_memory \
    --sp
```
The flags are described below; a condensed Python equivalent of the script follows the list.

- `ASCEND_RT_VISIBLE_DEVICES`: selects which NPUs are used for the computation
- `--nproc_per_node`: total number of NPU cards used for the computation
- `--model_path`: path to the transformer (DiT) weights and configuration; the directory contains the config file and the weight files
- `--num_frames`: total number of frames to generate
- `--height`: output height in pixels
- `--width`: output width in pixels
- `--text_encoder_name_1`: path to the text encoder weights and configuration
- `--text_prompt`: input text prompt; either a .txt file or a prompt string
- `--ae`: the VAE variant, i.e. the compression configuration the VAE applies to the video
- `--ae_path`: path to the VAE weights and configuration
- `--fps`: frame rate of the saved video
- `--guidance_scale`: classifier-free guidance scale, controlling how strongly the negative prompt influences generation
- `--num_sampling_steps`: number of sampling steps
- `--max_sequence_length`: maximum prompt length, default 512
- `--num_samples_per_prompt`: number of samples generated per prompt, default 1
- `--rescale_betas_zero_snr`: scheduler configuration option
- `--prediction_type`: scheduler configuration option
- `--save_memory`: minimize memory use when running the VAE; enable it when generating large videos
- `--sp`: enable sequence parallelism
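
For reference, the sketch below condenses what `inference_opensoraplan13.py` (included in this directory) does with those flags on a single card. Distributed setup is reduced to a plain `init_parallel_env(False)` call, the prompt is an example, and the paths are placeholders you need to replace.

```python
import torch
import torch_npu
import imageio

from transformers import AutoTokenizer, MT5EncoderModel
from mindiesd.pipeline.open_soar_plan_pipeline import OpenSoraPlanPipeline13
from mindiesd.schedulers.scheduling_euler_ancestral_discrete import EulerAncestralDiscreteScheduler
from mindiesd.models.t2vdit import OpenSoraT2Vv1_3
from mindiesd.models.wfvae import WFVAEModelWrapper, ae_stride_config
from mindiesd.models.parallel_mgr import init_parallel_env
from mindiesd.utils import set_random_seed

init_parallel_env(False)   # no sequence parallelism in this single-card sketch
set_random_seed(1234)      # --seed
dtype = torch.float16      # --dtype fp16

# --ae_path / --ae
vae = WFVAEModelWrapper.from_pretrained("/path/to/vae", dtype=torch.float16).to("npu").eval()
vae.vae_scale_factor = ae_stride_config["WFVAEModel_D8_4x8x8"]
# --model_path
transformer = OpenSoraT2Vv1_3.from_pretrained("/path/to/transformer/").to(dtype).to("npu").eval()
# --text_encoder_name_1
text_encoder = MT5EncoderModel.from_pretrained("/path/to/text/encoder", torch_dtype=dtype).eval().to("npu")
tokenizer = AutoTokenizer.from_pretrained("/path/to/text/encoder")
# --rescale_betas_zero_snr / --prediction_type
scheduler = EulerAncestralDiscreteScheduler(prediction_type="v_prediction",
                                            rescale_betas_zero_snr=True,
                                            timestep_spacing="trailing")

pipeline = OpenSoraPlanPipeline13(vae=vae, text_encoder=text_encoder, tokenizer=tokenizer,
                                  transformer=transformer, scheduler=scheduler)

with torch.no_grad():
    videos = pipeline("high quality, high aesthetic, a serene lake at sunrise",
                      num_frames=93, height=640, width=640,          # --num_frames / --height / --width
                      num_inference_steps=100, guidance_scale=7.5,   # --num_sampling_steps / --guidance_scale
                      max_sequence_length=512)[0]
imageio.mimwrite("sample.mp4", videos[0], fps=24, quality=6)         # --save_img_path / --fps
```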
### 3.3 Accuracy comparison against GPU
When comparing NPU and GPU results, the two servers must have the same CPU architecture. With the same random seed they then produce identical random numbers on the CPU, which are transferred to the GPU or NPU.

The required changes are listed below.

- 1. mindiesd/pipeline/open_sora_plan_pipeline.py,
line 386, where the latent data is generated:
```python
latents = torch.randn(shape, dtype=dtype, device=device)
```

change to

```python
torch.manual_seed(seed)
latents = torch.randn(shape).to(dtype).to(device)
```
On the GPU side, in the Open-Sora-Plan 1.3 open-source code,
/Open-Sora-Plan/opensora/sample/pipeline_opensora.py, line 435:

```python
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
```

change to

```python
latents = randn_tensor(shape, generator=generator).to(dtype).to(device)
```

To avoid divergence, set the same random seed on the GPU and the NPU before the randn call.

- 2. mindiesd/schedulers/scheduling_euler_ancestral_discrete.py,
line 365:

```python
noise = torch.randn(model_output.shape, dtype=model_output.dtype, device=device)
```

change to

```python
noise = torch.randn(model_output.shape).to(model_output.dtype).to(device)
```
On the GPU side, modify the scheduler in the diffusers library,
diffusers/schedulers/scheduling_euler_ancestral_discrete.py, line 427:
```python
noise = randn_tensor(model_output.shape, dtype=model_output.dtype, device=device, generator=generator)
```

change to

```python
noise = randn_tensor(model_output.shape, generator=generator).to(model_output.dtype).to(device)
```
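
Once both runs are seeded identically, the intermediate tensors can be compared offline. The helper below is a hypothetical sketch: it assumes you add a `torch.save(latents, ...)` call at the same point in both code bases and then load the two dumps on the CPU.

```python
# compare_dumps.py -- hypothetical helper for NPU-vs-GPU tensor comparison
import torch

gpu_t = torch.load("latents_gpu.pt", map_location="cpu").float()
npu_t = torch.load("latents_npu.pt", map_location="cpu").float()

abs_err = (gpu_t - npu_t).abs()
rel_err = abs_err / (gpu_t.abs() + 1e-6)
cos = torch.nn.functional.cosine_similarity(gpu_t.flatten(), npu_t.flatten(), dim=0)

print(f"max abs err : {abs_err.max().item():.6e}")
print(f"mean rel err: {rel_err.mean().item():.6e}")
print(f"cosine sim  : {cos.item():.6f}")
```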
### 3.4 VBench accuracy test (GPU-based)
Set the video path:
```shell
input_path='./32_512_512_videos/'
```

Set the prompt path:
```shell
prompt_path='./t2v_sora.txt'
```

Set the json file path:
```shell
json_output_path='./bf16_32_512_512.json'
```

1. Generate the json file
```shell
python generate_prompt.py --input_path ${input_path} --prompt_path ${prompt_path} --output_path ${json_output_path}
```
2. Download the VBench repository
```shell
git clone https://github.com/Vchitect/VBench.git
cd VBench
git checkout bbbf5854e7c1cbada3bd321d1f2037af890fb343
cd ..
```
3. Run the evaluation script to obtain the accuracy results

Set the device id:
```shell
device_id=0
```

Set the VBench weight path:
```shell
vbench_weight_path='./weight/VBench'
```

Set the output path for the VBench results:
```shell
output_path='./vbench_result/'
```
```shell
cd VBench
CUDA_VISIBLE_DEVICES=${device_id} python evaluate.py --prompt_file ${json_output_path} \
    --dimension 'subject_consistency' 'motion_smoothness' 'dynamic_degree' 'aesthetic_quality' 'imaging_quality' 'overall_consistency' \
    --mode custom_input --videos_path ${input_path} \
    --load_ckpt_from_local ${vbench_weight_path} \
    --output_path ${output_path}
```

## 4. DiT Cache

### 4.1 Background
The mainstream architecture for text-to-image/video generation is currently DiT (Diffusion Transformer). The iterative sampling steps of inference exhibit considerable feature similarity and redundant computation, and caching that computation raises the following problems and challenges:

1. A cache policy over the model's redundant computation can severely degrade the accuracy of the cached model.
2. Step-to-step feature similarity differs noticeably between models, so cross-model generalization must be considered.
3. The cache algorithm has to remain compatible with other inference acceleration techniques.

### 4.2 How it works
1. Core idea:
exploit the similarity of activations between adjacent sampling steps to reuse local model features, reduce redundant computation, and significantly accelerate inference.

2. Cache principle - introduce an offset derived from the current input:
building on the step-to-step feature similarity of DiT models, DiT-Cache does not simply reuse the previous step's activations. Instead it combines the cached activations with an offset computed from the current step's input, which reduces computation while preserving accuracy (see the toy sketch after this list).

3. Cache policy - per-step differences:
to preserve accuracy, different sampling steps use different caching policies (cache ratio, cache position, and other dimensions and granularities), because feature similarity varies from step to step.

4. Orthogonality - the fully computed part can be combined with other optimizations:
DiT-Cache is orthogonal to parallelization algorithms such as DSP; the fully computed layers can still run in parallel. DiT-Cache reduces computation by reusing features, while DSP raises computational efficiency through multi-card parallelism, maximizing the end-to-end speedup.
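
The following is a toy sketch of the delta-cache idea, not the MindIE SD implementation: on a refresh step the cached block range is computed in full and the offset it adds to its input is stored; on a cached step that stored offset is simply re-applied to the current input.

```python
import torch
import torch.nn as nn

class DeltaCachedBlocks(nn.Module):
    """Toy illustration of delta caching over a range of transformer blocks."""

    def __init__(self, blocks, step_start=5, step_interval=2):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)   # the block range eligible for caching
        self.step_start = step_start          # first step at which caching may kick in
        self.step_interval = step_interval    # refresh the cache every N steps afterwards
        self.delta_cache = None

    def forward(self, x, step_id):
        refresh = step_id < self.step_start or (step_id - self.step_start) % self.step_interval == 0
        if refresh or self.delta_cache is None:
            out = x
            for blk in self.blocks:
                out = blk(out)
            self.delta_cache = out - x        # offset contributed by the cached range
            return out
        return x + self.delta_cache           # reuse the offset, applied to the *current* input

blocks = [nn.Linear(16, 16) for _ in range(4)]
cached_range = DeltaCachedBlocks(blocks)
x = torch.randn(2, 16)
for step in range(10):
    x = cached_range(x, step)
```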
### 4.3 Usage (Open-Sora-Plan 1.3 with the static DiT-Cache policy as an example)
Different cache policies give rise to different cache variants. The static cache policy is used below as an example of how to apply the cache algorithm.
1. Parameter configuration

| Parameter | Type | Description |
|----------------------|--------|--------------------------------------------------------------|
| use_cache | bool | Whether to enable the cache mechanism. Default: True. |
| cache_step_interval | int | Number of steps between cache refreshes. Changing the default is not recommended; if you do, it must not exceed the maximum number of sampling steps. Default: 2. |
| cache_block_start | int | Index of the first cached block. Changing the default is not recommended; if you do, it must not exceed the number of blocks in the model. Default: 3. |
| cache_num_blocks | int | Number of cached blocks. Changing the default is not recommended; if you do, it must not exceed the number of blocks in the model. Default: 13. |
| cache_step_start | int | Sampling step at which caching starts. Changing the default is not recommended; if you do, it must not exceed the maximum number of sampling steps. Default: 5. |

2. Inference changes
If the DiT-Cache policy is disabled, the original forward function is used.
If it is enabled, the forward pass over the DiT blocks is split into three stages:

- infer [0, cache_start)
- infer [cache_start, cache_end): if the current step does not use the cache, compute delta_cache for the next step; if it does, reuse the previous step's delta_cache.
- infer [cache_end, num_blocks)

```python
def _transformer_blocks_inference(
        self,
        step_id,
        transformer_inputs,
        timestep,
        video_info
):
    if not self.use_cache or (self.use_cache and step_id < self.cache_step_start):
        hidden_states = self._transformer_blocks_forward(
            transformer_inputs=transformer_inputs,
            timestep=timestep,
            video_info=video_info,
            start_id=0,
            end_id=self.config.num_layers
        )
    else:
        _, encoder_hidden_states, sparse_mask = transformer_inputs
        # 1.0 infer [0, cache_start)
        hidden_states_pre_cache = self._transformer_blocks_forward(
            transformer_inputs=transformer_inputs,
            timestep=timestep,
            video_info=video_info,
            start_id=0,
            end_id=self.cache_block_start
        )
        transformer_inputs = (hidden_states_pre_cache, encoder_hidden_states, sparse_mask)
        # 2.0 infer [cache_start, cache_end)
        cache_end = min(self.cache_block_start + self.cache_num_blocks, self.config.num_layers)
        if (step_id - self.cache_step_start) % self.cache_step_interval == 0:
            hidden_states = self._transformer_blocks_forward(
                transformer_inputs=transformer_inputs,
                timestep=timestep,
                video_info=video_info,
                start_id=self.cache_block_start,
                end_id=cache_end
            )
            self.delta_cache = hidden_states - hidden_states_pre_cache
        else:
            hidden_states = hidden_states_pre_cache + self.delta_cache
        # 3.0 infer [cache_end, num_blocks)
        if cache_end < self.config.num_layers:
            transformer_inputs = (hidden_states, encoder_hidden_states, sparse_mask)
            hidden_states = self._transformer_blocks_forward(
                transformer_inputs=transformer_inputs,
                timestep=timestep,
                video_info=video_info,
                start_id=cache_end,
                end_id=self.config.num_layers
            )
    return hidden_states
```

### 4.4 Adaptive search algorithm
Model structures differ, and so does the step-to-step activation similarity. To keep the accuracy and performance gains of DiT-Cache across models, a progressive search strategy is provided that finds the best cache policy for the current model, improving cross-model generalization; see the sketch below.
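
This directory also ships `inference_opensoraplan13_dit_cache_search.py`, which drives that search through `DitCacheSearcherConfig` and `DitCacheSearcher` from msmodelslim. A condensed sketch of how it is wired together (the pipeline construction and the body of the generation callback are elided; see the full script):

```python
from msmodelslim.pytorch.multimodal import DitCacheSearcherConfig, DitCacheSearcher

def build_pipeline():
    # construct the OpenSoraPlanPipeline13 exactly as in section 3.2 / inference_opensoraplan13.py
    ...

def generate_videos(config: DitCacheSearcherConfig, pipeline):
    # apply the candidate cache configuration to the transformer, run the pipeline over the
    # calibration prompts, and write the resulting videos into config.search_cache_path
    pipeline.transformer.use_cache = True
    pipeline.transformer.cache_step_start = config.cache_step_start
    pipeline.transformer.cache_step_interval = config.cache_step_interval
    pipeline.transformer.cache_block_start = config.cache_dit_block_start
    pipeline.transformer.cache_num_blocks = config.cache_num_dit_blocks
    ...

config = DitCacheSearcherConfig(
    dit_block_num=32,            # number of DiT blocks in the model
    prompts_num=1,               # number of calibration prompts
    num_sampling_steps=100,
    cache_ratio=1.2,
    search_cache_path='./cache',
    cache_step_start=0,
    cache_dit_block_start=0,
    cache_num_dit_blocks=0
)
pipeline = build_pipeline()
search_handler = DitCacheSearcher(config, pipeline, generate_videos)
# the result is reported in [cache_dit_block_start, cache_step_interval, cache_num_dit_blocks, cache_step_start] order
cache_final_list = search_handler.search()
```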
## 5. The CacheManager class
The CacheManager class is the single interface through which all cache algorithms are invoked.
To use it, configure the parameters of the cache policy you want to apply.
There are currently two config classes, CacheAgentConfig and DitCacheConfig. Their parameters should be set with the parameter search tool described above, which automatically finds the best settings.

The following shows how to use it in practice.

The static DiT-Cache has four parameters:
cache_step_interval, cache_block_start, cache_num_blocks and cache_step_start.

Assuming the optimal values have already been found, use CacheManager as follows.

Import:
```python
from mindiesd.layers.cache_mgr import CacheManager, DitCacheConfig
```

The original code in the DiT module looks like this:
```python
for block_idx, block in enumerate(self.transformer_blocks):
    if 1 < block_idx < 30:
        mask_group = sparse_mask.get(block.attn1.processor.sparse_n, None)
        attention_mask, encoder_attention_mask = mask_group.get(block.attn1.processor.sparse_group, None)
    else:
        mask_group = sparse_mask.get(1, None)
        attention_mask, encoder_attention_mask = mask_group.get(block.attn1.processor.sparse_group, None)

    hidden_states = block(hidden_states,
                          attention_mask=attention_mask,
                          encoder_hidden_states=encoder_hidden_states,
                          encoder_attention_mask=encoder_attention_mask,
                          timestep=timestep, frame=frame, height=height, width=width,
                          )
```

```python
# Initialize wherever appropriate.
config = DitCacheConfig(step_start=cache_step_start, step_interval=cache_step_interval,
                        block_start=cache_block_start, num_blocks=cache_num_blocks)

cache = CacheManager(config)


# Modify the place in the DiT code where the blocks are called.
for block_idx, block in enumerate(self.transformer_blocks):
    if 1 < block_idx < 30:
        mask_group = sparse_mask.get(block.attn1.processor.sparse_n, None)
        attention_mask, encoder_attention_mask = mask_group.get(block.attn1.processor.sparse_group, None)
    else:
        mask_group = sparse_mask.get(1, None)
        attention_mask, encoder_attention_mask = mask_group.get(block.attn1.processor.sparse_group, None)
    # changed here
    hidden_states = cache(block, step_id, block_idx,
                          hidden_states,
                          attention_mask=attention_mask,
                          encoder_hidden_states=encoder_hidden_states,
                          encoder_attention_mask=encoder_attention_mask,
                          timestep=timestep, frame=frame, height=height, width=width,
                          )
```
Simply pass the block itself to `cache` as the first argument, together with `step_id` and `block_idx`, which are the current sampling step and the current block index respectively; both start from 0.

The remaining arguments are exactly the ones that would normally be passed to the block.


## 6. AdaStep sampling optimization

### 6.1 Introduction
During iterative sampling, adjacent denoising steps often change the output only slightly. AdaStep is an adaptive step-skipping algorithm: based on how much the model output changes between steps, it automatically decides which sampling steps can be skipped, reducing the number of model evaluations and speeding up sampling.

### 6.2 Usage

AdaStep takes three parameters: skip_thr, max_skip_steps and decay_ratio. Their meanings are as follows.

- skip_thr: when the difference between consecutive steps is smaller than skip_thr, the current step is skipped.
- max_skip_steps: after max_skip_steps consecutive skipped steps, a full computation is forced.
- decay_ratio: skip_thr decays by this ratio as sampling proceeds.

To use it, wrap the transformer call with the AdaStep instance, i.e. pass self.transformer to the strategy as its first argument, as shown below.
Import:
```python
from mindiesd.pipeline.sampling_optm import AdaStep
```
```python
# 1. Initialization
skip_strategy = AdaStep(skip_thr=0.015, max_skip_steps=4, decay_ratio=0.99, device="npu")
pipeline.skip_strategy = skip_strategy


# In the denoising loop of the pipeline:

for step_id, t in enumerate(tqdm(timesteps)):
    # expand the latents if we are doing classifier free guidance
    latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
    scale_model = False
    if hasattr(self.scheduler, "scale_model_input"):
        latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
        scale_model = True
    # Expand scalar t to a 1-D tensor to match the 1st dim of latent_model_input
    timestep = torch.tensor([t] * latent_model_input.shape[0], device=self.device).to(
        dtype=latent_model_input.dtype)

    # 2. New branch: when the sampling optimization is enabled, pass self.transformer to the strategy.
    #    The remaining arguments are the original ones.
    if self.skip_strategy:
        noise_pred = self.skip_strategy(self.transformer,
                                        latent_model_input,
                                        attention_mask=attention_mask,
                                        encoder_hidden_states=prompt_embeds,
                                        encoder_attention_mask=prompt_attention_mask,
                                        timestep=timestep,
                                        pooled_projections=prompt_embeds_2,
                                        step_id=step_id,
                                        )
    else:
        noise_pred = self.transformer(
            latent_model_input,
            attention_mask=attention_mask,
            encoder_hidden_states=prompt_embeds,
            encoder_attention_mask=prompt_attention_mask,
            timestep=timestep,
            pooled_projections=prompt_embeds_2,
            step_id=step_id,
        )[0]

    if self.do_classifier_free_guidance:
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    if self.do_classifier_free_guidance and guidance_rescale > 0.0 and scale_model:
        noise_pred = rescale_noise_cfg(noise_pred, noise_pred_text, guidance_rescale=guidance_rescale)
    # Compute the previous noisy sample x_t -> x_t-1
    latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]

    # 3. Update the state of the sampling optimization
    if self.skip_strategy:
        self.skip_strategy.update_strategy(latents, sequence_parallel=True)

# 4. Finally, reset the state
if self.skip_strategy:
    self.skip_strategy.reset_status()
```
diff --git a/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/inference_opensoraplan13.py b/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/inference_opensoraplan13.py
new file mode 100644
index 0000000000000000000000000000000000000000..967a8a2ff6e04e43ea5a49690f69adef13a555fe
--- /dev/null
+++ b/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/inference_opensoraplan13.py
@@ -0,0 +1,167 @@
+#!/usr/bin/env python
+# coding=utf-8
+# Copyright 2024 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+ +import os +import argparse +import time +import logging + +import torch +import torch_npu +import torch.distributed as dist +import imageio + +from transformers import AutoTokenizer, MT5EncoderModel +from mindiesd.pipeline.open_soar_plan_pipeline import OpenSoraPlanPipeline13 +from mindiesd.schedulers.scheduling_euler_ancestral_discrete import EulerAncestralDiscreteScheduler +from mindiesd.models.t2vdit import OpenSoraT2Vv1_3 +from mindiesd.models.wfvae import WFVAEModelWrapper, ae_stride_config +from mindiesd.utils import set_random_seed +from mindiesd.models.parallel_mgr import init_parallel_env, get_sequence_parallel_rank + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + + +def parse_arguments(): + parser = argparse.ArgumentParser(description='Test Pipeline Argument Parser') + + parser.add_argument('--model_path', type=str, required=True, help='Path to the model directory') + parser.add_argument('--version', type=str, default='v1_3', help='Version of the model') + parser.add_argument('--dtype', type=str, default='fp16', help='Data type used in inference') + parser.add_argument('--num_frames', type=int, default=93, help='Number of frames') + parser.add_argument('--height', type=int, default=720, help='Height of the frames') + parser.add_argument('--width', type=int, default=1280, help='Width of the frames') + parser.add_argument('--text_encoder_name_1', type=str, required=True, help='Path to the text encoder model') + parser.add_argument('--text_prompt', type=str, required=True, help='Text prompt for the model') + parser.add_argument('--ae', type=str, default='WFVAEModel_D8_4x8x8', help='Autoencoder model type') + parser.add_argument('--ae_path', type=str, required=True, help='Path to the autoencoder model') + parser.add_argument('--save_img_path', type=str, default='./test', help='Path to save images') + parser.add_argument('--fps', type=int, default=24, help='Frames per second') + parser.add_argument('--guidance_scale', type=float, default=7.5, help='Guidance scale for the model') + parser.add_argument('--num_sampling_steps', type=int, default=10, help='Number of sampling steps') + parser.add_argument('--max_sequence_length', type=int, default=512, help='Maximum sequence length') + parser.add_argument('--seed', type=int, default=1234, help='Random seed') + parser.add_argument('--num_samples_per_prompt', type=int, default=1, help='Number of samples per prompt') + parser.add_argument('--rescale_betas_zero_snr', action='store_true', help='Rescale betas zero SNR') + parser.add_argument('--prediction_type', type=str, default='v_prediction', help='Type of prediction') + parser.add_argument('--save_memory', action='store_true', help='Save memory during processing') + parser.add_argument('--enable_tiling', action='store_true', help='Enable tiling for processing') + parser.add_argument('--sp', action='store_true') + parser.add_argument('--use_cache', action='store_true') + parser.add_argument('--cache_sampling_step_start', type=int, default=20, help='Sampling step begins to use cache') + parser.add_argument('--cache_sampling_step_interval', type=int, default=2, help='Sampling step interval of cache') + parser.add_argument('--cache_dit_block_start', type=int, default=2, help='DiT block id begins to be cached') + parser.add_argument('--cache_num_dit_blocks', type=int, default=20, help='DiT blocks cached in each step') + args = parser.parse_args() + return args + + +def infer(args): + dtype = torch.bfloat16 + if args.dtype == 'bf16': + dtype = torch.bfloat16 + 
elif args.dtype == 'fp16': + dtype = torch.float16 + else: + logger.error("Not supported.") + # === Initialize Distributed === + init_parallel_env(args.sp) + + set_random_seed(args.seed + get_sequence_parallel_rank()) + + negative_prompt = """ + nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, + low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry. + """ + positive_prompt = """ + high quality, high aesthetic, {} + """ + if not os.path.exists(args.save_img_path): + os.makedirs(args.save_img_path, exist_ok=True) + + if not isinstance(args.text_prompt, list): + args.text_prompt = [args.text_prompt] + if len(args.text_prompt) == 1 and args.text_prompt[0].endswith('txt'): + text_prompt = open(args.text_prompt[0], 'r').readlines() + args.text_prompt = [i.strip() for i in text_prompt] + + vae = WFVAEModelWrapper.from_pretrained(args.ae_path, dtype=torch.float16).to("npu").eval() + vae.vae_scale_factor = ae_stride_config[args.ae] + transformer = OpenSoraT2Vv1_3.from_pretrained(args.model_path).to(dtype).to("npu").eval() + if args.use_cache: + if args.cache_sampling_step_start == 0 or args.cache_num_dit_blocks == 0: + logger.error("cache_sampling_step_start and cache_num_transformer_blocks should be greater than zero") + transformer.use_cache = False + else: + transformer.use_cache = True + transformer.cache_step_start = args.cache_sampling_step_start + transformer.cache_step_interval = args.cache_sampling_step_interval + transformer.cache_block_start = args.cache_dit_block_start + transformer.cache_num_blocks = args.cache_num_dit_blocks + + kwargs = dict( + prediction_type=args.prediction_type, + rescale_betas_zero_snr=args.rescale_betas_zero_snr, + timestep_spacing="trailing" if args.rescale_betas_zero_snr else 'leading', + ) + scheduler = EulerAncestralDiscreteScheduler(**kwargs) + text_encoder = MT5EncoderModel.from_pretrained(args.text_encoder_name_1, + torch_dtype=dtype).eval().to(dtype).to("npu") + tokenizer = AutoTokenizer.from_pretrained(args.text_encoder_name_1) + + if args.save_memory: + vae.vae.enable_tiling() + vae.vae.t_chunk_enc = 8 + vae.vae.t_chunk_dec = 2 + + pipeline = OpenSoraPlanPipeline13(vae=vae, + text_encoder=text_encoder, + tokenizer=tokenizer, + transformer=transformer, + scheduler=scheduler) + + with torch.no_grad(): + for i, input_prompt in enumerate(args.text_prompt): + input_prompt = positive_prompt.format(input_prompt) + start_time = time.time() + videos = pipeline( + input_prompt, + negative_prompt=negative_prompt, + num_frames=args.num_frames, + height=args.height, + width=args.width, + num_inference_steps=args.num_sampling_steps, + guidance_scale=args.guidance_scale, + num_samples_per_prompt=args.num_samples_per_prompt, + max_sequence_length=args.max_sequence_length, + )[0] + torch.npu.synchronize() + use_time = time.time() - start_time + logger.info("use_time: %.3f", use_time) + imageio.mimwrite( + os.path.join( + args.save_img_path, + f's{args.num_sampling_steps}_prompt{i}.mp4' + ), + videos[0], + fps=args.fps, + quality=6 + ) # highest quality is 10, lowest is 0 + +if __name__ == "__main__": + inference_args = parse_arguments() + infer(inference_args) \ No newline at end of file diff --git a/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/inference_opensoraplan13_dit_cache_search.py b/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/inference_opensoraplan13_dit_cache_search.py new file mode 100644 index 
0000000000000000000000000000000000000000..eee04e015bd456964bbcd49fdc2638715e77f816 --- /dev/null +++ b/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/inference_opensoraplan13_dit_cache_search.py @@ -0,0 +1,199 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse +import time +import logging + +import torch +import torch_npu +import torch.distributed as dist +import imageio + +from transformers import AutoTokenizer, MT5EncoderModel +from mindiesd.pipeline.open_soar_plan_pipeline import OpenSoraPlanPipeline13 +from mindiesd.schedulers.scheduling_euler_ancestral_discrete import EulerAncestralDiscreteScheduler +from mindiesd.models.t2vdit import OpenSoraT2Vv1_3 +from mindiesd.models.wfvae import WFVAEModelWrapper, ae_stride_config +from mindiesd.utils import set_random_seed +from mindiesd.models.parallel_mgr import init_parallel_env, get_sequence_parallel_rank +from msmodelslim.pytorch.multimodal import DitCacheSearcherConfig, DitCacheSearcher + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + + +def parse_arguments(): + parser = argparse.ArgumentParser(description='Test Pipeline Argument Parser') + + parser.add_argument('--model_path', type=str, required=True, help='Path to the model directory') + parser.add_argument('--version', type=str, default='v1_3', help='Version of the model') + parser.add_argument('--dtype', type=str, default='fp16', help='Data type used in inference') + parser.add_argument('--num_frames', type=int, default=93, help='Number of frames') + parser.add_argument('--height', type=int, default=720, help='Height of the frames') + parser.add_argument('--width', type=int, default=1280, help='Width of the frames') + parser.add_argument('--text_encoder_name_1', type=str, required=True, help='Path to the text encoder model') + parser.add_argument('--text_prompt', type=str, required=True, help='Text prompt for the model') + parser.add_argument('--ae', type=str, default='WFVAEModel_D8_4x8x8', help='Autoencoder model type') + parser.add_argument('--ae_path', type=str, required=True, help='Path to the autoencoder model') + parser.add_argument('--save_img_path', type=str, default='./test', help='Path to save images') + parser.add_argument('--fps', type=int, default=24, help='Frames per second') + parser.add_argument('--guidance_scale', type=float, default=7.5, help='Guidance scale for the model') + parser.add_argument('--num_sampling_steps', type=int, default=10, help='Number of sampling steps') + parser.add_argument('--max_sequence_length', type=int, default=512, help='Maximum sequence length') + parser.add_argument('--seed', type=int, default=1234, help='Random seed') + parser.add_argument('--num_samples_per_prompt', type=int, default=1, help='Number of samples per prompt') + parser.add_argument('--rescale_betas_zero_snr', action='store_true', help='Rescale betas zero SNR') + parser.add_argument('--prediction_type', type=str, 
default='v_prediction', help='Type of prediction') + parser.add_argument('--save_memory', action='store_true', help='Save memory during processing') + parser.add_argument('--enable_tiling', action='store_true', help='Enable tiling for processing') + parser.add_argument('--sp', action='store_true') + parser.add_argument('--use_cache', action='store_true') + parser.add_argument('--cache_sampling_step_start', type=int, default=20, help='Sampling step begins to use cache') + parser.add_argument('--cache_sampling_step_interval', type=int, default=2, help='Sampling step interval of cache') + parser.add_argument('--cache_dit_block_start', type=int, default=2, help='DiT block id begins to be cached') + parser.add_argument('--cache_num_dit_blocks', type=int, default=20, help='DiT blocks cached in each step') + args = parser.parse_args() + return args + + +def prepare(): + global args + dtype = torch.bfloat16 + if args.dtype == 'bf16': + dtype = torch.bfloat16 + elif args.dtype == 'fp16': + dtype = torch.float16 + else: + logger.error("Not supported.") + # === Initialize Distributed === + init_parallel_env(True) + + if not os.path.exists(args.save_img_path): + os.makedirs(args.save_img_path, exist_ok=True) + + if not isinstance(args.text_prompt, list): + args.text_prompt = [args.text_prompt] + if len(args.text_prompt) == 1 and args.text_prompt[0].endswith('txt'): + text_prompt = open(args.text_prompt[0], 'r').readlines() + args.text_prompt = [i.strip() for i in text_prompt] + + vae = WFVAEModelWrapper.from_pretrained(args.ae_path, dtype=torch.float16).to("npu").eval() + vae.vae_scale_factor = ae_stride_config[args.ae] + transformer = OpenSoraT2Vv1_3.from_pretrained(args.model_path).to(dtype).to("npu").eval() + if args.use_cache: + if args.cache_sampling_step_start == 0 or args.cache_num_dit_blocks == 0: + logger.error("cache_sampling_step_start and cache_num_transformer_blocks should be greater than zero") + transformer.use_cache = False + else: + transformer.use_cache = True + transformer.cache_step_start = args.cache_sampling_step_start + transformer.cache_step_interval = args.cache_sampling_step_interval + transformer.cache_block_start = args.cache_dit_block_start + transformer.cache_num_blocks = args.cache_num_dit_blocks + + kwargs = dict( + prediction_type=args.prediction_type, + rescale_betas_zero_snr=args.rescale_betas_zero_snr, + timestep_spacing="trailing" if args.rescale_betas_zero_snr else 'leading', + ) + scheduler = EulerAncestralDiscreteScheduler(**kwargs) + text_encoder = MT5EncoderModel.from_pretrained(args.text_encoder_name_1, + torch_dtype=dtype).eval().to(dtype).to("npu") + tokenizer = AutoTokenizer.from_pretrained(args.text_encoder_name_1) + + if args.save_memory: + vae.vae.enable_tiling() + vae.vae.t_chunk_enc = 8 + vae.vae.t_chunk_dec = 2 + + pipeline = OpenSoraPlanPipeline13(vae=vae, + text_encoder=text_encoder, + tokenizer=tokenizer, + transformer=transformer, + scheduler=scheduler) + + return pipeline + +def generate_videos(config: DitCacheSearcherConfig, pipeline): + global args + local_rank = int(os.getenv('RANK', 0)) + if local_rank >= 0: + torch.manual_seed(args.seed + local_rank) + pipeline.transformer.use_cache = True + pipeline.transformer.cache_step_start = config.cache_step_start + pipeline.transformer.cache_step_interval = config.cache_step_interval + pipeline.transformer.cache_block_start = config.cache_dit_block_start + pipeline.transformer.cache_num_blocks = config.cache_num_dit_blocks + save_path = config.search_cache_path + if not os.path.exists(save_path): + 
os.makedirs(save_path, exist_ok=True) + negative_prompt = """ + nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, + low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry. + """ + positive_prompt = """ + high quality, high aesthetic, {} + """ + with torch.no_grad(): + for index, input_prompt in enumerate(args.text_prompt): + input_prompt = positive_prompt.format(input_prompt) + start_time = time.time() + videos = pipeline( + input_prompt, + negative_prompt=negative_prompt, + num_frames=args.num_frames, + height=args.height, + width=args.width, + num_inference_steps=args.num_sampling_steps, + guidance_scale=args.guidance_scale, + num_samples_per_prompt=args.num_samples_per_prompt, + max_sequence_length=args.max_sequence_length, + )[0] + torch.npu.synchronize() + use_time = time.time() - start_time + logger.info("use_time: %.3f", use_time) + if config.cache_num_dit_blocks == 0: + video_path = os.path.join(save_path, f'sample_{index:04d}_no_cache.mp4') + else: + video_path = os.path.join(save_path, + f'sample_{index:04d}_{config.cache_dit_block_start}_{config.cache_step_interval}_{config.cache_num_dit_blocks}_{config.cache_step_start}.mp4') + imageio.mimwrite( + video_path, + videos[0], + fps=args.fps, + quality=6 + ) # highest quality is 10, lowest is 0 + +if __name__ == "__main__": + global args + args = parse_arguments() + config = DitCacheSearcherConfig( + dit_block_num=32, + prompts_num=1, + num_sampling_steps=100, + cache_ratio=1.2, + search_cache_path='./cache', + cache_step_start=0, + cache_dit_block_start=0, + cache_num_dit_blocks=0 + ) + pipeline = prepare() + search_handler = DitCacheSearcher(config, pipeline, generate_videos) + cache_final_list = search_handler.search() + print(f'****************cache_final_list in \ + [cache_dit_block_start, cache_step_interval, cache_num_dit_blocks, cache_step_start] order: {cache_final_list}') diff --git a/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/requirements.txt b/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..a9545f5a1726646a2339d4ab3563e87af023e7cc --- /dev/null +++ b/MindIE/MindIE-Torch/built-in/foundation/open-sora-plan/requirements.txt @@ -0,0 +1,18 @@ +torch==2.1.0 +diffusers==0.29.0 +transformers==4.44.2 +open_clip_torch==2.20.0 +av==12.0.0 +tqdm==4.66.1 +timm==0.9.12 +tensorboard==2.11.0 +pre-commit==3.8.0 +mmengine==0.10.4 +ftfy==6.1.3 +accelerate==0.26.1 +bs4 +torchvision==0.16.0 +einops +numpy==1.24.0 +imageio +imageio-ffmpeg \ No newline at end of file