diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/README.md b/MindIE/MultiModal/OpenSoraPlan-1.0/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3f720edd7663ef15ff24ebf6f390f6ce645a9fd3 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/README.md @@ -0,0 +1,376 @@ +# MindIE SD + +## 一、介绍 +MindIE SD是MindIE的视图生成推理模型套件,其目标是为稳定扩散(Stable Diffusion, SD)系列大模型推理任务提供在昇腾硬件及其软件栈上的端到端解决方案,软件系统内部集成各功能模块,对外呈现统一的编程接口。 + +## 二、安装依赖 + +MindIE-SD其依赖组件为driver驱动包、firmware固件包、CANN开发套件包、推理引擎MindIE包,使用MindIE-SD前请提前安装这些依赖。 + +| 简称 | 安装包全名 | 默认安装路径 | 版本约束 | +| --------------- |---------------------------------------------------------------------------|--------------------------------------|-----------------------------------| +| driver驱动包 | 昇腾310P处理器对应驱动软件包:Ascend-hdk-310p-npu-driver_\{version\}\_{os}\-{arch}.run | /usr/local/Ascend | 24.0.rc1及以上 | +| firmware固件包 | 昇腾310P处理器对应固件软件包:Ascend-hdk-310p-npu-firmware_\{version\}.run | /usr/local/Ascend | 24.0.rc1及以上 | +| CANN开发套件包 | Ascend-cann-toolkit\_{version}_linux-{arch}.run | /usr/local/Ascend/ascend-toolkit/latest | 8.0.RC1及以上 | +| 推理引擎MindIE包 | Ascend-mindie\_\{version}_linux-\{arch}.run | /usr/local/Ascend/mindie/latest | 和mindietorch严格配套使用 | +| torch | Python的whl包:torch-{version}-cp310-cp310-{os}_{arch}.whl | - | Python版本3.10.x,torch版本支持2.1.0 | + +- {version}为软件包版本 +- {os}为系统名称,如Linux +- {arch}为架构名称,如x86_64 + +### 2.1 安装驱动和固件 + +1. 获取地址 +- [800I A2](https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=32&cann=8.0.RC1.beta1&driver=1.0.RC1.alpha) +- [Duo卡](https://www.hiascend.com/hardware/firmware-drivers/community?product=2&model=17&cann=8.0.RC2.alpha002&driver=1.0.22.alpha) +2. [安装指导手册](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/softwareinst/instg/instg_0019.html) +### 2.2 CANN开发套件包+kernel包+MindIE包下载 +1. 下载: +- [800I A2](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32) +- [Duo卡](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=2&model=17) +2. [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/softwareinst/instg/instg_0001.html) + +3. 
快速安装: +- CANN开发套件包+kernel包安装 +```commandline +# 增加软件包可执行权限,{version}表示软件版本号,{arch}表示CPU架构,{soc}表示昇腾AI处理器的版本。 +chmod +x ./Ascend-cann-toolkit_{version}_linux-{arch}.run +chmod +x ./Ascend-cann-kernels-{soc}_{version}_linux.run +# 校验软件包安装文件的一致性和完整性 +./Ascend-cann-toolkit_{version}_linux-{arch}.run --check +./Ascend-cann-kernels-{soc}_{version}_linux.run --check +# 安装 +./Ascend-cann-toolkit_{version}_linux-{arch}.run --install +./Ascend-cann-kernels-{soc}_{version}_linux.run --install + +# 设置环境变量 +source /usr/local/Ascend/ascend-toolkit/set_env.sh +``` +- MindIE包安装 +```commandline +# 增加软件包可执行权限,{version}表示软件版本号,{arch}表示CPU架构。 +chmod +x ./Ascend-mindie_${version}_linux-${arch}.run +./Ascend-mindie_${version}_linux-${arch}.run --check + +# 方式一:默认路径安装 +./Ascend-mindie_${version}_linux-${arch}.run --install +# 设置环境变量 +cd /usr/local/Ascend/mindie && source set_env.sh + +# 方式二:指定路径安装 +./Ascend-mindie_${version}_linux-${arch}.run --install-path=${AieInstallPath} +# 设置环境变量 +cd ${AieInstallPath}/mindie && source set_env.sh +``` + +- MindIE SD不需要单独安装,安装MindIE时将会自动安装 +- torch_npu 安装: +下载 pytorch_v{pytorchversion}_py{pythonversion}.tar.gz +```commandline +tar -xzvf pytorch_v{pytorchversion}_py{pythonversion}.tar.gz +# 解压后,会有whl包 +pip install torch_npu-{pytorchversion}.xxxx.{arch}.whl +``` + +### 2.3 pytorch框架(支持版本为:2.1.0) +[安装包下载](https://download.pytorch.org/whl/cpu/torch/) + +使用pip安装 +```shell +# {version}表示软件版本号,{arch}表示CPU架构。 +pip install torch-${version}-cp310-cp310-linux_${arch}.whl +``` + +## 三、OpenSora1.2使用 + +### 3.1 权重及配置文件说明 +#### 3.1.1 下载子模型权重 +首先需要下载以下子模型权重和配置文件:text_encoder、tokenizer、transformer、vae、scheduler + +1.text_encoder: +- 下载配置文件和权重文件,下载链接: +```shell + https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main +``` +下载后重命名文件夹为text_encoder + +2.tokenizer: +将上述下载的text_encoder的权重和配置文件中的tokenizer_config.json和spiece.model拷贝并单独存放至另一个文件夹,重命名文件夹为tokenizer + +3.transformer: +- 下载配置文件和权重文件,下载链接: +```shell +https://huggingface.co/hpcai-tech/OpenSora-STDiT-v3/tree/main +``` +下载后重命名文件夹为transformer + +4.vae: +vae需要下载两部分权重:VAE和VAE_2d + +(1) VAE +- 按以下链接下载权重和配置文件,并修改配置文件的architectures和model_type字段为VideoAutoencoder。参考MindIE-SD/examples/open-sora/vae/config.json。 +```shell +https://huggingface.co/hpcai-tech/OpenSora-VAE-v1.2/tree/main +``` +下载后重命名文件夹为vae + +(2) VAE_2d: +- 按以下链接下载配置文件和权重文件,在上述vae文件夹下新建vae_2d/vae目录,并将下载的权重文件放置在路径下。 +```shell +https://huggingface.co/PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers/tree/main +``` +5.scheduler: +- 采样器无需权重文件,配置文件参考MindIE-SD/examples/open-sora/scheduler/scheduler_config.json设置,并放置在scheduler文件夹下。 + +#### 3.1.2 配置Pipeline +1. 新建model_index.json配置文件,参考MindIE-SD/examples/open-sora-plan/model_index.json,与其他子模型文件夹同级目录。并将整体Pipeline权重文件夹命名为OpenSora1.2。 + +2. 
配置完成后示例如下。 +```commandline +|----OpenSora1.2 +| |---- model_index.json +| |---- scheduler +| | |---- scheduler_config.json +| |---- text_encoder +| | |---- config.json +| | |---- pytorch_model-00001-of-00002.bin +| | |---- pytorch_model-00002-of-00002.bin +| | |---- pytorch_model.bin.index.json +| | |---- special_tokens_map.json +| | |---- spiece.model +| | |---- tokenizer_config.json +| |---- tokenizer +| | |---- spiece.model +| | |---- tokenizer_config.json +| |---- transformer +| | |---- config.json +| | |---- model.safetensors +| |---- vae +| | |---- config.json +| | |---- model.safetensors +| | |---- vae_2d +| | | |---- vae +| | | | |---- config.json +| | | | |---- diffusion_pytorch_model.safetensors +``` + +### 3.2 安装依赖库 +进入MindIE-SD路径,安装MindIE-SD的依赖库。 +``` +pip install -r requirements.txt +``` +安装colossalai。 colossalai0.4.4 版本会自动安装高版本torch, 所以要单独安装。 +``` +pip install colossalai==0.4.4 --no-deps +``` +### 3.3 单卡性能测试 +设置权重路径 +```shell +path='./path' +``` +执行命令: +```shell +python tests/inference_opensora12.py \ + --path ${path} \ + --device_id 0 \ + --type bf16 \ + --num_frames 32 \ + --image_size 720,1280 \ + --fps 8 +``` +参数说明: +- path: 权重路径,包含vae、text_encoder、Tokenizer、Transformer和Scheduler五个模型的配置文件及权重。 +- device_id: 推理设备ID。 +- type: bf16、fp16。 +- num_frames:总帧数,范围:32, 128。 +- image_size:(720, 1280)、(512, 512)。 +- fps: 每秒帧数:8。 +- test_acc: 使用--test_acc开启全量视频生成,用于精度测试。性能测试时,不开启该参数。 + +### 3.4 多卡性能测试 +设置权重路径 +```shell +path='./path' +``` + +执行命令: +```shell +torchrun --nproc_per_node=4 tests/inference_opensora12.py \ + --path ${path} \ + --type bf16 \ + --num_frames 32 \ + --image_size (720,1280) \ + --fps 8 \ + --enable_sequence_parallelism True +``` +参数说明: +- nproc_per_node: 并行推理的总卡数。 +- enable_sequence_parallelism 开启dsp 多卡并行 +- path: 权重路径,包含vae、text_encoder、Tokenizer、Transformer和Scheduler五个模型的配置文件及权重。 +- type: bf16、fp16。 +- num_frames:总帧数,范围:32, 128。 +- image_size:(720, 1280)、(512, 512)。 +- fps: 每秒帧数:8。 + +精度测试参考第五节Vbench精度测试。 + +## 四、OpenSoraPlan1.0使用 + +### 4.1 权重及配置文件说明 +#### 4.1.1 下载子模型权重 +首先需要下载以下子模型权重:text_encoder、tokenizer、transformer、vae + +1.text_encoder: +- 下载配置文件和权重文件,下载链接: +```shell +https://huggingface.co/DeepFloyd/t5-v1_1-xxl/tree/main +``` +下载后重命名文件夹为text_encoder + +2.tokenizer: +将上述下载的text_encoder的权重和配置文件中的tokenizer_config.json和spiece.model拷贝并单独存放至另一个文件夹,重命名文件夹为tokenizer + +3.transformer: +- 下载配置文件和权重文件,根据需要下载不同分辨率和帧数的权重和配置文件,当前支持17x256x256、65x256x256、65x512x512三种规格,选择一种规格下载即可。下载链接: +```shell +https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main +``` +下载完成后重命名文件夹为transformer + +4.vae: +- 下载配置文件和权重文件,下载该链接下的vae文件夹: +```shell +https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main +``` + +#### 4.1.2 配置Pipeline +1.新建model_index.json配置文件,参考MindIE-SD/examples/open-sora-plan/model_index.json,与其他子模型文件夹同级目录。并将整体Pipeline权重文件夹命名为Open-Sora-Plan-v1.0.0。 + +2.配置完成后示例如下。 +```commandline +|----Open-Sora-Plan-v1.0.0 +| |---- model_index.json +| |---- text_encoder +| | |---- config.json +| | |---- pytorch_model-00001-of-00002.bin +| | |---- pytorch_model-00002-of-00002.bin +| | |---- pytorch_model.bin.index.json +| | |---- special_tokens_map.json +| | |---- spiece.model +| | |---- tokenizer_config.json +| |---- tokenizer +| | |---- spiece.model +| | |---- tokenizer_config.json +| |---- transformer +| | |---- config.json +| | |---- diffusion_pytorch_model.safetensors +| |---- vae +| | |---- config.json +| | |---- diffusion_pytorch_model.safetensors +``` + +### 4.2 安装依赖库 +进入MindIE-SD/mindiesd/requirements路径,安装open-sora-plan1.0的依赖库。 +``` +pip install -r 
requirements_opensoraplan.txt +``` +安装colossalai。 colossalai0.4.4 版本会自动安装高版本torch, 所以要单独安装。 +``` +pip install colossalai==0.4.4 --no-deps +``` +### 4.3 单卡性能测试 +设置权重路径 +```shell +model_path='./model_path' +``` +执行命令: +```shell +python tests/inference_opensora_plan.py \ + --model_path ${model_path} \ + --text_prompt tests/t2v_sora.txt \ + --sample_method PNDM \ + --save_img_path ./sample_videos/t2v_PNDM \ + --image_size 512 \ + --fps 24 \ + --guidance_scale 7.5 \ + --num_sampling_steps 250 \ + --seed 5464 +``` +参数说明: +- model_path: 权重路径,包含vae、text_encoder、Tokenizer、Transformer和Scheduler五个模型的配置文件及权重。 +- text_prompt: 输入prompt,可以为list形式或txt文本文件(按行分割)。 +- sample_method:采样器名称,默认PNDM,只支持['DDIM', 'EulerDiscrete', 'DDPM', 'DPMSolverMultistep','DPMSolverSinglestep', 'PNDM', 'HeunDiscrete', 'EulerAncestralDiscrete', 'DEISMultistep', 'KDPM2AncestralDiscrete']。若要使用采样步数优化,则选择"DPMSolverSinglestep"或"DDPM"。 +- save_img_path:生成视频的保存路径,默认./sample_videos/t2v。 +- image_size:生成视频的分辨率,需与下载的transformer权重版本对应,为512或256。 +- fps:生成视频的帧率,默认24。 +- guidance_scale:生成视频中cfg的参数,默认7.5。 +- num_sampling_steps:生成视频采样迭代次数,默认250步。若采用"DPMSolverSinglestep"或"DDPM"采样器,可设置为50步。 +- seed: 随机种子设置。 + +注:若出现"RuntimeError: NPU out of memory."报错,可能是torch_npu最新版本默认把虚拟内存关闭,可尝试设置```export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True```环境变量解决。 + +### 4.4 多卡性能测试 +设置权重路径、并行推理的总卡数 +```shell +model_path='./model_path' +NUM_DEVICES=4 +``` + +执行命令: +```shell +ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=$NUM_DEVICES tests/inference_opensora_plan.py \ + --model_path ${model_path} \ + --text_prompt tests/t2v_sora.txt \ + --sample_method PNDM \ + --save_img_path ./sample_videos/t2v_PNDM_dsp_$NUM_DEVICES \ + --image_size 512 \ + --fps 24 \ + --guidance_scale 7.5 \ + --num_sampling_steps 250 \ + --seed 5464 \ + --sequence_parallel_size $NUM_DEVICES +``` +参数说明: +- ASCEND_RT_VISIBLE_DEVICES: 指定使用的具体推理设备ID +- nproc_per_node: 并行推理的总卡数。 +- sequence_parallel_size: 序列并行的数量。 +其余参数同上 + +### 4.5 开启patch相似性压缩测试 +设置权重路径 +```shell +model_path='./model_path' +``` +执行命令: +```shell +python tests/inference_opensora_plan.py \ + --model_path ${model_path} \ + --text_prompt tests/t2v_sora.txt \ + --sample_method PNDM \ + --save_img_path ./sample_videos/t2v_PNDM \ + --image_size 512 \ + --fps 24 \ + --guidance_scale 7.5 \ + --num_sampling_steps 250 \ + --use_cache \ + --cache_config 5,27,5,2 \ + --cfg_last_step 150 \ + --seed 5464 +``` +参数说明: +- use_cache: 是否开启DiT-Cache,不设置则不开启。 +- cache_config: DiT-Cache的配置参数,需设置4个数值,分别为start_block_idx, end_block_idx, start_step, step_interval。 +- cfg_last_step:开启跳过cfg计算的步数。 +其余参数同上。 + +精度测试参考第五节Vbench精度测试。 + +## 五、Vbench精度测试(基于gpu) +1、视频生成完成后,精度测试推荐使用业界常用的VBench(Video Benchmark)工具,详见如下链接: +```shell +https://github.com/Vchitect/VBench +``` +2、当前主要评估指标为[subject_consistency, imaging_quality, aesthetic_quality, overall_consistency, motion_smoothness]。 + +注:vbench各精度指标平均下降不超过1%可认为该视频质量无下降。 diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/inference_opensora_plan.py b/MindIE/MultiModal/OpenSoraPlan-1.0/inference_opensora_plan.py new file mode 100644 index 0000000000000000000000000000000000000000..a9f9d421671c1270b8d1f91c41ea571d8e9309f8 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/inference_opensora_plan.py @@ -0,0 +1,166 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +import time +import argparse +import logging + +import torch +import torch_npu +from torchvision.utils import save_image +import imageio +import colossalai + +sys.path.append(os.path.split(sys.path[0])[0]) + +from opensoraplan import OpenSoraPlanPipeline +from opensoraplan import compile_pipe, get_scheduler, set_parallel_manager +from opensoraplan import CacheConfig, OpenSoraPlanDiTCacheManager + +MASTER_PORT = '42043' + + +def main(args): + torch.manual_seed(args.seed) + torch.npu.manual_seed(args.seed) + torch.npu.manual_seed_all(args.seed) + torch.set_grad_enabled(False) + device = "npu" if torch.npu.is_available() else "cpu" + + sp_size = args.sequence_parallel_size + if sp_size == 1: + os.environ['RANK'] = '0' + os.environ['LOCAL_RANK'] = '0' + os.environ['WORLD_SIZE'] = '1' + os.environ['MASTER_ADDR'] = 'localhost' + os.environ['MASTER_PORT'] = MASTER_PORT + colossalai.launch_from_torch({}, seed=args.seed) + set_parallel_manager(sp_size=args.sequence_parallel_size, sp_axis=0) + + if args.force_images: + ext = 'jpg' + else: + ext = 'mp4' + scheduler = get_scheduler(args.sample_method) + # load the pipeline model weights and config + videogen_pipeline = OpenSoraPlanPipeline.from_pretrained(model_path=args.model_path, + image_size=args.image_size, + scheduler=scheduler, + dtype=torch.float16, + vae_stride=args.vae_stride) + # prepare the cache_manager + cache_nums = [int(i) for i in args.cache_config.split(',')] + if len(cache_nums) != 4: + raise ValueError("cache_config num length must equals 4.") + cache_manager = OpenSoraPlanDiTCacheManager( + CacheConfig(cache_nums[0], cache_nums[1], cache_nums[2], cache_nums[3], args.use_cache)) + # compile pipeline and set the cache_manager and cfg_last_step + videogen_pipeline = compile_pipe(videogen_pipeline, cache_manager, args.cfg_last_step) + + if not os.path.exists(args.save_img_path): + os.makedirs(args.save_img_path) + + # read the prompt contents + if not isinstance(args.text_prompt, list): + args.text_prompt = [args.text_prompt] + if len(args.text_prompt) == 1 and args.text_prompt[0].endswith('txt'): + text_prompt = open(args.text_prompt[0], 'r').readlines() + args.text_prompt = [i.strip() for i in text_prompt] + args.text_prompt = args.text_prompt + + time_list = [] + # pipeline inference + for idx, prompt in enumerate(args.text_prompt): + torch_npu.npu.synchronize() + start_time = time.time() + torch.manual_seed(args.seed) + torch.npu.manual_seed(args.seed) + torch.npu.manual_seed_all(args.seed) + logging.info('Processing the (%s) prompt', prompt) + videos = videogen_pipeline(prompt, + num_inference_steps=args.num_sampling_steps, + guidance_scale=args.guidance_scale, + enable_temporal_attentions=not args.force_images, + num_images_per_prompt=1, + ).video + if videogen_pipeline.transformer.cache_manager.cal_block_num != 0: + ratio = ( + videogen_pipeline.transformer.cache_manager.all_block_num + / videogen_pipeline.transformer.cache_manager.cal_block_num + ) + else: + raise ZeroDivisionError("transformer cal_block_num can not be zero.") + logging.info("cal_block_ratio: %.2f, %d, %d", + ratio, 
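+                         # all_block_num counts every spatial/temporal block scheduled across the
+                         # denoising steps, cal_block_num counts the blocks actually executed, so a
+                         # ratio above 1.0 roughly reflects how much work DiT-Cache skipped.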
videogen_pipeline.transformer.cache_manager.cal_block_num, + videogen_pipeline.transformer.cache_manager.all_block_num) + torch_npu.npu.synchronize() + time_list.append(time.time() - start_time) + try: + if args.force_images: + videos = videos[:, 0].permute(0, 3, 1, 2) # b t h w c -> b c h w + save_image( + videos / 255.0, + os.path.join( + args.save_img_path, + prompt.replace(' ', '_')[:100] + + f'{args.sample_method}_gs{args.guidance_scale}_s{args.num_sampling_steps}.{ext}', + ), + nrow=1, normalize=True, value_range=(0, 1) + ) # t c h w + else: + imageio.mimwrite( + os.path.join( + args.save_img_path, + f'sample_{idx}_{args.sample_method}_gs{args.guidance_scale}_s{args.num_sampling_steps}.{ext}' + ), videos[0], + fps=args.fps, quality=9) # highest quality is 10, lowest is 0 + logging.info('Saving sample_%d for %s %d steps success!!!', \ + idx, args.sample_method, args.num_sampling_steps) + except IOError as e: + logging.error('Error when saving sample_%d for %s %d steps for %s!!!', \ + idx, args.sample_method, args.num_sampling_steps, prompt) + sys.exit('An error occured and the program will exit.') + + logging.info("time_list: %s", time_list) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--model_path", type=str, default='/data1/models/Open-Sora-Plan-v1.0.0') + parser.add_argument("--save_img_path", type=str, default="./sample_videos/t2v") + parser.add_argument("--guidance_scale", type=float, default=7.5) + parser.add_argument("--sample_method", type=str, default="PNDM") + parser.add_argument("--num_sampling_steps", type=int, default=250) + parser.add_argument("--image_size", type=int, default=512) + parser.add_argument("--fps", type=int, default=24) + parser.add_argument("--run_time", type=int, default=0) + parser.add_argument("--seed", type=int, default=2333) + parser.add_argument("--vae_stride", type=int, default=8) + parser.add_argument("--cache_config", type=str, default="5,27,5,2") + parser.add_argument('--use_cache', action='store_true') + parser.add_argument("--cfg_last_step", type=int, default=10000) + parser.add_argument("--text_prompt", nargs='+') + parser.add_argument('--force_images', action='store_true') + parser.add_argument('--sequence_parallel_size', type=int, default=1) + args_input = parser.parse_args() + + if not os.path.exists(args_input.model_path): + logging.warning('WARNING:wrong model_path given !!!') + sys.exit('An error occured and the program will exit.') + + main(args_input) diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/__init__.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..ee4d85ed3794ec8465a3e4c5977ddd1281712aac --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/__init__.py @@ -0,0 +1,40 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
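+# A minimal usage sketch of this package, mirroring tests/inference_opensora_plan.py; the model
+# path is a placeholder and the distributed setup (colossalai launch / set_parallel_manager) done
+# by that script is omitted here:
+#
+#     import torch
+#     from opensoraplan import (OpenSoraPlanPipeline, compile_pipe, get_scheduler,
+#                               CacheConfig, OpenSoraPlanDiTCacheManager)
+#
+#     scheduler = get_scheduler("PNDM")
+#     pipe = OpenSoraPlanPipeline.from_pretrained(model_path="./Open-Sora-Plan-v1.0.0",
+#                                                 image_size=512, scheduler=scheduler,
+#                                                 dtype=torch.float16, vae_stride=8)
+#     cache_manager = OpenSoraPlanDiTCacheManager(CacheConfig(5, 27, 5, 2, True))
+#     pipe = compile_pipe(pipe, cache_manager, 150)
+#     video = pipe("a cat playing piano", num_inference_steps=250, guidance_scale=7.5,
+#                  enable_temporal_attentions=True, num_images_per_prompt=1).video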
+ + +__all__ = [ + 'ConfigMixin', + 'compile_pipe', + 'LatteT2V', + 'LatteParams', + 'CausalVAEModelWrapper', + 'OpenSoraPlanPipeline', + 'set_parallel_manager', + 'get_scheduler', + 'CacheConfig', + 'OpenSoraPlanDiTCacheManager' +] + +from .config_utils import ConfigMixin +from .pipeline.compile_pipe import compile_pipe + +from .models.latte.modeling_latte import LatteT2V, LatteParams +from .models.causalvae.modeling_causalvae import CausalVAEModelWrapper +from .pipeline.open_sora_plan_pipeline import OpenSoraPlanPipeline +from .models.parallel_mgr import set_parallel_manager +from .schedulers.scheduler_optimizer import get_scheduler +from .acceleration.dit_cache_common import CacheConfig +from .acceleration.open_sora_plan_dit_cache import OpenSoraPlanDiTCacheManager diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/acceleration/__init__.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/acceleration/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..1cc6b36093052f0d9bf22369e10ab21feeffa069 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/acceleration/__init__.py @@ -0,0 +1,18 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .dit_cache_common import CacheConfig +from .open_sora_plan_dit_cache import OpenSoraPlanDiTCacheManager \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/acceleration/dit_cache_common.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/acceleration/dit_cache_common.py new file mode 100644 index 0000000000000000000000000000000000000000..3c89f2fa58ac71eaec0fb8f4e8364cd7976b16c6 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/acceleration/dit_cache_common.py @@ -0,0 +1,133 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
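+# DiT-Cache in a nutshell: once denoising reaches `start_step`, the contribution of transformer
+# blocks [start_block_idx, end_block_idx) is fully computed only every `step_interval` steps and
+# stored as a residual; on the steps in between, those blocks are skipped and the stored residual
+# is re-applied to the hidden states (see DiTCacheManager.__call__ below).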
+
+from typing import Optional, List, Union
+from dataclasses import dataclass
+import torch
+import torch.nn as nn
+
+
+@dataclass
+class CacheConfig:
+    start_block_idx: int = 0
+    end_block_idx: int = 0
+    start_step: int = 0
+    step_interval: int = 2
+    use_cache: bool = False
+    use_cache_encoder: bool = False
+
+    def __post_init__(self):
+        if not isinstance(self.start_block_idx, int):
+            raise TypeError(f"Expected int for start_block_idx, but got {type(self.start_block_idx).__name__}")
+        if not isinstance(self.end_block_idx, int):
+            raise TypeError(f"Expected int for end_block_idx, but got {type(self.end_block_idx).__name__}")
+        if not isinstance(self.start_step, int):
+            raise TypeError(f"Expected int for start_step, but got {type(self.start_step).__name__}")
+        if not isinstance(self.step_interval, int):
+            raise TypeError(f"Expected int for step_interval, but got {type(self.step_interval).__name__}")
+        if not isinstance(self.use_cache, bool):
+            raise TypeError(f"Expected bool for use_cache, but got {type(self.use_cache).__name__}")
+        if not isinstance(self.use_cache_encoder, bool):
+            raise TypeError(f"Expected bool for use_cache_encoder, but got {type(self.use_cache_encoder).__name__}")
+
+
+class DiTCacheManager:
+    def __init__(
+            self,
+            cache_config: CacheConfig
+    ):
+        """
+        DiT-Cache plugin for DiT models. Use this class to enable block caching quickly.
+        Args:
+            cache_config: (`CacheConfig`) with the fields:
+                start_block_idx: (`int`)
+                    The index of the block where caching starts.
+                end_block_idx: (`int`)
+                    The index of the block where caching ends.
+                start_step: (`int`)
+                    The index of the DiT denoising step where caching starts.
+                step_interval: (`int`)
+                    Interval of caching steps for DiT denoising.
+                use_cache_encoder: (`bool`)
+                    Whether the DiT models need to compute encoder_hidden_states.
+        """
+        self.start_block_idx = cache_config.start_block_idx
+        self.end_block_idx = cache_config.end_block_idx
+        self.start_step = cache_config.start_step
+        self.step_interval = cache_config.step_interval
+        self.use_cache = cache_config.use_cache
+        self.use_cache_encoder = cache_config.use_cache_encoder
+
+        self.cache_hidden_states = None
+        self.cache_encoder_hidden_states = None
+
+        if self.start_block_idx > self.end_block_idx:
+            raise ValueError("start_block_idx should not be larger than end_block_idx")
+
+    def __call__(
+            self,
+            current_step: int,
+            block_list: Union[List[nn.ModuleList], List[List[nn.ModuleList]]],
+            hidden_states: torch.Tensor,
+            encoder_hidden_states: Optional[torch.Tensor] = None,
+            **kwargs
+    ):
+        # If current_step is less than cache start step, execute all blocks sequentially.
+        if (current_step < self.start_step) or (not self.use_cache):
+            for blocks in zip(*block_list):
+                hidden_states, encoder_hidden_states = self._forward_blocks(blocks, hidden_states,
+                                                                            encoder_hidden_states, **kwargs)
+        # go into cache step interval
+        else:
+            cache_hidden_states = torch.zeros_like(hidden_states)
+            cache_encoder_hidden_states = torch.zeros_like(encoder_hidden_states)
+            for block_idx, blocks in enumerate(zip(*block_list)):
+                # when current_step is exactly on the step_interval, compute and record the cache.
+                if current_step % self.step_interval == self.start_step % self.step_interval:
+                    # record the tensor before DiT denoising.
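+                    # The cache stores the *difference* (block output minus block input) rather than
+                    # the raw activations, so it can still be re-applied on later, skipped steps even
+                    # though the absolute hidden states keep changing between denoising steps.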
+ if block_idx == self.start_block_idx: + cache_hidden_states = hidden_states.clone() + if self.use_cache_encoder: + cache_encoder_hidden_states = encoder_hidden_states.clone() + + hidden_states, encoder_hidden_states = self._forward_blocks(blocks, hidden_states, + encoder_hidden_states, **kwargs) + # cache the denoising difference. + if block_idx == (self.end_block_idx - 1): + self.cache_hidden_states = hidden_states - cache_hidden_states + if self.use_cache_encoder: + self.cache_encoder_hidden_states = encoder_hidden_states - cache_encoder_hidden_states + else: + # if block_idx is not in the interval using cache, execute all blocks sequentially. + if block_idx < self.start_block_idx or block_idx >= self.end_block_idx: + hidden_states, encoder_hidden_states = self._forward_blocks(blocks, hidden_states, + encoder_hidden_states, **kwargs) + # skip intermediate steps until the end_block_idx, overlay the cached denoising difference. + elif block_idx == (self.end_block_idx - 1): + hidden_states += self.cache_hidden_states + if self.use_cache_encoder: + encoder_hidden_states += self.cache_encoder_hidden_states + + return hidden_states + + def _forward_blocks(self, blocks, hidden_states, encoder_hidden_states, **kwargs): + for block in blocks: + results = block(hidden_states, encoder_hidden_states=encoder_hidden_states, **kwargs) + if self.use_cache_encoder: + hidden_states, encoder_hidden_states = results + else: + hidden_states = results + + return hidden_states, encoder_hidden_states diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/acceleration/open_sora_plan_dit_cache.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/acceleration/open_sora_plan_dit_cache.py new file mode 100644 index 0000000000000000000000000000000000000000..2a00579c770d79d6cd7d16695668e1ce22fe0b40 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/acceleration/open_sora_plan_dit_cache.py @@ -0,0 +1,166 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
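+# OpenSoraPlan-specific DiT-Cache manager: Latte interleaves paired spatial and temporal blocks, so
+# this subclass walks both block lists together, keeps all_block_num / cal_block_num statistics for
+# logging the skip ratio, and performs the all-to-all switch between spatial and temporal layouts
+# when sequence parallelism is enabled.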
+ +from typing import List, Union +from einops import rearrange, repeat +import torch +import torch.nn as nn +from opensoraplan.models.comm import ( + all_to_all_with_pad, + get_spatial_pad, + get_temporal_pad, +) +from opensoraplan.models.parallel_mgr import ( + get_sequence_parallel_group, + use_sequence_parallel +) +from .dit_cache_common import CacheConfig, DiTCacheManager + +SLICE_TEMPORAL_PATTERN = '(b T) S d -> b T S d' +CHANGE_TF_PATTERN = '(b t) f d -> (b f) t d' + + +class OpenSoraPlanDiTCacheManager(DiTCacheManager): + def __init__( + self, + cache_config: CacheConfig, + ): + if not isinstance(cache_config, CacheConfig): + raise TypeError(f"Expected CacheConfig for cache_config, but got {type(cache_config).__name__}") + super().__init__(cache_config) + self.temp_pos_embed = None + self.all_block_num = 0 + self.cal_block_num = 0 + self.delta_cache = None + + def __call__( + self, + current_step: int, + block_list: Union[List[nn.ModuleList], List[List[nn.ModuleList]]], + hidden_states: torch.Tensor, + **kwargs + ): + num_blocks = len(block_list[0]) + if self.start_block_idx < 0 or self.start_block_idx > num_blocks: + raise ValueError("start_block_idx is invalid, out of range [0, num_blocks]") + if self.end_block_idx < 0 or self.end_block_idx > num_blocks: + raise ValueError("end_block_idx is invalid, out of range [0, num_blocks]") + # If current_step is less than cache start step, execute all blocks sequentially. + if (current_step < self.start_step) or (not self.use_cache): + self.all_block_num += (num_blocks * 2) + hidden_states = self._forward_blocks(0, num_blocks, block_list, hidden_states, **kwargs) + # go into cache step interval + else: + self.all_block_num += (num_blocks * 2) + # infer [0, start_block_idx) + hidden_states = self._forward_blocks(0, self.start_block_idx, block_list, hidden_states, **kwargs) + # infer [start_block_idx, end_block_idx) + hidden_states_before_cache = hidden_states.clone() + if current_step % self.step_interval == self.start_step % self.step_interval: + hidden_states = self._forward_blocks(self.start_block_idx, self.end_block_idx, block_list, + hidden_states, **kwargs) + self.delta_cache = hidden_states - hidden_states_before_cache + else: + if self.delta_cache.shape == hidden_states_before_cache.shape: + hidden_states = hidden_states_before_cache + self.delta_cache + else: + hidden_states = self._forward_blocks(self.start_block_idx, self.end_block_idx, block_list, + hidden_states, **kwargs) + hidden_states = self._forward_blocks(self.end_block_idx, num_blocks, block_list, + hidden_states, **kwargs) + + return hidden_states + + def _forward_blocks(self, start_idx, end_idx, block_list, hidden_states, **kwargs): + attention_mask = kwargs.get("attention_mask") + encoder_hidden_states_spatial = kwargs.get("encoder_hidden_states_spatial") + encoder_attention_mask = kwargs.get("encoder_attention_mask") + timestep_spatial = kwargs.get("timestep_spatial") + timestep_temp = kwargs.get("timestep_temp") + cross_attention_kwargs = kwargs.get("cross_attention_kwargs") + class_labels = kwargs.get("class_labels") + input_batch_size = kwargs.get("input_batch_size") + enable_temporal_attentions = kwargs.get("enable_temporal_attentions") + t_dim = kwargs.get("t_dim") + s_dim = kwargs.get("s_dim") + timestep = kwargs.get("timestep") + for i, (spatial_block, temp_block) in enumerate( + zip(block_list[0][start_idx:end_idx], block_list[1][start_idx:end_idx])): + self.cal_block_num += input_batch_size + hidden_states = spatial_block( + hidden_states, + attention_mask, 
+ encoder_hidden_states_spatial, + encoder_attention_mask, + timestep_spatial, + cross_attention_kwargs, + class_labels, + ) + + if enable_temporal_attentions: + if use_sequence_parallel(): + hidden_states = rearrange(hidden_states, SLICE_TEMPORAL_PATTERN, T=t_dim, + S=s_dim).contiguous() + hidden_states, s_dim, t_dim = self._dynamic_switch(hidden_states, s_dim, t_dim, + temporal_to_spatial=True) + timestep_temp = repeat(timestep, 'b d -> (b p) d', p=s_dim).contiguous() + + # b c f h w, f = 16 + 4 + hidden_states = rearrange(hidden_states, '(b T) S d -> (b S) T d', b=input_batch_size).contiguous() + + if start_idx + i == 0: + hidden_states = hidden_states + self.temp_pos_embed + + hidden_states = temp_block( + hidden_states, + None, # attention_mask + None, # encoder_hidden_states + None, # encoder_attention_mask + timestep_temp, + cross_attention_kwargs, + class_labels, + ) + + hidden_states = rearrange(hidden_states, CHANGE_TF_PATTERN, + b=input_batch_size).contiguous() + if use_sequence_parallel(): + hidden_states = rearrange(hidden_states, SLICE_TEMPORAL_PATTERN, T=t_dim, + S=s_dim).contiguous() + hidden_states, s_dim, t_dim = self._dynamic_switch(hidden_states, s_dim, t_dim, + temporal_to_spatial=False) + return hidden_states + + def _dynamic_switch(self, x, s, t, temporal_to_spatial: bool): + if temporal_to_spatial: + scatter_dim, gather_dim = 2, 1 + scatter_pad = get_spatial_pad() + gather_pad = get_temporal_pad() + else: + scatter_dim, gather_dim = 1, 2 + scatter_pad = get_temporal_pad() + gather_pad = get_spatial_pad() + + x = all_to_all_with_pad( + x, + get_sequence_parallel_group(), + scatter_dim=scatter_dim, + gather_dim=gather_dim, + scatter_pad=scatter_pad, + gather_pad=gather_pad, + ) + new_s, new_t = x.shape[2], x.shape[1] + x = rearrange(x, "b t s d -> (b t) s d") + return x, new_s, new_t diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/config_utils.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/config_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..02cfc2e4ee85979b0f19eb61200782fe11b2198d --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/config_utils.py @@ -0,0 +1,70 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright(C) 2024. Huawei Technologies Co.,Ltd. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License + +import os +import json +import inspect +from typing import Dict, Tuple +from .utils.log import logger + + +class ConfigMixin: + config_name = None + + @classmethod + def load_config(cls, model_path, **kwargs) -> Tuple[Dict, Dict]: + if cls.config_name is None: + logger.error("config_name is not defined.") + raise ValueError("config_name is not defined.") + + if model_path is None: + logger.error("model_path must not be None") + raise ValueError("model_path must not be None") + + model_path = os.path.abspath(model_path) + config_path = os.path.join(model_path, cls.config_name) + if not (os.path.exists(config_path) and os.path.isfile(config_path)): + logger.error("%s is not found in %s!", cls.config_name, model_path) + raise ValueError("%s is not found in %s!" % (cls.config_name, model_path)) + + config_dict = _load_json_dict(config_path) + + # get all required parameters + all_parameters = inspect.signature(cls.__init__).parameters + + init_keys = set(dict(all_parameters)) + init_keys.remove("self") + if 'kwargs' in init_keys: + init_keys.remove('kwargs') + + init_dict = {} + for key in init_keys: + # if key in config, use config + if key in config_dict: + init_dict[key] = config_dict.pop(key) + # if key in kwargs, use kwargs, this may rewrite config_dict + if key in kwargs: + init_dict[key] = kwargs.pop(key) + in_keys = set(init_dict.keys()) + if len(init_keys - in_keys) > 0: + logger.warning("%s was not found in config and kwargs! Use default values.", init_keys - in_keys) + return init_dict, config_dict + + +def _load_json_dict(config_path): + with open(config_path, "r", encoding="utf-8") as reader: + data = reader.read() + return json.loads(data, strict=False) \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/__init__.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..59c2de987eb34731436e22c19f62e4a87615b00b --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/__init__.py @@ -0,0 +1,33 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from .attention import ( + AttnBlock3D, + AttnBlock, + LinAttnBlock, + LinearAttention, +) +from .conv import CausalConv3d, Conv2d +from .resnet_block import ResnetBlock2D, ResnetBlock3D +from .ops import nonlinearity, normalize +from .updownsample import ( + SpatialDownsample2x, + SpatialUpsample2x, + TimeDownsample2x, + TimeUpsample2x, + TimeDownsampleRes2x, + TimeUpsampleRes2x, +) \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/attention.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/attention.py new file mode 100644 index 0000000000000000000000000000000000000000..0d49b682dde1efceedc6ff1bc278b2e4d76ba112 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/attention.py @@ -0,0 +1,157 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import torch.nn as nn +import torch +from einops import rearrange +from opensoraplan.utils.log import logger +from .ops import video_to_image, normalize +from .conv import CausalConv3d + +ATTN_TYPE_VANILLA = "vanilla" +ATTN_TYPE_VANILLA3D = "vanilla3D" +ATTN_TYPE_LINEAR = "linear" +ATTN_TYPE_NONE = "none" + + +class LinearAttention(nn.Module): + def __init__(self, dim, heads=4, dim_head=32): + super().__init__() + self.heads = heads + hidden_dim = dim_head * heads + self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False) + self.to_out = nn.Conv2d(hidden_dim, dim, 1) + + def forward(self, x): + b, c, h, w = x.shape + qkv = self.to_qkv(x) + q, k, v = rearrange( + qkv, "b (qkv heads c) h w -> qkv b heads c (h w)", heads=self.heads, qkv=3 + ) + k = k.softmax(dim=-1) + context = torch.einsum("bhdn,bhen->bhde", k, v) + out = torch.einsum("bhde,bhdn->bhen", context, q) + out = rearrange( + out, "b heads c (h w) -> b (heads c) h w", heads=self.heads, h=h, w=w + ) + return self.to_out(out) + + +class LinAttnBlock(LinearAttention): + """to match AttnBlock usage""" + + def __init__(self, in_channels): + super().__init__(dim=in_channels, heads=1, dim_head=in_channels) + + +class AttnBlock3D(nn.Module): + """Compatible with old versions, there are issues, use with caution.""" + + def __init__(self, in_channels): + super().__init__() + self.in_channels = in_channels + + self.norm = normalize(in_channels) + self.q = CausalConv3d(in_channels, in_channels, kernel_size=1, stride=1) + self.k = CausalConv3d(in_channels, in_channels, kernel_size=1, stride=1) + self.v = CausalConv3d(in_channels, in_channels, kernel_size=1, stride=1) + self.proj_out = CausalConv3d(in_channels, in_channels, kernel_size=1, stride=1) + + def forward(self, x): + h_ = x + h_ = self.norm(h_) + q = self.q(h_) + k = self.k(h_) + v = self.v(h_) + + # compute attention + b, c, t, h, w = q.shape + q = q.reshape(b * t, c, h * w) + q = q.permute(0, 2, 1) # b,hw,c + k = k.reshape(b * t, c, h * w) # b,c,hw + w_ = torch.bmm(q, k) # b,hw,hw w[b,i,j]=sum_c q[b,i,c]k[b,c,j] + w_ = w_ * (int(c) ** (-0.5)) + w_ = torch.nn.functional.softmax(w_, 
dim=2) + + # attend to values + v = v.reshape(b * t, c, h * w) + w_ = w_.permute(0, 2, 1) # b,hw,hw (first hw of k, second of q) + h_ = torch.bmm(v, w_) # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j] + h_ = h_.reshape(b, c, t, h, w) + + h_ = self.proj_out(h_) + + return x + h_ + + +class AttnBlock(nn.Module): + def __init__(self, in_channels): + super().__init__() + self.in_channels = in_channels + + self.norm = normalize(in_channels) + self.q = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) + self.k = torch.nn.Conv2d( + in_channels, in_channels, kernel_size=1, stride=1, padding=0 + ) + self.v = torch.nn.Conv2d( + in_channels, in_channels, kernel_size=1, stride=1, padding=0 + ) + self.proj_out = torch.nn.Conv2d( + in_channels, in_channels, kernel_size=1, stride=1, padding=0 + ) + + @video_to_image + def forward(self, x): + h_ = x + h_ = self.norm(h_) + q = self.q(h_) + k = self.k(h_) + v = self.v(h_) + + # compute attention + b, c, h, w = q.shape + q = q.reshape(b, c, h * w) + q = q.permute(0, 2, 1) # b,hw,c + k = k.reshape(b, c, h * w) # b,c,hw + w_ = torch.bmm(q, k) # b,hw,hw w[b,i,j]=sum_c q[b,i,c]k[b,c,j] + w_ = w_ * (int(c) ** (-0.5)) + w_ = torch.nn.functional.softmax(w_, dim=2) + + # attend to values + v = v.reshape(b, c, h * w) + w_ = w_.permute(0, 2, 1) # b,hw,hw (first hw of k, second of q) + h_ = torch.bmm(v, w_) # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j] + h_ = h_.reshape(b, c, h, w) + + h_ = self.proj_out(h_) + + return x + h_ + + +def make_attn(in_channels, attn_type=ATTN_TYPE_VANILLA): + if attn_type not in [ATTN_TYPE_VANILLA, ATTN_TYPE_LINEAR, ATTN_TYPE_NONE, ATTN_TYPE_VANILLA3D]: + logger.error(f"attn_type {attn_type} unknown") + raise ValueError + logger.info(f"making attention of type '{attn_type}' with {in_channels} in_channels") + if attn_type == "vanilla": + return AttnBlock(in_channels) + elif attn_type == "vanilla3D": + return AttnBlock3D(in_channels) + elif attn_type == "none": + return nn.Identity(in_channels) + else: + return LinAttnBlock(in_channels) diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/conv.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/conv.py new file mode 100644 index 0000000000000000000000000000000000000000..546a8a4bce8f44539619900f790c834d11c129c4 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/conv.py @@ -0,0 +1,137 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
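+# CausalConv3d makes the temporal convolution causal by repeating the first frame
+# (time_kernel_size - 1) times in front of the clip before the Conv3d, so no output frame depends
+# on future frames; this also lets a single image be handled as a one-frame video.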
+ +import os +import math +from typing import Union, Tuple +import torch.nn as nn +import torch.nn.functional as F +import torch +from opensoraplan.utils.log import logger +from .ops import cast_tuple +from .ops import video_to_image + + +class Conv2d(nn.Conv2d): + def __init__( + self, + in_channels: int, + out_channels: int, + kernel_size: Union[int, Tuple[int]] = 3, + stride: Union[int, Tuple[int]] = 1, + padding: Union[str, int, Tuple[int]] = 0, + dilation: Union[int, Tuple[int]] = 1, + groups: int = 1, + bias: bool = True, + padding_mode: str = "zeros", + device=None, + dtype=None, + ) -> None: + super().__init__( + in_channels, + out_channels, + kernel_size, + stride, + padding, + dilation, + groups, + bias, + padding_mode, + device, + dtype, + ) + + @video_to_image + def forward(self, x): + return super().forward(x) + + +class CausalConv3d(nn.Module): + def __init__( + self, chan_in, chan_out, kernel_size: Union[int, Tuple[int, int, int]], init_method="random", **kwargs + ): + super().__init__() + self.kernel_size = cast_tuple(kernel_size, 3) + self.time_kernel_size = self.kernel_size[0] + self.chan_in = chan_in + self.chan_out = chan_out + stride = kwargs.pop("stride", 1) + padding = kwargs.pop("padding", 0) + padding = list(cast_tuple(padding, 3)) + padding[0] = 0 + stride = cast_tuple(stride, 3) + self.conv = nn.Conv3d(chan_in, chan_out, self.kernel_size, stride=stride, padding=padding) + self._init_weights(init_method) + self.embed_dim = self.chan_out + self.patch_size = self.kernel_size + self.stride = stride + self.padding = padding + + def forward(self, x): + # 1 + 16 16 as video, 1 as image + first_frame_pad = x[:, :, :1, :, :].repeat( + (1, 1, self.time_kernel_size - 1, 1, 1) + ) # b c t h w + x = torch.concatenate((first_frame_pad, x), dim=2) + + def generate_random_conv3d_output(input_shape, out_channels, kernel_size, stride, padding): + n, _, d, h, w = input_shape + k_d, k_h, k_w = kernel_size + s_d, s_h, s_w = stride + p_d, p_h, p_w = padding + + d_out = math.floor((d + 2 * p_d - k_d) / s_d + 1) + h_out = math.floor((h + 2 * p_h - k_h) / s_h + 1) + w_out = math.floor((w + 2 * p_w - k_w) / s_w + 1) + + output_shape = (n, out_channels, d_out, h_out, w_out) + return torch.rand(output_shape, dtype=x.dtype, device=x.device) + + + return self.conv(x) + + def _init_weights(self, init_method): + ks = torch.tensor(self.kernel_size) + if init_method == "avg": + if not (self.kernel_size[1] == 1 and self.kernel_size[2] == 1): + logger.error("only support temporal up/down sample") + raise ValueError + if self.chan_in != self.chan_out: + logger.error("chan_in must be equal to chan_out") + raise ValueError + weight = torch.zeros((self.chan_out, self.chan_in, *self.kernel_size)) + + eyes = torch.concat( + [ + torch.eye(self.chan_in).unsqueeze(-1) * 1 / 3, + torch.eye(self.chan_in).unsqueeze(-1) * 1 / 3, + torch.eye(self.chan_in).unsqueeze(-1) * 1 / 3, + ], + dim=-1, + ) + weight[:, :, :, 0, 0] = eyes + + self.conv.weight = nn.Parameter( + weight, + requires_grad=True, + ) + elif init_method == "zero": + self.conv.weight = nn.Parameter( + torch.zeros((self.chan_out, self.chan_in, *self.kernel_size)), + requires_grad=True, + ) + if self.conv.bias is not None: + nn.init.constant_(self.conv.bias, 0) \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/ops.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/ops.py new file mode 100644 index 0000000000000000000000000000000000000000..b8e0ecfc628c45bc4fea9f36e28dc72463fcf91a --- /dev/null +++ 
b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/ops.py @@ -0,0 +1,43 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import torch +from einops import rearrange + + +def video_to_image(func): + def wrapper(self, x, *args, **kwargs): + if x.dim() == 5: + t = x.shape[2] + x = rearrange(x, "b c t h w -> (b t) c h w") + x = func(self, x, *args, **kwargs) + x = rearrange(x, "(b t) c h w -> b c t h w", t=t) + return x + return wrapper + + +def nonlinearity(x): + return x * torch.sigmoid(x) + + +def cast_tuple(t, length=1): + return t if isinstance(t, tuple) else ((t,) * length) + + +def normalize(in_channels, num_groups=32): + return torch.nn.GroupNorm( + num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True + ) \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/resnet_block.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/resnet_block.py new file mode 100644 index 0000000000000000000000000000000000000000..59aaed288b610def6e209456ca0c8b01f61f23f9 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/resnet_block.py @@ -0,0 +1,101 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
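+# ResnetBlock2D runs per frame (the @video_to_image decorator folds the time axis into the batch),
+# while ResnetBlock3D uses CausalConv3d so features can also mix along the time axis.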
+ +import torch +import torch.nn as nn +from .ops import nonlinearity, video_to_image, normalize +from .conv import CausalConv3d + + +class ResnetBlock2D(nn.Module): + def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False, + dropout): + super().__init__() + self.in_channels = in_channels + self.out_channels = in_channels if out_channels is None else out_channels + self.use_conv_shortcut = conv_shortcut + + self.norm1 = normalize(in_channels) + self.conv1 = torch.nn.Conv2d( + in_channels, out_channels, kernel_size=3, stride=1, padding=1 + ) + self.norm2 = normalize(out_channels) + self.dropout = torch.nn.Dropout(dropout) + self.conv2 = torch.nn.Conv2d( + out_channels, out_channels, kernel_size=3, stride=1, padding=1 + ) + if self.in_channels != self.out_channels: + if self.use_conv_shortcut: + self.conv_shortcut = torch.nn.Conv2d( + in_channels, out_channels, kernel_size=3, stride=1, padding=1 + ) + else: + self.nin_shortcut = torch.nn.Conv2d( + in_channels, out_channels, kernel_size=1, stride=1, padding=0 + ) + + @video_to_image + def forward(self, x): + h = x + h = self.norm1(h) + h = nonlinearity(h) + h = self.conv1(h) + h = self.norm2(h) + h = nonlinearity(h) + h = self.dropout(h) + h = self.conv2(h) + if self.in_channels != self.out_channels: + if self.use_conv_shortcut: + x = self.conv_shortcut(x) + else: + x = self.nin_shortcut(x) + x = x + h + return x + + +class ResnetBlock3D(nn.Module): + def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False, dropout): + super().__init__() + self.in_channels = in_channels + self.out_channels = in_channels if out_channels is None else out_channels + self.use_conv_shortcut = conv_shortcut + + self.norm1 = normalize(in_channels) + self.conv1 = CausalConv3d(in_channels, out_channels, 3, padding=1) + self.norm2 = normalize(out_channels) + self.dropout = torch.nn.Dropout(dropout) + self.conv2 = CausalConv3d(out_channels, out_channels, 3, padding=1) + if self.in_channels != self.out_channels: + if self.use_conv_shortcut: + self.conv_shortcut = CausalConv3d(in_channels, out_channels, 3, padding=1) + else: + self.nin_shortcut = CausalConv3d(in_channels, out_channels, 1, padding=0) + + def forward(self, x): + h = x + h = self.norm1(h) + h = nonlinearity(h) + h = self.conv1(h) + h = self.norm2(h) + h = nonlinearity(h) + h = self.dropout(h) + h = self.conv2(h) + if self.in_channels != self.out_channels: + if self.use_conv_shortcut: + x = self.conv_shortcut(x) + else: + x = self.nin_shortcut(x) + return x + h \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/updownsample.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/updownsample.py new file mode 100644 index 0000000000000000000000000000000000000000..740020931e86b12271c0ed5e380af059dfd35968 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/updownsample.py @@ -0,0 +1,201 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from typing import Union, Tuple +import torch +import torch.nn as nn +import torch.nn.functional as F +from einops import rearrange +from .ops import cast_tuple +from .conv import CausalConv3d + + +class SpatialDownsample2x(nn.Module): + def __init__( + self, + chan_in, + chan_out, + kernel_size: Union[int, Tuple[int]] = (3, 3), + stride: Union[int, Tuple[int]] = (2, 2), + ): + super().__init__() + kernel_size = cast_tuple(kernel_size, 2) + stride = cast_tuple(stride, 2) + self.chan_in = chan_in + self.chan_out = chan_out + self.kernel_size = kernel_size + self.conv = CausalConv3d( + self.chan_in, + self.chan_out, + (1,) + self.kernel_size, + stride=(1,) + stride, + padding=0 + ) + + def forward(self, x): + pad = (0, 1, 0, 1, 0, 0) + x = torch.nn.functional.pad(x, pad, mode="constant", value=0) + x = self.conv(x) + return x + + +def mock_interpolate(input_tensor, scale_factor, mode="nearest"): + """ + 仿真 interpolate函数,返回一个具有正确shape 的随机数张量。 + + :param input_tensor: 输入张量(可以使三维或者四维) + :param scale_factor: 缩放因子,元组形式 (height_scale, width_scale) 或者单个数值 + :param mode: 插值模式(仅用于兼容性, 不影响仿真的输出) + :return: 具有预期形状的随机数张量 + """ + # 获取输入张量的形状 + input_shape = input_tensor.shape + ndim = len(input_shape) + + if ndim == 4: + # 四维张量(batch_size, channels, height, width) + batch_size, channels, height, width = input_shape + output_height = int(height * scale_factor[0]) + output_width = int(width * scale_factor[1]) + output_shape = (batch_size, channels, output_height, output_width) + elif ndim == 3: + # 三维张量(batch_size, length, feature_dim) + batch_size, length, feature_dim = input_shape + if isinstance(scale_factor, tuple): + output_length = int(length * scale_factor[0]) + else: + output_length = int(length * scale_factor) + output_shape = (batch_size, output_length, feature_dim) + else: + raise ValueError("仅支持三维或四维张量") + + # 创建一个具有预期形状的随机数张量 + output_tensor = torch.randn(output_shape) + + return output_tensor + + +class SpatialUpsample2x(nn.Module): + def __init__( + self, + chan_in, + chan_out, + kernel_size: Union[int, Tuple[int]] = (3, 3), + stride: Union[int, Tuple[int]] = (1, 1), + ): + super().__init__() + self.chan_in = chan_in + self.chan_out = chan_out + self.kernel_size = kernel_size + self.conv = CausalConv3d( + self.chan_in, + self.chan_out, + (1,) + self.kernel_size, + stride=(1,) + stride, + padding=1 + ) + + def forward(self, x): + t = x.shape[2] + x = rearrange(x, "b c t h w -> b (c t) h w") + x = F.interpolate(x, scale_factor=(2, 2), mode="nearest") + x = rearrange(x, "b (c t) h w -> b c t h w", t=t) + x = self.conv(x) + return x + + +class TimeDownsample2x(nn.Module): + def __init__( + self, + chan_in, + chan_out, + kernel_size: int = 3 + ): + super().__init__() + self.kernel_size = kernel_size + self.conv = nn.AvgPool3d((kernel_size, 1, 1), stride=(2, 1, 1)) + + def forward(self, x): + first_frame_pad = x[:, :, :1, :, :].repeat( + (1, 1, self.kernel_size - 1, 1, 1) + ) + x = torch.concatenate((first_frame_pad, x), dim=2) + return self.conv(x) + + +class TimeUpsample2x(nn.Module): + def __init__( + self, + chan_in, + chan_out + ): + super().__init__() + + def forward(self, x): + if x.size(2) > 1: + x, x_ = x[:, :, :1], x[:, :, 1:] + x_ = F.interpolate(x_, scale_factor=(2, 1, 1), mode='trilinear') + x = torch.concat([x, x_], dim=2) + return x + + +class TimeDownsampleRes2x(nn.Module): + def __init__( + self, + in_channels, + out_channels, + kernel_size: int = 3, + mix_factor: 
float = 2, + ): + super().__init__() + self.kernel_size = cast_tuple(kernel_size, 3) + self.avg_pool = nn.AvgPool3d((kernel_size, 1, 1), stride=(2, 1, 1)) + self.conv = nn.Conv3d( + in_channels, out_channels, self.kernel_size, stride=(2, 1, 1), padding=(0, 1, 1) + ) + self.mix_factor = torch.nn.Parameter(torch.Tensor([mix_factor])) + + def forward(self, x): + alpha = torch.sigmoid(self.mix_factor) + first_frame_pad = x[:, :, :1, :, :].repeat( + (1, 1, self.kernel_size[0] - 1, 1, 1) + ) + x = torch.concatenate((first_frame_pad, x), dim=2) + return alpha * self.avg_pool(x) + (1 - alpha) * self.conv(x) + + +class TimeUpsampleRes2x(nn.Module): + def __init__( + self, + in_channels, + out_channels, + kernel_size: int = 3, + mix_factor: float = 2, + ): + super().__init__() + self.conv = CausalConv3d( + in_channels, out_channels, kernel_size, padding=1 + ) + self.mix_factor = torch.nn.Parameter(torch.Tensor([mix_factor])) + + def forward(self, x): + alpha = torch.sigmoid(self.mix_factor) + if x.size(2) > 1: + x, x_ = x[:, :, :1], x[:, :, 1:] + x_ = F.interpolate(x_, scale_factor=(2, 1, 1), mode='trilinear') + x = torch.concat([x, x_], dim=2) + return alpha * x + (1 - alpha) * self.conv(x) diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/utils.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..e71a9475464da84cc40949530fdbea69befbcc9e --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/layers/utils.py @@ -0,0 +1,145 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
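+# einops-free rearrange helpers for the (B, T*S, C) token layout used by Latte, plus sin-cos
+# positional-embedding utilities. For example, 'B (T S) C -> (B S) T C' regroups the tokens into
+# T temporal positions per spatial location and folds the S spatial locations into the batch so
+# the temporal blocks can attend over time.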
+ +import importlib +import torch +import torch.nn as nn +import numpy as np + + +def rearrange(x: torch.Tensor, datatype, t, s): + rearrange_function = { + 'B (T S) C -> (B S) T C': rearrange_b_ts_c_2_bs_t_c, + '(B S) T C -> B (T S) C': rearrange_bs_t_c_2_b_ts_c, + 'B (T S) C -> (B T) S C': rearrange_b_ts_c_2_bt_s_c, + '(B T) S C -> B (T S) C': rearrange_bt_s_c_2_b_ts_c, + 'B (T S) C -> B T S C': rearrange_b_ts_c_2_b_t_s_c, + 'B T S C -> B (T S) C': rearrange_b_t_s_c_2_b_ts_c, + } + rearrange_func = rearrange_function.get(datatype, None) + if rearrange_func is None: + raise ValueError(f"Unsupported rearrange type: {datatype}") + return rearrange_func(x, t, s) + + +def rearrange_b_ts_c_2_bs_t_c(x, t, s): + shape = x.shape + x = x.view(shape[0], t, s, shape[-1]) + x = x.transpose(1, 2) + return x.reshape(shape[0] * s, t, shape[-1]) + + +def rearrange_bs_t_c_2_b_ts_c(x, t, s): + shape = x.shape + x = x.view(-1, s, t, shape[-1]) + x = x.transpose(1, 2) + return x.reshape(-1, t * s, shape[-1]) + + +def rearrange_b_ts_c_2_bt_s_c(x, t, s): + shape = x.shape + return x.reshape(-1, s, shape[-1]) + + +def rearrange_bt_s_c_2_b_ts_c(x, t, s): + shape = x.shape + return x.reshape(-1, t * s, shape[-1]) + + +def rearrange_b_ts_c_2_b_t_s_c(x, t, s): + shape = x.shape + return x.reshape(shape[0], t, s, shape[-1]) + + +def rearrange_b_t_s_c_2_b_ts_c(x, t, s): + shape = x.shape + return x.reshape(shape[0], t * s, shape[-1]) + + +def rearrange_flatten_t(x): + x_shape = x.shape + x = x.transpose(1, 2) + return x.view((x_shape[0] * x_shape[2]), x_shape[1], x_shape[3], x_shape[4]) + + +def rearrange_unflatten_t(x, b): + x_shape = x.shape + x = x.view(b, x_shape[0] // b, x_shape[1], x_shape[2], x_shape[3]) + return x.transpose(1, 2) + + +def get_2d_sincos_pos_embed( + embed_dim, grid_size, cls_token=False, interpolation_scale=1.0, base_size=16 +): + """ + grid_size: int of the grid height and width return: pos_embed: [grid_size*grid_size, embed_dim] or + [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token) + """ + extra_tokens = 0 + if isinstance(grid_size, int): + grid_size = (grid_size, grid_size) + + grid_h = np.arange(grid_size[0], dtype=np.float32) / (grid_size[0] / base_size) / interpolation_scale + grid_w = np.arange(grid_size[1], dtype=np.float32) / (grid_size[1] / base_size) / interpolation_scale + grid = np.meshgrid(grid_w, grid_h) # here w goes first + grid = np.stack(grid, axis=0) + + grid = grid.reshape([2, 1, grid_size[1], grid_size[0]]) + pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid) + if cls_token and extra_tokens > 0: + pos_embed = np.concatenate([np.zeros([extra_tokens, embed_dim]), pos_embed], axis=0) + return pos_embed + + +def get_2d_sincos_pos_embed_from_grid(embed_dim, grid): + if embed_dim % 2 != 0: + raise ValueError("embed_dim must be divisible by 2") + + # use half of dimensions to encode grid_h + emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2) + emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1]) # (H*W, D/2) + + emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D) + return emb + + +def get_1d_sincos_pos_embed( + embed_dim, length, interpolation_scale=1.0, base_size=16 +): + pos = torch.arange(0, length).unsqueeze(1) / interpolation_scale + pos_embed = get_1d_sincos_pos_embed_from_grid(embed_dim, pos) + return pos_embed + + +def get_1d_sincos_pos_embed_from_grid(embed_dim, pos): + """ + embed_dim: output dimension for each position pos: a list of positions to be encoded: size (M,) out: (M, D) + """ + if embed_dim % 
2 != 0: + raise ValueError("embed_dim must be divisible by 2") + + omega = np.arange(embed_dim // 2, dtype=np.float64) + omega /= embed_dim / 2.0 + omega = 1.0 / 10000 ** omega # (D/2,) + + pos = pos.reshape(-1) # (M,) + out = np.einsum("m,d->md", pos, omega) # (M, D/2), outer product + + emb_sin = np.sin(out) # (M, D/2) + emb_cos = np.cos(out) # (M, D/2) + + emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D) + return emb \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/__init__.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..30af36f396102c760fe85408124eba70c9c93bb7 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/__init__.py @@ -0,0 +1,15 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/causalvae/__init__.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/causalvae/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..572fef02cf526e7fc37fa4d1fc2ca7cd53239f48 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/causalvae/__init__.py @@ -0,0 +1,17 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .modeling_causalvae import CausalVAEModelWrapper \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/causalvae/modeling_causalvae.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/causalvae/modeling_causalvae.py new file mode 100644 index 0000000000000000000000000000000000000000..abb3400a9b9476c3c9298f473fbc766046092ffa --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/causalvae/modeling_causalvae.py @@ -0,0 +1,643 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import logging +from typing import Tuple, Optional, Union +import glob +import importlib +import torch +import torch.nn as nn +import numpy as np +from einops import rearrange +import pytorch_lightning as pl +from diffusers.configuration_utils import register_to_config +from diffusers import ModelMixin, ConfigMixin +from opensoraplan.layers import nonlinearity, normalize +from opensoraplan.utils.utils import path_check + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +Module = str +MODULES_BASE = "opensoraplan.layers." +SPATIAL_DOWNSAMPLE = "Downsample" +RESNET_BLOCK_2D = "ResnetBlock2D" +RESNET_BLOCK_3D = "ResnetBlock3D" +SPATIAL_UPSAMPLE_2X = "SpatialUpsample2x" +SPATIAL_DOWNSAMPLE_2X = "SpatialDownsample2x" +TIME_DOWNSAMPLE_2X = "TimeDownsample2x" +CAUSAL_CONV_3D = "CausalConv3d" +LATENTS_SCALING_FACTOR = 0.18215 + + +def resolve_str_to_obj(str_val, append=True): + if append: + str_val = MODULES_BASE + str_val + module_name, class_name = str_val.rsplit('.', 1) + module = importlib.import_module(module_name) + return getattr(module, class_name) + + +class VideoBaseAePl(pl.LightningModule, ModelMixin, ConfigMixin): + config_name = "config.json" + + def __init__(self, *args, **kwargs) -> None: + super().__init__(*args, **kwargs) + + @property + def num_training_steps(self) -> int: + """Total training steps inferred from datamodule and devices.""" + if self.trainer.max_steps: + return self.trainer.max_steps + + limit_batches = self.trainer.limit_train_batches + batches = len(self.train_dataloader()) + batches = min(batches, limit_batches) if isinstance(limit_batches, int) else int(limit_batches * batches) + + num_devices = max(1, self.trainer.num_gpus, self.trainer.num_processes) + if self.trainer.tpu_cores: + num_devices = max(num_devices, self.trainer.tpu_cores) + + effective_accum = self.trainer.accumulate_grad_batches * num_devices + return (batches // effective_accum) * self.trainer.max_epochs + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]], **kwargs): + ckpt_files = glob.glob(os.path.join(pretrained_model_name_or_path, '*.ckpt')) + if ckpt_files: + # Adapt to PyTorch Lightning + last_ckpt_file = ckpt_files[-1] + config_file = os.path.join(pretrained_model_name_or_path, cls.config_name) + model = cls.from_config(config_file) + logger.info("init from %s", last_ckpt_file) + model.init_from_ckpt(last_ckpt_file) + return model + else: + return super().from_pretrained(pretrained_model_name_or_path, **kwargs) + + +class DiagonalGaussianDistribution(object): + def __init__(self, parameters, deterministic=False): + self.parameters = parameters + self.mean, self.logvar = torch.chunk(parameters, 2, dim=1) + self.logvar = torch.clamp(self.logvar, -30.0, 20.0) + self.deterministic = deterministic + self.std = torch.exp(0.5 * self.logvar) + self.var = torch.exp(self.logvar) + if self.deterministic: + self.var = self.std = torch.zeros_like(self.mean).to(device=self.parameters.device) + + def sample(self): + x = self.mean + self.std * torch.randn(self.mean.shape).to(device=self.parameters.device) + return x + + def kl(self, other=None): + if self.deterministic: + return torch.Tensor([0.]) + else: + if other is None: + return 0.5 * torch.sum(torch.pow(self.mean, 2) + + self.var - 1.0 - self.logvar, + dim=[1, 2, 3]) + else: + return 0.5 * torch.sum( + torch.pow(self.mean - other.mean, 2) / 
other.var + + self.var / other.var - 1.0 - self.logvar + other.logvar, + dim=[1, 2, 3]) + + def nll(self, sample): + dims = [1, 2, 3] + if self.deterministic: + return torch.Tensor([0.]) + logtwopi = np.log(2.0 * np.pi) + return 0.5 * torch.sum( + logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var, + dim=dims) + + def mode(self): + return self.mean + + +class Encoder(nn.Module): + def __init__( + self, + z_channels: int, + hidden_size: int, + hidden_size_mult: Tuple[int] = (1, 2, 4, 4), + attn_resolutions: Tuple[int] = (16,), + conv_in: Module = "Conv2d", + conv_out: Module = "CasualConv3d", + attention: Module = "AttnBlock", + resnet_blocks: Tuple[Module] = ( + RESNET_BLOCK_2D, + RESNET_BLOCK_2D, + RESNET_BLOCK_2D, + RESNET_BLOCK_3D, + ), + spatial_downsample: Tuple[Module] = ( + SPATIAL_DOWNSAMPLE, + SPATIAL_DOWNSAMPLE, + SPATIAL_DOWNSAMPLE, + "", + ), + temporal_downsample: Tuple[Module] = ("", "", "TimeDownsampleRes2x", ""), + mid_resnet: Module = RESNET_BLOCK_3D, + dropout: float = 0.0, + resolution: int = 256, + num_res_blocks: int = 2, + double_z: bool = True, + ) -> None: + super().__init__() + if len(resnet_blocks) != len(hidden_size_mult): + logger.error("resnet_blocks size does not equal to hidden_size_mult size.") + raise ValueError + # ---- Config ---- + self.num_resolutions = len(hidden_size_mult) + self.resolution = resolution + self.num_res_blocks = num_res_blocks + + # ---- In ---- + self.conv_in = resolve_str_to_obj(conv_in)( + 3, hidden_size, kernel_size=3, stride=1, padding=1 + ) + + # ---- Downsample ---- + curr_res = resolution + in_ch_mult = (1,) + tuple(hidden_size_mult) + self.in_ch_mult = in_ch_mult + self.down = nn.ModuleList() + for i_level in range(self.num_resolutions): + block = nn.ModuleList() + attn = nn.ModuleList() + block_in = hidden_size * in_ch_mult[i_level] + block_out = hidden_size * hidden_size_mult[i_level] + for _ in range(self.num_res_blocks): + block.append( + resolve_str_to_obj(resnet_blocks[i_level])( + in_channels=block_in, + out_channels=block_out, + dropout=dropout, + ) + ) + block_in = block_out + if curr_res in attn_resolutions: + attn.append(resolve_str_to_obj(attention)(block_in)) + down = nn.Module() + down.block = block + down.attn = attn + if spatial_downsample[i_level]: + down.downsample = resolve_str_to_obj(spatial_downsample[i_level])( + block_in, block_in + ) + curr_res = curr_res // 2 + if temporal_downsample[i_level]: + down.time_downsample = resolve_str_to_obj(temporal_downsample[i_level])( + block_in, block_in + ) + self.down.append(down) + + # ---- Mid ---- + self.mid = nn.Module() + self.mid.block_1 = resolve_str_to_obj(mid_resnet)( + in_channels=block_in, + out_channels=block_in, + dropout=dropout, + ) + self.mid.attn_1 = resolve_str_to_obj(attention)(block_in) + self.mid.block_2 = resolve_str_to_obj(mid_resnet)( + in_channels=block_in, + out_channels=block_in, + dropout=dropout, + ) + # ---- Out ---- + self.norm_out = normalize(block_in) + self.conv_out = resolve_str_to_obj(conv_out)( + block_in, + 2 * z_channels if double_z else z_channels, + kernel_size=3, + stride=1, + padding=1, + ) + + def forward(self, x): + hs = [self.conv_in(x)] + for i_level in range(self.num_resolutions): + for i_block in range(self.num_res_blocks): + h = self.down[i_level].block[i_block](hs[-1]) + if len(self.down[i_level].attn) > 0: + h = self.down[i_level].attn[i_block](h) + hs.append(h) + if hasattr(self.down[i_level], "downsample"): + hs.append(self.down[i_level].downsample(hs[-1])) + if hasattr(self.down[i_level], 
"time_downsample"): + hs_down = self.down[i_level].time_downsample(hs[-1]) + hs.append(hs_down) + + h = self.mid.block_1(h) + h = self.mid.attn_1(h) + h = self.mid.block_2(h) + + h = self.norm_out(h) + h = nonlinearity(h) + h = self.conv_out(h) + return h + + +class Decoder(nn.Module): + def __init__( + self, + z_channels: int, + hidden_size: int, + hidden_size_mult: Tuple[int] = (1, 2, 4, 4), + attn_resolutions: Tuple[int] = (16,), + conv_in: Module = "Conv2d", + conv_out: Module = "CasualConv3d", + attention: Module = "AttnBlock", + resnet_blocks: Tuple[Module] = ( + RESNET_BLOCK_3D, + RESNET_BLOCK_3D, + RESNET_BLOCK_3D, + RESNET_BLOCK_3D, + ), + spatial_upsample: Tuple[Module] = ( + "", + SPATIAL_UPSAMPLE_2X, + SPATIAL_UPSAMPLE_2X, + SPATIAL_UPSAMPLE_2X, + ), + temporal_upsample: Tuple[Module] = ("", "", "", "TimeUpsampleRes2x"), + mid_resnet: Module = RESNET_BLOCK_3D, + dropout: float = 0.0, + resolution: int = 256, + num_res_blocks: int = 2, + ): + super().__init__() + # ---- Config ---- + self.num_resolutions = len(hidden_size_mult) + self.resolution = resolution + self.num_res_blocks = num_res_blocks + + # ---- In ---- + block_in = hidden_size * hidden_size_mult[self.num_resolutions - 1] + curr_res = resolution // 2 ** (self.num_resolutions - 1) + self.conv_in = resolve_str_to_obj(conv_in)( + z_channels, block_in, kernel_size=3, padding=1 + ) + + # ---- Mid ---- + self.mid = nn.Module() + self.mid.block_1 = resolve_str_to_obj(mid_resnet)( + in_channels=block_in, + out_channels=block_in, + dropout=dropout, + ) + self.mid.attn_1 = resolve_str_to_obj(attention)(block_in) + self.mid.block_2 = resolve_str_to_obj(mid_resnet)( + in_channels=block_in, + out_channels=block_in, + dropout=dropout, + ) + + # ---- Upsample ---- + self.up = nn.ModuleList() + for i_level in reversed(range(self.num_resolutions)): + block = nn.ModuleList() + attn = nn.ModuleList() + block_out = hidden_size * hidden_size_mult[i_level] + for _ in range(self.num_res_blocks + 1): + block.append( + resolve_str_to_obj(resnet_blocks[i_level])( + in_channels=block_in, + out_channels=block_out, + dropout=dropout, + ) + ) + block_in = block_out + if curr_res in attn_resolutions: + attn.append(resolve_str_to_obj(attention)(block_in)) + up = nn.Module() + up.block = block + up.attn = attn + if spatial_upsample[i_level]: + up.upsample = resolve_str_to_obj(spatial_upsample[i_level])( + block_in, block_in + ) + curr_res = curr_res * 2 + if temporal_upsample[i_level]: + up.time_upsample = resolve_str_to_obj(temporal_upsample[i_level])( + block_in, block_in + ) + self.up.insert(0, up) + + # ---- Out ---- + self.norm_out = normalize(block_in) + self.conv_out = resolve_str_to_obj(conv_out)( + block_in, 3, kernel_size=3, padding=1 + ) + + def forward(self, z): + h = self.conv_in(z) + h = self.mid.block_1(h) + h = self.mid.attn_1(h) + h = self.mid.block_2(h) + + for i_level in reversed(range(self.num_resolutions)): + for i_block in range(self.num_res_blocks + 1): + h = self.up[i_level].block[i_block](h) + if len(self.up[i_level].attn) > 0: + h = self.up[i_level].attn[i_block](h) + if hasattr(self.up[i_level], "upsample"): + h = self.up[i_level].upsample(h) + if hasattr(self.up[i_level], "time_upsample"): + h = self.up[i_level].time_upsample(h) + + h = self.norm_out(h) + h = nonlinearity(h) + h = self.conv_out(h) + return h + + +class CausalVAEModel(VideoBaseAePl): + @register_to_config + def __init__( + self, + lr: float = 1e-5, + hidden_size: int = 128, + z_channels: int = 4, + hidden_size_mult: Tuple[int] = (1, 2, 4, 4), + 
attn_resolutions: Tuple[int] = None, + dropout: float = 0.0, + resolution: int = 256, + double_z: bool = True, + embed_dim: int = 4, + num_res_blocks: int = 2, + q_conv: str = CAUSAL_CONV_3D, + encoder_conv_in: Module = CAUSAL_CONV_3D, + encoder_conv_out: Module = CAUSAL_CONV_3D, + encoder_attention: Module = "AttnBlock3D", + encoder_resnet_blocks: Tuple[Module] = ( + RESNET_BLOCK_3D, + RESNET_BLOCK_3D, + RESNET_BLOCK_3D, + RESNET_BLOCK_3D, + ), + encoder_spatial_downsample: Tuple[Module] = ( + SPATIAL_DOWNSAMPLE_2X, + SPATIAL_DOWNSAMPLE_2X, + SPATIAL_DOWNSAMPLE_2X, + "", + ), + encoder_temporal_downsample: Tuple[Module] = ( + "", + TIME_DOWNSAMPLE_2X, + TIME_DOWNSAMPLE_2X, + "", + ), + encoder_mid_resnet: Module = RESNET_BLOCK_3D, + decoder_conv_in: Module = CAUSAL_CONV_3D, + decoder_conv_out: Module = CAUSAL_CONV_3D, + decoder_attention: Module = "AttnBlock3D", + decoder_resnet_blocks: Tuple[Module] = ( + RESNET_BLOCK_3D, + RESNET_BLOCK_3D, + RESNET_BLOCK_3D, + RESNET_BLOCK_3D, + ), + decoder_spatial_upsample: Tuple[Module] = ( + "", + SPATIAL_UPSAMPLE_2X, + SPATIAL_UPSAMPLE_2X, + SPATIAL_UPSAMPLE_2X, + ), + decoder_temporal_upsample: Tuple[Module] = ("", "", "TimeUpsample2x", "TimeUpsample2x"), + decoder_mid_resnet: Module = RESNET_BLOCK_3D, + ) -> None: + super().__init__() + self.tile_sample_min_size = 256 + self.tile_sample_min_size_t = 65 + self.tile_latent_min_size = int(self.tile_sample_min_size / (2 ** (len(hidden_size_mult) - 1))) + self.tile_overlap_factor = 0.25 + self.use_tiling = False + + self.learning_rate = lr + self.lr_g_factor = 1.0 + + self.encoder = Encoder( + z_channels=z_channels, + hidden_size=hidden_size, + hidden_size_mult=hidden_size_mult, + attn_resolutions=attn_resolutions, + conv_in=encoder_conv_in, + conv_out=encoder_conv_out, + attention=encoder_attention, + resnet_blocks=encoder_resnet_blocks, + spatial_downsample=encoder_spatial_downsample, + temporal_downsample=encoder_temporal_downsample, + mid_resnet=encoder_mid_resnet, + dropout=dropout, + resolution=resolution, + num_res_blocks=num_res_blocks, + double_z=double_z, + ) + + self.decoder = Decoder( + z_channels=z_channels, + hidden_size=hidden_size, + hidden_size_mult=hidden_size_mult, + attn_resolutions=attn_resolutions, + conv_in=decoder_conv_in, + conv_out=decoder_conv_out, + attention=decoder_attention, + resnet_blocks=decoder_resnet_blocks, + spatial_upsample=decoder_spatial_upsample, + temporal_upsample=decoder_temporal_upsample, + mid_resnet=decoder_mid_resnet, + dropout=dropout, + resolution=resolution, + num_res_blocks=num_res_blocks, + ) + + quant_conv_cls = resolve_str_to_obj(q_conv) + self.quant_conv = quant_conv_cls(2 * z_channels, 2 * embed_dim, 1) + self.post_quant_conv = quant_conv_cls(embed_dim, z_channels, 1) + self.patch_size = (1, 8, 8) + + def encode(self, x): + if self.use_tiling and ( + x.shape[-1] > self.tile_sample_min_size + or x.shape[-2] > self.tile_sample_min_size + ): + return self.tiled_encode2d(x) + h = self.encoder(x) + moments = self.quant_conv(h) + posterior = DiagonalGaussianDistribution(moments) + return posterior + + def decode(self, z): + if self.use_tiling and ( + z.shape[-1] > self.tile_latent_min_size + or z.shape[-2] > self.tile_latent_min_size + ): + return self.tiled_decode2d(z) + z = self.post_quant_conv(z) + dec = self.decoder(z) + return dec + + def forward(self, model_inputs, sample_posterior=True): + posterior = self.encode(model_inputs) + if sample_posterior: + z = posterior.sample() + else: + z = posterior.mode() + dec = self.decode(z) + return dec, 
posterior + + def blend_v( + self, a: torch.Tensor, b: torch.Tensor, blend_extent: int + ) -> torch.Tensor: + blend_extent = min(a.shape[3], b.shape[3], blend_extent) + alphas = torch.linspace(0, 1, blend_extent, device=a.device).view(-1, 1).expand(-1, a.shape[4]) + b[:, :, :, :blend_extent, :] = ( + a[:, :, :, -blend_extent:, :] * (1 - alphas) + b[:, :, :, :blend_extent, :] * alphas) + return b + + def blend_h( + self, a: torch.Tensor, b: torch.Tensor, blend_extent: int + ) -> torch.Tensor: + blend_extent = min(a.shape[4], b.shape[4], blend_extent) + alphas = torch.linspace(0, 1, blend_extent, device=a.device).expand(a.shape[3], -1) + b[:, :, :, :, :blend_extent] = ( + a[:, :, :, :, -blend_extent:] * (1 - alphas) + b[:, :, :, :, :blend_extent] * alphas) + return b + + def tiled_encode2d(self, x): + overlap_size = int(self.tile_sample_min_size * (1 - self.tile_overlap_factor)) + blend_extent = int(self.tile_latent_min_size * self.tile_overlap_factor) + row_limit = self.tile_latent_min_size - blend_extent + + # Split the image into 512x512 tiles and encode them separately. + rows = [] + for i in range(0, x.shape[3], overlap_size): + row = [] + for j in range(0, x.shape[4], overlap_size): + tile = x[ + :, + :, + :, + i: i + self.tile_sample_min_size, + j: j + self.tile_sample_min_size, + ] + tile = self.encoder(tile) + tile = self.quant_conv(tile) + row.append(tile) + rows.append(row) + result_rows = [] + for i, row in enumerate(rows): + result_row = [] + for j, tile in enumerate(row): + # blend the above tile and the left tile + # to the current tile and add the current tile to the result row + if i > 0: + tile = self.blend_v(rows[i - 1][j], tile, blend_extent) + if j > 0: + tile = self.blend_h(row[j - 1], tile, blend_extent) + result_row.append(tile[:, :, :, :row_limit, :row_limit]) + result_rows.append(torch.cat(result_row, dim=4)) + + moments = torch.cat(result_rows, dim=3) + posterior = DiagonalGaussianDistribution(moments) + + return posterior + + def tiled_decode2d(self, z): + + overlap_size = int(self.tile_latent_min_size * (1 - self.tile_overlap_factor)) + blend_extent = int(self.tile_sample_min_size * self.tile_overlap_factor) + row_limit = self.tile_sample_min_size - blend_extent + + # Split z into overlapping 64x64 tiles and decode them separately. + # The tiles have an overlap to avoid seams between tiles. 
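+        # Worked example under the default config (values follow from the defaults
+        # above and are given only as a hedged illustration):
+        # tile_latent_min_size = 256 // 2**3 = 32, so 32x32 latent tiles are taken
+        # every overlap_size = int(32 * 0.75) = 24 latents. Each tile decodes to
+        # 256x256 pixels; neighbouring decoded tiles overlap by blend_extent = 64
+        # pixels, are linearly blended by blend_v/blend_h, and are then cropped to
+        # row_limit = 192 pixels (= 24 * 8) before concatenation.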
+        rows = []
+        for i in range(0, z.shape[3], overlap_size):
+            row = []
+            for j in range(0, z.shape[4], overlap_size):
+                tile = z[
+                    :,
+                    :,
+                    :,
+                    i: i + self.tile_latent_min_size,
+                    j: j + self.tile_latent_min_size,
+                ]
+                tile = self.post_quant_conv(tile)
+                decoded = self.decoder(tile)
+                row.append(decoded)
+            rows.append(row)
+        result_rows = []
+        for i, row in enumerate(rows):
+            result_row = []
+            for j, tile in enumerate(row):
+                # blend the above tile and the left tile
+                # to the current tile and add the current tile to the result row
+                if i > 0:
+                    tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
+                if j > 0:
+                    tile = self.blend_h(row[j - 1], tile, blend_extent)
+                result_row.append(tile[:, :, :, :row_limit, :row_limit])
+            result_rows.append(torch.cat(result_row, dim=4))
+
+        dec = torch.cat(result_rows, dim=3)
+        return dec
+
+    def enable_tiling(self, use_tiling: bool = True):
+        self.use_tiling = use_tiling
+
+    def disable_tiling(self):
+        self.enable_tiling(False)
+
+    def init_from_ckpt(self, path, ignore_keys=(), remove_loss=True):
+        sd = torch.load(path, map_location="cpu")
+        if "state_dict" in sd:
+            sd = sd["state_dict"]
+        keys = list(sd.keys())
+        for k in keys:
+            # Drop ignored keys and (optionally) loss-related keys exactly once.
+            if any(k.startswith(ik) for ik in ignore_keys) or (remove_loss and "loss" in k):
+                logger.info("Deleting key %s from state_dict.", k)
+                del sd[k]
+        self.load_state_dict(sd, strict=False)
+
+
+class CausalVAEModelWrapper(nn.Module):
+    def __init__(self, vae, latent_size):
+        super(CausalVAEModelWrapper, self).__init__()
+        self.vae = vae
+        self.latent_size = latent_size
+
+    @classmethod
+    def from_pretrained(cls, model_path, latent_size, cache_dir, **kwargs):
+        real_path = path_check(model_path)
+        if len(latent_size) != 2 or latent_size[0] <= 0 or latent_size[1] <= 0:
+            raise ValueError("latent_size shape or value is invalid.")
+        causal_vae = CausalVAEModel.from_pretrained(real_path, cache_dir=cache_dir, **kwargs)
+        return cls(causal_vae, latent_size)
+
+    def decode(self, x):
+        x = self.vae.decode(x / LATENTS_SCALING_FACTOR)
+        x = rearrange(x, 'b c t h w -> b t c h w').contiguous()
+        return x
diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/comm.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/comm.py
new file mode 100644
index 0000000000000000000000000000000000000000..aee624d833803998c32803f1088f49c9fc0221a7
--- /dev/null
+++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/comm.py
@@ -0,0 +1,144 @@
+#!/usr/bin/env python
+# coding=utf-8
+# Copyright 2024 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+# -------------------------------------------------------- +# References: +# DSP : https://github.com/NUS-HPC-AI-Lab/VideoSys +# -------------------------------------------------------- + +from dataclasses import dataclass +import logging +import torch +import torch.distributed as dist +from opensoraplan.models.parallel_mgr import get_sequence_parallel_size + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + + +@dataclass +class SplitParams: + input_: torch.Tensor + dim: int + grad_scale: str + pad: int + + +def _all_to_all_func(input_, world_size, process_group, scatter_dim=2, gather_dim=1): + input_list = [t.contiguous() for t in torch.tensor_split(input_, world_size, scatter_dim)] + output_list = [torch.empty_like(input_list[0]) for _ in range(world_size)] + dist.all_to_all(output_list, input_list, group=process_group) + return torch.cat(output_list, dim=gather_dim).contiguous() + + +def split_sequence(input_, process_group: dist.ProcessGroup, dim: int, pad: int): + world_size = dist.get_world_size(process_group) + rank = dist.get_rank(process_group) + if world_size == 1: + return input_ + + if pad > 0: + pad_size = list(input_.shape) + pad_size[dim] = pad + input_ = torch.cat([input_, torch.zeros(pad_size, dtype=input_.dtype, device=input_.device)], dim=dim) + + dim_size = input_.size(dim) + if dim_size % world_size != 0: + logger.error(f"dim_size ( %d ) is not divisible by world_size ( %d )", dim_size, world_size) + raise ValueError(f"dim_size ({dim_size}) is not divisible by world_size ({world_size})") + tensor_list = torch.split(input_, dim_size // world_size, dim=dim) + output = tensor_list[rank].contiguous() + return output + + +def gather_sequence(input_, process_group: dist.ProcessGroup, dim: int, pad: int): + input_ = input_.contiguous() + world_size = dist.get_world_size(process_group) + rank = dist.get_rank(process_group) + if world_size == 1: + return input_ + + #all gather + tensor_list = [torch.empty_like(input_) for _ in range(world_size)] + torch.distributed.all_gather(tensor_list, input_, group=process_group) + + #concat + output = torch.cat(tensor_list, dim=dim) + + if pad > 0: + output = output.narrow(dim, 0, output.size(dim) - pad) + + return output + +# ====== +# Pad +# ====== + +SPTIAL_PAD = 0 +TEMPORAL_PAD = 0 + + +def set_spatial_pad(dim_size: int): + sp_size = get_sequence_parallel_size() + pad = (sp_size - (dim_size % sp_size)) % sp_size + global SPTIAL_PAD + SPTIAL_PAD = pad + + +def get_spatial_pad() -> int: + return SPTIAL_PAD + + +def set_temporal_pad(dim_size: int): + sp_size = get_sequence_parallel_size() + pad = (sp_size - (dim_size % sp_size)) % sp_size + global TEMPORAL_PAD + TEMPORAL_PAD = pad + + +def get_temporal_pad() -> int: + return TEMPORAL_PAD + + +def all_to_all_with_pad( + input_: torch.Tensor, + process_group: dist.ProcessGroup, + **kwargs +): + scatter_dim = kwargs.get("scatter_dim", 2) + gather_dim = kwargs.get("gather_dim", 1) + scatter_pad = kwargs.get("scatter_pad", 0) + gather_pad = kwargs.get("gather_pad", 0) + + if scatter_pad > 0: + pad_shape = list(input_.shape) + pad_shape[scatter_dim] = scatter_pad + pad_tensor = torch.zeros(pad_shape, device=input_.device, dtype=input_.dtype) + input_ = torch.cat([input_, pad_tensor], dim=scatter_dim) + + world_size = dist.get_world_size(process_group) + if input_.shape[scatter_dim] % world_size != 0: + logger.error(f"dim_size ( %d ) is not divisible by world_size ( %d )", input_.shape[scatter_dim], world_size) + raise ValueError(f"dim_size 
({input_.shape[scatter_dim]}) is not divisible by world_size ({world_size})") + + input_ = _all_to_all_func(input_, world_size, process_group, scatter_dim, gather_dim) + + if gather_pad > 0: + input_ = input_.narrow(gather_dim, 0, input_.size(gather_dim) - gather_pad) + + return input_ diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/latte/__init__.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/latte/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..746552433ecd037aeed4a4e2ca1ff503a0b43ff8 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/latte/__init__.py @@ -0,0 +1,17 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .modeling_latte import LatteT2V \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/latte/latte_modules.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/latte/latte_modules.py new file mode 100644 index 0000000000000000000000000000000000000000..f7798a79540bfb95f8bcc6872d89f828b7957030 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/latte/latte_modules.py @@ -0,0 +1,1009 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
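+
+# Illustrative usage sketch of the Attention block defined below (hedged: the
+# dimensions are assumptions for the example only). AttnProcessor2 selects the
+# scaled-dot-product-attention backend via `attention_mode`
+# ('flash' | 'xformers' | 'math'):
+#
+#     attn = Attention(query_dim=1152, heads=16, dim_head=72, attention_mode='math')
+#     out = attn(torch.randn(2, 256, 1152))   # self-attention, (B, N, C) -> (B, N, C)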
+ +import os +from dataclasses import dataclass +from importlib import import_module +from typing import Any, Dict, Optional, Tuple, Callable +from diffusers.utils import USE_PEFT_BACKEND, BaseOutput, deprecate, is_xformers_available +from diffusers.models.lora import LoRACompatibleLinear + +import torch +import torch.nn.functional as F +from torch import nn +from diffusers.utils.torch_utils import maybe_allow_in_graph +from diffusers.models.embeddings import SinusoidalPositionalEmbedding, TimestepEmbedding, Timesteps +from diffusers.models.normalization import AdaLayerNorm, AdaLayerNormZero +from diffusers.models.attention_processor import SpatialNorm, LORA_ATTENTION_PROCESSORS, \ + CustomDiffusionAttnProcessor, CustomDiffusionXFormersAttnProcessor, CustomDiffusionAttnProcessor2_0, \ + AttnAddedKVProcessor, AttnAddedKVProcessor2_0, SlicedAttnAddedKVProcessor, XFormersAttnAddedKVProcessor, \ + LoRAAttnAddedKVProcessor, LoRAXFormersAttnProcessor, XFormersAttnProcessor, LoRAAttnProcessor2_0, \ + LoRAAttnProcessor, AttnProcessor, SlicedAttnProcessor, logger +from diffusers.models.activations import GEGLU, GELU, ApproximateGELU +from opensoraplan.layers.utils import get_2d_sincos_pos_embed + +if is_xformers_available(): # "1"仿真环境 + import xformers + import xformers.ops +else: + xformers = None + +ATTR_PROCESSOR = "processor" +CUDA_DEVICE = "cuda" +ATTR_SDPA = "scaled_dot_product_attention" + + +@maybe_allow_in_graph +class Attention(nn.Module): + def __init__( + self, + query_dim: int, + cross_attention_dim: Optional[int] = None, + heads: int = 8, + dim_head: int = 64, + dropout: float = 0.0, + bias: bool = False, + upcast_attention: bool = False, + upcast_softmax: bool = False, + cross_attention_norm: Optional[str] = None, + cross_attention_norm_num_groups: int = 32, + added_kv_proj_dim: Optional[int] = None, + norm_num_groups: Optional[int] = None, + spatial_norm_dim: Optional[int] = None, + out_bias: bool = True, + scale_qk: bool = True, + only_cross_attention: bool = False, + eps: float = 1e-5, + rescale_output_factor: float = 1.0, + residual_connection: bool = False, + _from_deprecated_attn_block: bool = False, + processor: Optional["AttnProcessor"] = None, + attention_mode: str = 'xformers', + ): + super().__init__() + self.inner_dim = dim_head * heads + self.cross_attention_dim = cross_attention_dim if cross_attention_dim is not None else query_dim + self.upcast_attention = upcast_attention + self.upcast_softmax = upcast_softmax + self.rescale_output_factor = rescale_output_factor + self.residual_connection = residual_connection + self.dropout = dropout + + # we make use of this private variable to know whether this class is loaded + # with an deprecated state dict so that we can convert it on the fly + self._from_deprecated_attn_block = _from_deprecated_attn_block + + self.scale_qk = scale_qk + self.scale = dim_head**-0.5 if self.scale_qk else 1.0 + + self.heads = heads + # for slice_size > 0 the attention score computation + # is split across the batch axis to save memory + # You can set slice_size with `set_attention_slice` + self.sliceable_head_dim = heads + + self.added_kv_proj_dim = added_kv_proj_dim + self.only_cross_attention = only_cross_attention + + if self.added_kv_proj_dim is None and self.only_cross_attention: + raise ValueError( + "`only_cross_attention` can only be set to True if `added_kv_proj_dim` is not None. Make sure to set " + "either `only_cross_attention=False` or define `added_kv_proj_dim`." 
+ ) + + if norm_num_groups is not None: + self.group_norm = nn.GroupNorm(num_channels=query_dim, num_groups=norm_num_groups, eps=eps, affine=True) + else: + self.group_norm = None + + if spatial_norm_dim is not None: + self.spatial_norm = SpatialNorm(f_channels=query_dim, zq_channels=spatial_norm_dim) + else: + self.spatial_norm = None + + if cross_attention_norm is None: + self.norm_cross = None + elif cross_attention_norm == "layer_norm": + self.norm_cross = nn.LayerNorm(self.cross_attention_dim) + elif cross_attention_norm == "group_norm": + if self.added_kv_proj_dim is not None: + # The given `encoder_hidden_states` are initially of shape + # (batch_size, seq_len, added_kv_proj_dim) before being projected + # to (batch_size, seq_len, cross_attention_dim). The norm is applied + # before the projection, so we need to use `added_kv_proj_dim` as + # the number of channels for the group norm. + norm_cross_num_channels = added_kv_proj_dim + else: + norm_cross_num_channels = self.cross_attention_dim + + self.norm_cross = nn.GroupNorm( + num_channels=norm_cross_num_channels, num_groups=cross_attention_norm_num_groups, eps=1e-5, affine=True + ) + else: + raise ValueError( + f"unknown cross_attention_norm: {cross_attention_norm}. Should be None, 'layer_norm' or 'group_norm'" + ) + + if USE_PEFT_BACKEND: + linear_cls = nn.Linear + else: + linear_cls = LoRACompatibleLinear + + self.to_q = linear_cls(query_dim, self.inner_dim, bias=bias) + + if not self.only_cross_attention: + # only relevant for the `AddedKVProcessor` classes + self.to_k = linear_cls(self.cross_attention_dim, self.inner_dim, bias=bias) + self.to_v = linear_cls(self.cross_attention_dim, self.inner_dim, bias=bias) + else: + self.to_k = None + self.to_v = None + + if self.added_kv_proj_dim is not None: + self.add_k_proj = linear_cls(added_kv_proj_dim, self.inner_dim) + self.add_v_proj = linear_cls(added_kv_proj_dim, self.inner_dim) + + self.to_out = nn.ModuleList([]) + self.to_out.append(linear_cls(self.inner_dim, query_dim, bias=out_bias)) + self.to_out.append(nn.Dropout(dropout)) + + # set attention processor + # We use the AttnProcessor2_0 by default when torch 2.x is used which uses + # torch.nn.functional.scaled_dot_product_attention for native Flash/memory_efficient_attention + # but only if it has the default `scale` argument. + if processor is None: + processor = ( + AttnProcessor2(attention_mode) if + hasattr(F, ATTR_SDPA) and self.scale_qk else AttnProcessor() + ) + self.set_processor(processor) + + def set_processor(self, processor: "AttnProcessor", _remove_lora: bool = False) -> None: + condition_processor = not USE_PEFT_BACKEND and hasattr(self, ATTR_PROCESSOR) + condition_lora = _remove_lora and self.to_q.lora_layer is not None + if condition_processor and condition_lora: + deprecate( + "set_processor to offload LoRA", + "0.26.0", + "In detail, removing LoRA layers via calling `set_default_attn_processor` is deprecated. 
" + "Please make sure to call `pipe.unload_lora_weights()` instead.", + ) + + for module in self.modules(): + if hasattr(module, "set_lora_layer"): + module.set_lora_layer(None) + + # if current processor is in `self._modules` and if passed `processor` is not, we need to + # pop `processor` from `self._modules` + if ( + hasattr(self, ATTR_PROCESSOR) + and isinstance(self.processor, torch.nn.Module) + and not isinstance(processor, torch.nn.Module) + ): + logger.info(f"You are removing possibly trained weights of {self.processor} with {processor}") + self._modules.pop(ATTR_PROCESSOR) + + self.processor = processor + + def forward( + self, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + **cross_attention_kwargs, + ) -> torch.Tensor: + # The `Attention` class can call different attention processors / attention functions + # here we simply pass along all tensors to the selected processor class + # For standard processors that are defined here, `**cross_attention_kwargs` is empty + return self.processor( + self, + hidden_states, + encoder_hidden_states=encoder_hidden_states, + attention_mask=attention_mask, + **cross_attention_kwargs, + ) + + def prepare_attention_mask( + self, attention_mask: torch.Tensor, target_length: int, batch_size: int, out_dim: int = 3 + ) -> torch.Tensor: + head_size = self.heads + if attention_mask is None: + return attention_mask + + current_length: int = attention_mask.shape[-1] + if current_length != target_length: + if attention_mask.device.type == "mps": + # HACK: MPS: Does not support padding by greater than dimension of input tensor. + # Instead, we can manually construct the padding tensor. + padding_shape = (attention_mask.shape[0], attention_mask.shape[1], target_length) + padding = torch.zeros(padding_shape, dtype=attention_mask.dtype, device=attention_mask.device) + attention_mask = torch.cat([attention_mask, padding], dim=2) + else: + attention_mask = F.pad(attention_mask, (0, target_length), value=0.0) + + if out_dim == 3: + if attention_mask.shape[0] < batch_size * head_size: + attention_mask = attention_mask.repeat_interleave(head_size, dim=0) + elif out_dim == 4: + attention_mask = attention_mask.unsqueeze(1) + attention_mask = attention_mask.repeat_interleave(head_size, dim=1) + + return attention_mask + + def norm_encoder_hidden_states(self, encoder_hidden_states: torch.Tensor) -> torch.Tensor: + if self.norm_cross is None: + raise ValueError("self.norm_cross must be defined to call self.norm_encoder_hidden_states") + + if isinstance(self.norm_cross, nn.LayerNorm): + encoder_hidden_states = self.norm_cross(encoder_hidden_states) + elif isinstance(self.norm_cross, nn.GroupNorm): + # Group norm norms along the channels dimension and expects input to be in the shape of (N, C, *). + # In this case, we want to norm along the hidden dimension, so we need to move + # the shape (batch_size, sequence_length, hidden_size) to (batch_size, hidden_size, sequence_length) + encoder_hidden_states = encoder_hidden_states.transpose(1, 2) + encoder_hidden_states = self.norm_cross(encoder_hidden_states) + encoder_hidden_states = encoder_hidden_states.transpose(1, 2) + else: + raise ValueError + + return encoder_hidden_states + + +class AttnProcessor2: + r""" + Processor for implementing scaled dot-product attention (enabled by default if you're using PyTorch 2.0). 
+ """ + + def __init__(self, attention_mode='xformers'): + self.attention_mode = attention_mode + if not hasattr(F, ATTR_SDPA): + raise ImportError("AttnProcessor2 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.") + + def __call__( + self, + attn: Attention, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + temb: Optional[torch.FloatTensor] = None, + scale: float = 1.0, + ) -> torch.FloatTensor: + residual = hidden_states + + args = () if USE_PEFT_BACKEND else (scale,) + + if attn.spatial_norm is not None: + hidden_states = attn.spatial_norm(hidden_states, temb) + + input_ndim = hidden_states.ndim + + if input_ndim == 4: + batch_size, channel, height, width = hidden_states.shape + hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2) + + batch_size, sequence_length, _ = ( + hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape + ) + + if attention_mask is not None: + attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size) + # scaled_dot_product_attention expects attention_mask shape to be + # the shape like (batch, heads, source_length, target_length) + attention_mask = attention_mask.view(batch_size, attn.heads, -1, attention_mask.shape[-1]) + + if attn.group_norm is not None: + hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2) + + args = () if USE_PEFT_BACKEND else (scale,) + query = attn.to_q(hidden_states, *args) + + if encoder_hidden_states is None: + encoder_hidden_states = hidden_states + elif attn.norm_cross: + encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states) + + key = attn.to_k(encoder_hidden_states, *args) + value = attn.to_v(encoder_hidden_states, *args) + + inner_dim = key.shape[-1] + head_dim = inner_dim // attn.heads + + query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2) + + key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2) + value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2) + + # the output of sdp = (batch, num_heads, seq_len, head_dim) + if self.attention_mode == 'flash': + if (attention_mask is not None) and (not torch.all(attention_mask.bool())): + raise ValueError("flash-attn do not support attention_mask") + with torch.backends.cuda.sdp_kernel(enable_math=False, enable_flash=True, enable_mem_efficient=False): + hidden_states = F.scaled_dot_product_attention( + query, key, value, dropout_p=0.0, is_causal=False + ) + elif self.attention_mode == 'xformers': + with torch.backends.cuda.sdp_kernel(enable_math=False, enable_flash=False, enable_mem_efficient=True): + hidden_states = F.scaled_dot_product_attention( + query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False + ) + elif self.attention_mode == 'math': + hidden_states = F.scaled_dot_product_attention( + query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False + ) + else: + raise NotImplementedError(f'Found attention_mode: {self.attention_mode}') + hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim) + hidden_states = hidden_states.to(query.dtype) + + # linear proj + hidden_states = attn.to_out[0](hidden_states, *args) + # dropout + hidden_states = attn.to_out[1](hidden_states) + + if input_ndim == 4: + hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width) + + if 
attn.residual_connection: + hidden_states = hidden_states + residual + + hidden_states = hidden_states / attn.rescale_output_factor + + return hidden_states + + +@maybe_allow_in_graph +class GatedSelfAttentionDense(nn.Module): + def __init__(self, query_dim: int, context_dim: int, n_heads: int, d_head: int): + super().__init__() + + # we need a linear projection since we need cat visual feature and obj feature + self.linear = nn.Linear(context_dim, query_dim) + + self.attn = Attention(query_dim=query_dim, heads=n_heads, dim_head=d_head) + self.ff = FeedForward(query_dim, activation_fn="geglu") + + self.norm1 = nn.LayerNorm(query_dim) + self.norm2 = nn.LayerNorm(query_dim) + + self.register_parameter("alpha_attn", nn.Parameter(torch.tensor(0.0))) + self.register_parameter("alpha_dense", nn.Parameter(torch.tensor(0.0))) + + self.enabled = True + + def forward(self, x: torch.Tensor, objs: torch.Tensor) -> torch.Tensor: + if not self.enabled: + return x + + n_visual = x.shape[1] + objs = self.linear(objs) + + x = x + self.alpha_attn.tanh() * self.attn(self.norm1(torch.cat([x, objs], dim=1)))[:, :n_visual, :] + x = x + self.alpha_dense.tanh() * self.ff(self.norm2(x)) + + return x + + +class FeedForward(nn.Module): + def __init__( + self, + dim: int, + dim_out: Optional[int] = None, + mult: int = 4, + dropout: float = 0.0, + activation_fn: str = "geglu", + final_dropout: bool = False, + ): + super().__init__() + inner_dim = int(dim * mult) + dim_out = dim_out if dim_out is not None else dim + linear_cls = LoRACompatibleLinear if not USE_PEFT_BACKEND else nn.Linear + + if activation_fn == "gelu": + act_fn = GELU(dim, inner_dim) + if activation_fn == "gelu-approximate": + act_fn = GELU(dim, inner_dim, approximate="tanh") + elif activation_fn == "geglu": + act_fn = GEGLU(dim, inner_dim) + elif activation_fn == "geglu-approximate": + act_fn = ApproximateGELU(dim, inner_dim) + + self.net = nn.ModuleList([]) + # project in + self.net.append(act_fn) + # project dropout + self.net.append(nn.Dropout(dropout)) + # project out + self.net.append(linear_cls(inner_dim, dim_out)) + # FF as used in Vision Transformer, MLP-Mixer, etc. 
have a final dropout + if final_dropout: + self.net.append(nn.Dropout(dropout)) + + def forward(self, hidden_states: torch.Tensor, scale: float = 1.0) -> torch.Tensor: + compatible_cls = (GEGLU,) if USE_PEFT_BACKEND else (GEGLU, LoRACompatibleLinear) + for module in self.net: + if isinstance(module, compatible_cls): + hidden_states = module(hidden_states, scale) + else: + hidden_states = module(hidden_states) + return hidden_states + + +@maybe_allow_in_graph +class BasicTransformerBlockTemporal(nn.Module): + def __init__( + self, + dim: int, + num_attention_heads: int, + attention_head_dim: int, + dropout=0.0, + cross_attention_dim: Optional[int] = None, + activation_fn: str = "geglu", + num_embeds_ada_norm: Optional[int] = None, + attention_bias: bool = False, + only_cross_attention: bool = False, + double_self_attention: bool = False, + upcast_attention: bool = False, + norm_elementwise_affine: bool = True, + norm_type: str = "layer_norm", # 'layer_norm', 'ada_norm', 'ada_norm_zero', 'ada_norm_single' + norm_eps: float = 1e-5, + final_dropout: bool = False, + attention_type: str = "default", + positional_embeddings: Optional[str] = None, + num_positional_embeddings: Optional[int] = None, + attention_mode: str = "xformers", + ): + super().__init__() + self.only_cross_attention = only_cross_attention + + self.use_ada_layer_norm_zero = (num_embeds_ada_norm is not None) and norm_type == "ada_norm_zero" + self.use_ada_layer_norm = (num_embeds_ada_norm is not None) and norm_type == "ada_norm" + self.use_ada_layer_norm_single = norm_type == "ada_norm_single" + self.use_layer_norm = norm_type == "layer_norm" + + if norm_type in ("ada_norm", "ada_norm_zero") and num_embeds_ada_norm is None: + raise ValueError( + f"`norm_type` is set to {norm_type}, but `num_embeds_ada_norm` is not defined. Please make sure to" + f" define `num_embeds_ada_norm` if setting `norm_type` to {norm_type}." + ) + + if positional_embeddings and (num_positional_embeddings is None): + raise ValueError( + "If `positional_embedding` type is defined, `num_positition_embeddings` must also be defined." + ) + + if positional_embeddings == "sinusoidal": + self.pos_embed = SinusoidalPositionalEmbedding(dim, max_seq_length=num_positional_embeddings) + else: + self.pos_embed = None + + # Define 3 blocks. Each block has its own normalization layer. + # 1. Self-Attn + if self.use_ada_layer_norm: + self.norm1 = AdaLayerNorm(dim, num_embeds_ada_norm) + elif self.use_ada_layer_norm_zero: + self.norm1 = AdaLayerNormZero(dim, num_embeds_ada_norm) + else: + self.norm1 = nn.LayerNorm(dim, elementwise_affine=norm_elementwise_affine, eps=norm_eps) + + self.attn1 = Attention( + query_dim=dim, + heads=num_attention_heads, + dim_head=attention_head_dim, + dropout=dropout, + bias=attention_bias, + cross_attention_dim=cross_attention_dim if only_cross_attention else None, + upcast_attention=upcast_attention, + attention_mode=attention_mode + ) + + # 3. Feed-forward + self.norm3 = nn.LayerNorm(dim, elementwise_affine=norm_elementwise_affine, eps=norm_eps) + + self.ff = FeedForward(dim, dropout=dropout, activation_fn=activation_fn, final_dropout=final_dropout) + + # 4. Fuser + if attention_type == "gated" or attention_type == "gated-text-image": + self.fuser = GatedSelfAttentionDense(dim, cross_attention_dim, num_attention_heads, attention_head_dim) + + # 5. Scale-shift for PixArt-Alpha. 
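+        # The 6 rows of scale_shift_table correspond to (shift_msa, scale_msa,
+        # gate_msa, shift_mlp, scale_mlp, gate_mlp); forward() adds them to the
+        # projected timestep embedding and chunks the sum when norm_type is
+        # 'ada_norm_single' (the PixArt-Alpha style conditioning).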
+ if self.use_ada_layer_norm_single: + self.scale_shift_table = nn.Parameter(torch.randn(6, dim) / dim ** 0.5) + + # let chunk size default to None + self._chunk_size = None + self._chunk_dim = 0 + + def forward( + self, + hidden_states: torch.FloatTensor, + attention_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + timestep: Optional[torch.LongTensor] = None, + cross_attention_kwargs: Dict[str, Any] = None, + class_labels: Optional[torch.LongTensor] = None, + ) -> torch.FloatTensor: + # Notice that normalization is always applied before the real computation in the following blocks. + # 0. Self-Attention + batch_size = hidden_states.shape[0] + + if self.use_ada_layer_norm: + norm_hidden_states = self.norm1(hidden_states, timestep) + elif self.use_ada_layer_norm_zero: + norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1( + hidden_states, timestep, class_labels, hidden_dtype=hidden_states.dtype + ) + elif self.use_layer_norm: + norm_hidden_states = self.norm1(hidden_states) + elif self.use_ada_layer_norm_single: + shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = ( + self.scale_shift_table[None] + timestep.reshape(batch_size, 6, -1) + ).chunk(6, dim=1) + norm_hidden_states = self.norm1(hidden_states) + norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa + norm_hidden_states = norm_hidden_states.squeeze(1) + else: + raise ValueError("Incorrect norm used") + + if self.pos_embed is not None: + norm_hidden_states = self.pos_embed(norm_hidden_states) + + # 1. Retrieve lora scale. + lora_scale = cross_attention_kwargs.get("scale", 1.0) if cross_attention_kwargs is not None else 1.0 + + # 2. Prepare GLIGEN inputs + cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {} + gligen_kwargs = cross_attention_kwargs.pop("gligen", None) + + attn_output = self.attn1( + norm_hidden_states, + encoder_hidden_states=encoder_hidden_states if self.only_cross_attention else None, + attention_mask=attention_mask, + **cross_attention_kwargs, + ) + if self.use_ada_layer_norm_zero: + attn_output = gate_msa.unsqueeze(1) * attn_output + elif self.use_ada_layer_norm_single: + attn_output = gate_msa * attn_output + + hidden_states = attn_output + hidden_states + if hidden_states.ndim == 4: + hidden_states = hidden_states.squeeze(1) + + # 2.5 GLIGEN Control + if gligen_kwargs is not None: + hidden_states = self.fuser(hidden_states, gligen_kwargs["objs"]) + + # 4. Feed-forward + if self.use_ada_layer_norm_zero: + norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None] + + if self.use_ada_layer_norm_single: + norm_hidden_states = self.norm3(hidden_states) + norm_hidden_states = norm_hidden_states * (1 + scale_mlp) + shift_mlp + + if self._chunk_size is not None: + # "feed_forward_chunk_size" can be used to save memory + if norm_hidden_states.shape[self._chunk_dim] % self._chunk_size != 0: + raise ValueError( + f"`hidden_states` dimension to be chunked: {norm_hidden_states.shape[self._chunk_dim]} " + f"has to be divisible by chunk size: {self._chunk_size}. Make sure to set an appropriate " + f"`chunk_size` when calling `unet.enable_forward_chunking`." 
+ ) + + num_chunks = norm_hidden_states.shape[self._chunk_dim] // self._chunk_size + ff_output = torch.cat( + [ + self.ff(hid_slice, scale=lora_scale) + for hid_slice in norm_hidden_states.chunk(num_chunks, dim=self._chunk_dim) + ], + dim=self._chunk_dim, + ) + else: + ff_output = self.ff(norm_hidden_states, scale=lora_scale) + + if self.use_ada_layer_norm_zero: + ff_output = gate_mlp.unsqueeze(1) * ff_output + elif self.use_ada_layer_norm_single: + ff_output = gate_mlp * ff_output + + hidden_states = ff_output + hidden_states + if hidden_states.ndim == 4: + hidden_states = hidden_states.squeeze(1) + + return hidden_states + + +@maybe_allow_in_graph +class BasicTransformerBlock(nn.Module): + def __init__( + self, + dim: int, + num_attention_heads: int, + attention_head_dim: int, + dropout=0.0, + cross_attention_dim: Optional[int] = None, + activation_fn: str = "geglu", + num_embeds_ada_norm: Optional[int] = None, + attention_bias: bool = False, + only_cross_attention: bool = False, + double_self_attention: bool = False, + upcast_attention: bool = False, + norm_elementwise_affine: bool = True, + norm_type: str = "layer_norm", # 'layer_norm', 'ada_norm', 'ada_norm_zero', 'ada_norm_single' + norm_eps: float = 1e-5, + final_dropout: bool = False, + attention_type: str = "default", + positional_embeddings: Optional[str] = None, + num_positional_embeddings: Optional[int] = None, + attention_mode: str = "xformers" + ): + super().__init__() + self.only_cross_attention = only_cross_attention + + self.use_ada_layer_norm_zero = (num_embeds_ada_norm is not None) and norm_type == "ada_norm_zero" + self.use_ada_layer_norm = (num_embeds_ada_norm is not None) and norm_type == "ada_norm" + self.use_ada_layer_norm_single = norm_type == "ada_norm_single" + self.use_layer_norm = norm_type == "layer_norm" + + if norm_type in ("ada_norm", "ada_norm_zero") and num_embeds_ada_norm is None: + raise ValueError( + f"`norm_type` is set to {norm_type}, but `num_embeds_ada_norm` is not defined. Please make sure to" + f" define `num_embeds_ada_norm` if setting `norm_type` to {norm_type}." + ) + + if positional_embeddings and (num_positional_embeddings is None): + raise ValueError( + "If `positional_embedding` type is defined, `num_positition_embeddings` must also be defined." + ) + + if positional_embeddings == "sinusoidal": + self.pos_embed = SinusoidalPositionalEmbedding(dim, max_seq_length=num_positional_embeddings) + else: + self.pos_embed = None + + # Define 3 blocks. Each block has its own normalization layer. + # 1. Self-Attn + if self.use_ada_layer_norm: + self.norm1 = AdaLayerNorm(dim, num_embeds_ada_norm) + elif self.use_ada_layer_norm_zero: + self.norm1 = AdaLayerNormZero(dim, num_embeds_ada_norm) + else: + self.norm1 = nn.LayerNorm(dim, elementwise_affine=norm_elementwise_affine, eps=norm_eps) + + self.attn1 = Attention( + query_dim=dim, + heads=num_attention_heads, + dim_head=attention_head_dim, + dropout=dropout, + bias=attention_bias, + cross_attention_dim=cross_attention_dim if only_cross_attention else None, + upcast_attention=upcast_attention, + attention_mode=attention_mode + ) + + # 2. Cross-Attn + if cross_attention_dim is not None or double_self_attention: + # We currently only use AdaLayerNormZero for self attention where there will only be one attention block. + # I.e. the number of returned modulation chunks from AdaLayerZero would not make sense if returned during + # the second cross attention block. 
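+            # attn2 cross-attends to the text-encoder hidden states and is the only
+            # attention in this block that receives encoder_attention_mask, so its
+            # attention_mode is pinned to 'xformers' below (the 'flash' path rejects
+            # non-trivial masks).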
+ self.norm2 = ( + AdaLayerNorm(dim, num_embeds_ada_norm) + if self.use_ada_layer_norm + else nn.LayerNorm(dim, elementwise_affine=norm_elementwise_affine, eps=norm_eps) + ) + self.attn2 = Attention( + query_dim=dim, + cross_attention_dim=cross_attention_dim if not double_self_attention else None, + heads=num_attention_heads, + dim_head=attention_head_dim, + dropout=dropout, + bias=attention_bias, + upcast_attention=upcast_attention, + attention_mode='xformers', # only xformers support attention_mask + ) # is self-attn if encoder_hidden_states is none + else: + self.norm2 = None + self.attn2 = None + + # 3. Feed-forward + if not self.use_ada_layer_norm_single: + self.norm3 = nn.LayerNorm(dim, elementwise_affine=norm_elementwise_affine, eps=norm_eps) + + self.ff = FeedForward( + dim, + dropout=dropout, + activation_fn=activation_fn, + final_dropout=final_dropout, + ) + + # 4. Fuser + if attention_type == "gated" or attention_type == "gated-text-image": + self.fuser = GatedSelfAttentionDense(dim, cross_attention_dim, num_attention_heads, attention_head_dim) + + # 5. Scale-shift for PixArt-Alpha. + if self.use_ada_layer_norm_single: + self.scale_shift_table = nn.Parameter(torch.randn(6, dim) / dim**0.5) + + # let chunk size default to None + self._chunk_size = None + self._chunk_dim = 0 + + def forward( + self, + hidden_states: torch.FloatTensor, + attention_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + timestep: Optional[torch.LongTensor] = None, + cross_attention_kwargs: Dict[str, Any] = None, + class_labels: Optional[torch.LongTensor] = None, + ) -> torch.FloatTensor: + # Notice that normalization is always applied before the real computation in the following blocks. + # 0. Self-Attention + batch_size = hidden_states.shape[0] + + if self.use_ada_layer_norm: + norm_hidden_states = self.norm1(hidden_states, timestep) + elif self.use_ada_layer_norm_zero: + norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1( + hidden_states, timestep, class_labels, hidden_dtype=hidden_states.dtype + ) + elif self.use_layer_norm: + norm_hidden_states = self.norm1(hidden_states) + elif self.use_ada_layer_norm_single: + shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = ( + self.scale_shift_table[None] + timestep.reshape(batch_size, 6, -1) + ).chunk(6, dim=1) + norm_hidden_states = self.norm1(hidden_states) + norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa + norm_hidden_states = norm_hidden_states.squeeze(1) + else: + raise ValueError("Incorrect norm used") + + if self.pos_embed is not None: + norm_hidden_states = self.pos_embed(norm_hidden_states) + + # 1. Retrieve lora scale. + lora_scale = cross_attention_kwargs.get("scale", 1.0) if cross_attention_kwargs is not None else 1.0 + + # 2. 
Prepare GLIGEN inputs + cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {} + gligen_kwargs = cross_attention_kwargs.pop("gligen", None) + + attn_output = self.attn1( + norm_hidden_states, + encoder_hidden_states=encoder_hidden_states if self.only_cross_attention else None, + attention_mask=attention_mask, + **cross_attention_kwargs, + ) + if self.use_ada_layer_norm_zero: + attn_output = gate_msa.unsqueeze(1) * attn_output + elif self.use_ada_layer_norm_single: + attn_output = gate_msa * attn_output + + hidden_states = attn_output + hidden_states + if hidden_states.ndim == 4: + hidden_states = hidden_states.squeeze(1) + + # 2.5 GLIGEN Control + if gligen_kwargs is not None: + hidden_states = self.fuser(hidden_states, gligen_kwargs["objs"]) + + # 3. Cross-Attention + if self.attn2 is not None: + if self.use_ada_layer_norm: + norm_hidden_states = self.norm2(hidden_states, timestep) + elif self.use_ada_layer_norm_zero or self.use_layer_norm: + norm_hidden_states = self.norm2(hidden_states) + elif self.use_ada_layer_norm_single: + norm_hidden_states = hidden_states + else: + raise ValueError("Incorrect norm") + + if self.pos_embed is not None and self.use_ada_layer_norm_single is False: + norm_hidden_states = self.pos_embed(norm_hidden_states) + + attn_output = self.attn2( + norm_hidden_states, + encoder_hidden_states=encoder_hidden_states, + attention_mask=encoder_attention_mask, + **cross_attention_kwargs, + ) + hidden_states = attn_output + hidden_states + + # 4. Feed-forward + if not self.use_ada_layer_norm_single: + norm_hidden_states = self.norm3(hidden_states) + + if self.use_ada_layer_norm_zero: + norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None] + + if self.use_ada_layer_norm_single: + norm_hidden_states = self.norm2(hidden_states) + norm_hidden_states = norm_hidden_states * (1 + scale_mlp) + shift_mlp + + if self._chunk_size is not None: + # "feed_forward_chunk_size" can be used to save memory + ff_output = _chunked_feed_forward( + self.ff, norm_hidden_states, self._chunk_dim, self._chunk_size, lora_scale=lora_scale + ) + else: + ff_output = self.ff(norm_hidden_states, scale=lora_scale) + + if self.use_ada_layer_norm_zero: + ff_output = gate_mlp.unsqueeze(1) * ff_output + elif self.use_ada_layer_norm_single: + ff_output = gate_mlp * ff_output + + hidden_states = ff_output + hidden_states + if hidden_states.ndim == 4: + hidden_states = hidden_states.squeeze(1) + + return hidden_states + + +class CombinedTimestepSizeEmbeddings(nn.Module): + def __init__(self, embedding_dim, size_emb_dim, use_additional_conditions: bool = False): + super().__init__() + + self.outdim = size_emb_dim + self.time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0) + self.timestep_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim) + + self.use_additional_conditions = use_additional_conditions + if use_additional_conditions: + self.use_additional_conditions = True + self.additional_condition_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0) + self.resolution_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=size_emb_dim) + self.aspect_ratio_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=size_emb_dim) + + def apply_condition(self, size: torch.Tensor, batch_size: int, embedder: nn.Module): + if size.ndim == 1: + size = size[:, None] + + if size.shape[0] != batch_size: + size = size.repeat(batch_size // size.shape[0], 
1) + if size.shape[0] != batch_size: + raise ValueError(f"`batch_size` should be {size.shape[0]} but found {batch_size}.") + + current_batch_size, dims = size.shape[0], size.shape[1] + size = size.reshape(-1) + size_freq = self.additional_condition_proj(size).to(size.dtype) + + size_emb = embedder(size_freq) + size_emb = size_emb.reshape(current_batch_size, dims * self.outdim) + return size_emb + + def forward(self, timestep, resolution, aspect_ratio, batch_size, hidden_dtype): + timesteps_proj = self.time_proj(timestep) + timesteps_emb = self.timestep_embedder(timesteps_proj.to(dtype=hidden_dtype)) # (N, D) + + if self.use_additional_conditions: + resolution = self.apply_condition(resolution, batch_size=batch_size, embedder=self.resolution_embedder) + aspect_ratio = self.apply_condition( + aspect_ratio, batch_size=batch_size, embedder=self.aspect_ratio_embedder + ) + conditioning = timesteps_emb + torch.cat([resolution, aspect_ratio], dim=1) + else: + conditioning = timesteps_emb + + return conditioning + + +class CaptionProjection(nn.Module): + def __init__(self, in_features, hidden_size, num_tokens=120): + super().__init__() + self.linear_1 = nn.Linear(in_features=in_features, out_features=hidden_size, bias=True) + self.act_1 = nn.GELU(approximate="tanh") + self.linear_2 = nn.Linear(in_features=hidden_size, out_features=hidden_size, bias=True) + self.register_buffer("y_embedding", nn.Parameter(torch.randn(num_tokens, in_features) / in_features**0.5)) + + def forward(self, caption, force_drop_ids=None): + hidden_states = self.linear_1(caption) + hidden_states = self.act_1(hidden_states) + hidden_states = self.linear_2(hidden_states) + return hidden_states + + +class PatchEmbed(nn.Module): + """2D Image to Patch Embedding""" + def __init__( + self, + height=224, + width=224, + patch_size=16, + in_channels=3, + embed_dim=768, + layer_norm=False, + flatten=True, + bias=True, + interpolation_scale=1, + ): + super().__init__() + + num_patches = (height // patch_size) * (width // patch_size) + self.flatten = flatten + self.layer_norm = layer_norm + + self.proj = nn.Conv2d( + in_channels, embed_dim, kernel_size=(patch_size, patch_size), stride=patch_size, bias=bias + ) + if layer_norm: + self.norm = nn.LayerNorm(embed_dim, elementwise_affine=False, eps=1e-6) + else: + self.norm = None + + self.patch_size = patch_size + self.height, self.width = height // patch_size, width // patch_size + self.base_size = height // patch_size + self.interpolation_scale = interpolation_scale + pos_embed = get_2d_sincos_pos_embed( + embed_dim, int(num_patches**0.5), base_size=self.base_size, interpolation_scale=self.interpolation_scale + ) + self.register_buffer("pos_embed", torch.from_numpy(pos_embed).float().unsqueeze(0), persistent=False) + + def forward(self, latent): + height, width = latent.shape[-2] // self.patch_size, latent.shape[-1] // self.patch_size + + latent = self.proj(latent) + if self.flatten: + latent = latent.flatten(2).transpose(1, 2) # BCHW -> BNC + if self.layer_norm: + latent = self.norm(latent) + + if self.height != height or self.width != width: + pos_embed = get_2d_sincos_pos_embed( + embed_dim=self.pos_embed.shape[-1], + grid_size=(height, width), + base_size=self.base_size, + interpolation_scale=self.interpolation_scale, + ) + pos_embed = torch.from_numpy(pos_embed) + pos_embed = pos_embed.float().unsqueeze(0).to(latent.device) + else: + pos_embed = self.pos_embed + + return (latent + pos_embed).to(latent.dtype) + + +class AdaLayerNormSingle(nn.Module): + def __init__(self, 
embedding_dim: int, use_additional_conditions: bool = False): + super().__init__() + + self.emb = CombinedTimestepSizeEmbeddings( + embedding_dim, size_emb_dim=embedding_dim // 3, use_additional_conditions=use_additional_conditions + ) + + self.silu = nn.SiLU() + self.linear = nn.Linear(embedding_dim, 6 * embedding_dim, bias=True) + + def forward( + self, + timestep: torch.Tensor, + added_cond_kwargs: Dict[str, torch.Tensor] = None, + batch_size: int = None, + hidden_dtype: Optional[torch.dtype] = None, + ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: + # No modulation happening here. + embedded_timestep = self.emb(timestep, batch_size=batch_size, hidden_dtype=hidden_dtype, resolution=None, + aspect_ratio=None) + return self.linear(self.silu(embedded_timestep)), embedded_timestep + + +@dataclass +class Transformer3DModelOutput(BaseOutput): + sample: torch.FloatTensor diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/latte/modeling_latte.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/latte/modeling_latte.py new file mode 100644 index 0000000000000000000000000000000000000000..139011a75705fc7b2646db3029cb8c1adbb80e5b --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/latte/modeling_latte.py @@ -0,0 +1,556 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
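+
+# LatteT2V factorizes video attention into two block stacks: `BasicTransformerBlock` runs spatial
+# attention over the patch tokens of each frame (layout `(b f) t d`), while `BasicTransformerBlockTemporal`
+# attends across frames for every spatial location (layout `(b t) f d`). When sequence parallelism is
+# enabled, the temporal dimension is sharded across the parallel group and gathered back after the blocks.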
+ +import os +from typing import Any, Dict, Optional +from dataclasses import dataclass +from einops import rearrange, repeat +import torch +import torch.nn.functional as F +from torch import nn +from diffusers.utils import USE_PEFT_BACKEND, deprecate +from diffusers.models.embeddings import ImagePositionalEmbeddings +from diffusers.configuration_utils import ConfigMixin, register_to_config +from diffusers.models.modeling_utils import ModelMixin +from diffusers.models.lora import LoRACompatibleConv, LoRACompatibleLinear + +from opensoraplan.layers.utils import get_1d_sincos_pos_embed +from opensoraplan.utils.log import logger +from opensoraplan.models.parallel_mgr import ( + get_sequence_parallel_group, + use_sequence_parallel +) +from opensoraplan.models.comm import ( + all_to_all_with_pad, + gather_sequence, + get_spatial_pad, + get_temporal_pad, + set_spatial_pad, + set_temporal_pad, + split_sequence, +) +from opensoraplan.acceleration.dit_cache_common import CacheConfig +from opensoraplan.acceleration.open_sora_plan_dit_cache import OpenSoraPlanDiTCacheManager + +from .latte_modules import PatchEmbed, BasicTransformerBlock, BasicTransformerBlockTemporal, AdaLayerNormSingle, \ + Transformer3DModelOutput, CaptionProjection + +ADA_NORM_SINGLE = "ada_norm_single" +SLICE_TEMPORAL_PATTERN = '(b T) S d -> b T S d' +CHANGE_TF_PATTERN = '(b t) f d -> (b f) t d' + + +@dataclass +class LatteParams: + hidden_states: torch.Tensor + timestep: Optional[torch.LongTensor] = None + encoder_hidden_states: Optional[torch.Tensor] = None + added_cond_kwargs: Dict[str, torch.Tensor] = None + enable_temporal_attentions: bool = True + class_labels: Optional[torch.LongTensor] = None + cross_attention_kwargs: Dict[str, Any] = None + attention_mask: Optional[torch.Tensor] = None + encoder_attention_mask: Optional[torch.Tensor] = None + use_image_num: int = 0 + return_dict: bool = False + + +class LatteT2V(ModelMixin, ConfigMixin): + _supports_gradient_checkpointing = True + + """ + A 2D Transformer model for image-like data. + + Parameters: + num_attention_heads (`int`, *optional*, defaults to 16): The number of heads to use for multi-head attention. + attention_head_dim (`int`, *optional*, defaults to 88): The number of channels in each head. + in_channels (`int`, *optional*): + The number of channels in the input and output (specify if the input is **continuous**). + num_layers (`int`, *optional*, defaults to 1): The number of layers of Transformer blocks to use. + dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use. + cross_attention_dim (`int`, *optional*): The number of `encoder_hidden_states` dimensions to use. + sample_size (`int`, *optional*): The width of the latent images (specify if the input is **discrete**). + This is fixed during training since it is used to learn a number of position embeddings. + num_vector_embeds (`int`, *optional*): + The number of classes of the vector embeddings of the latent pixels (specify if the input is **discrete**). + Includes the class for the masked latent pixel. + activation_fn (`str`, *optional*, defaults to `"geglu"`): Activation function to use in feed-forward. + num_embeds_ada_norm ( `int`, *optional*): + The number of diffusion steps used during training. Pass if at least one of the norm_layers is + `AdaLayerNorm`. This is fixed during training since it is used to learn a number of embeddings that are + added to the hidden states. + + During inference, you can denoise for up to but not more steps than `num_embeds_ada_norm`. 
+ attention_bias (`bool`, *optional*): + Configure if the `TransformerBlocks` attention should contain a bias parameter. + """ + + @register_to_config + def __init__( + self, + num_attention_heads: int = 16, + patch_size_t: int = 1, + attention_head_dim: int = 88, + in_channels: Optional[int] = None, + out_channels: Optional[int] = None, + num_layers: int = 1, + dropout: float = 0.0, + norm_num_groups: int = 32, + cross_attention_dim: Optional[int] = None, + attention_bias: bool = False, + sample_size: Optional[int] = None, + num_vector_embeds: Optional[int] = None, + patch_size: Optional[int] = None, + activation_fn: str = "geglu", + num_embeds_ada_norm: Optional[int] = None, + use_linear_projection: bool = False, + only_cross_attention: bool = False, + double_self_attention: bool = False, + upcast_attention: bool = False, + norm_type: str = "layer_norm", + norm_elementwise_affine: bool = True, + norm_eps: float = 1e-5, + attention_type: str = "default", + caption_channels: int = None, + video_length: int = 17, + attention_mode: str = 'flash' + ): + super().__init__() + self.use_linear_projection = use_linear_projection + self.num_attention_heads = num_attention_heads + self.attention_head_dim = attention_head_dim + inner_dim = num_attention_heads * attention_head_dim + self.video_length = video_length + + conv_cls = nn.Conv2d if USE_PEFT_BACKEND else LoRACompatibleConv + linear_cls = nn.Linear if USE_PEFT_BACKEND else LoRACompatibleLinear + + # 1. Transformer2DModel can process both standard continuous images of shape `(batch_size, num_channels, + # width, height)` as well as quantized image embeddings of shape `(batch_size, num_image_vectors)` + # Define whether input is continuous or discrete depending on configuration + self.is_input_continuous = (in_channels is not None) and (patch_size is None) + self.is_input_vectorized = num_vector_embeds is not None + self.is_input_patches = in_channels is not None and patch_size is not None + self.cache_manager = OpenSoraPlanDiTCacheManager(CacheConfig()) + + if norm_type == "layer_norm" and num_embeds_ada_norm is not None: + deprecation_message = ( + f"The configuration file of this model: {self.__class__} is outdated. `norm_type` is either not set or" + " incorrectly set to `'layer_norm'`.Make sure to set `norm_type` to `'ada_norm'` in the config." + " Please make sure to update the config accordingly as leaving `norm_type` might led to incorrect" + " results in future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it" + " would be very nice if you could open a Pull request for the `transformer/config.json` file" + ) + deprecate("norm_type!=num_embeds_ada_norm", "1.0.0", deprecation_message, standard_warn=False) + norm_type = "ada_norm" + + if self.is_input_continuous and self.is_input_vectorized: + raise ValueError( + f"Cannot define both `in_channels`: {in_channels} and `num_vector_embeds`: {num_vector_embeds}. Make" + " sure that either `in_channels` or `num_vector_embeds` is None." + ) + elif self.is_input_vectorized and self.is_input_patches: + raise ValueError( + f"Cannot define both `num_vector_embeds`: {num_vector_embeds} and `patch_size`: {patch_size}. Make" + " sure that either `num_vector_embeds` or `num_patches` is None." + ) + elif not self.is_input_continuous and not self.is_input_vectorized and not self.is_input_patches: + raise ValueError( + f"Has to define `in_channels`: {in_channels}, `num_vector_embeds`: {num_vector_embeds}, or patch_size:" + f" {patch_size}. 
Make sure that `in_channels`, `num_vector_embeds` or `num_patches` is not None." + ) + + # 2. Define input layers + if self.is_input_continuous: + self.in_channels = in_channels + + self.norm = torch.nn.GroupNorm(num_groups=norm_num_groups, num_channels=in_channels, eps=1e-6, affine=True) + if use_linear_projection: + self.proj_in = linear_cls(in_channels, inner_dim) + else: + self.proj_in = conv_cls(in_channels, inner_dim, kernel_size=1, stride=1, padding=0) + elif self.is_input_vectorized: + if sample_size is None or num_vector_embeds is None: + logger.error("Transformer2DModel over discrete input must provide sample_size and num_embed") + raise ValueError + + self.height = sample_size[0] + self.width = sample_size[1] + self.num_vector_embeds = num_vector_embeds + self.num_latent_pixels = self.height * self.width + + self.latent_image_embedding = ImagePositionalEmbeddings( + num_embed=num_vector_embeds, embed_dim=inner_dim, height=self.height, width=self.width + ) + elif self.is_input_patches: + if sample_size is None: + logger.error("Transformer2DModel over patched input must provide sample_size") + raise ValueError + + self.height = sample_size[0] + self.width = sample_size[1] + + self.patch_size = patch_size + interpolation_scale = self.config.sample_size[0] // 64 # => 64 (= 512 pixart) has interpolation scale 1 + interpolation_scale = max(interpolation_scale, 1) + self.pos_embed = PatchEmbed( + height=sample_size[0], + width=sample_size[1], + patch_size=patch_size, + in_channels=in_channels, + embed_dim=inner_dim, + interpolation_scale=interpolation_scale, + ) + + # 3. Define transformers blocks, spatial attention + self.transformer_blocks = nn.ModuleList( + [ + BasicTransformerBlock( + inner_dim, + num_attention_heads, + attention_head_dim, + dropout=dropout, + cross_attention_dim=cross_attention_dim, + activation_fn=activation_fn, + num_embeds_ada_norm=num_embeds_ada_norm, + attention_bias=attention_bias, + only_cross_attention=only_cross_attention, + double_self_attention=double_self_attention, + upcast_attention=upcast_attention, + norm_type=norm_type, + norm_elementwise_affine=norm_elementwise_affine, + norm_eps=norm_eps, + attention_type=attention_type, + attention_mode=attention_mode + ) + for d in range(num_layers) + ] + ) + + # Define temporal transformers blocks + self.temporal_transformer_blocks = nn.ModuleList( + [ + BasicTransformerBlockTemporal( # one attention + inner_dim, + num_attention_heads, # num_attention_heads + attention_head_dim, # attention_head_dim 72 + dropout=dropout, + cross_attention_dim=None, + activation_fn=activation_fn, + num_embeds_ada_norm=num_embeds_ada_norm, + attention_bias=attention_bias, + only_cross_attention=only_cross_attention, + double_self_attention=False, + upcast_attention=upcast_attention, + norm_type=norm_type, + norm_elementwise_affine=norm_elementwise_affine, + norm_eps=norm_eps, + attention_type=attention_type, + attention_mode=attention_mode + ) + for d in range(num_layers) + ] + ) + + # 4. 
Define output layers + self.out_channels = in_channels if out_channels is None else out_channels + if self.is_input_continuous: + if use_linear_projection: + self.proj_out = linear_cls(inner_dim, in_channels) + else: + self.proj_out = conv_cls(inner_dim, in_channels, kernel_size=1, stride=1, padding=0) + elif self.is_input_vectorized: + self.norm_out = nn.LayerNorm(inner_dim) + self.out = nn.Linear(inner_dim, self.num_vector_embeds - 1) + elif self.is_input_patches and norm_type != ADA_NORM_SINGLE: + self.norm_out = nn.LayerNorm(inner_dim, elementwise_affine=False, eps=1e-6) + self.proj_out_1 = nn.Linear(inner_dim, 2 * inner_dim) + self.proj_out_2 = nn.Linear(inner_dim, patch_size * patch_size * self.out_channels) + elif self.is_input_patches and norm_type == ADA_NORM_SINGLE: + self.norm_out = nn.LayerNorm(inner_dim, elementwise_affine=False, eps=1e-6) + self.scale_shift_table = nn.Parameter(torch.randn(2, inner_dim) / inner_dim ** 0.5) + self.proj_out = nn.Linear(inner_dim, patch_size * patch_size * self.out_channels) + + # 5. PixArt-Alpha blocks. + self.adaln_single = None + self.use_additional_conditions = False + if norm_type == ADA_NORM_SINGLE: + # additional conditions until we find better name + self.adaln_single = AdaLayerNormSingle(inner_dim, use_additional_conditions=self.use_additional_conditions) + + self.caption_projection = None + if caption_channels is not None: + self.caption_projection = CaptionProjection(in_features=caption_channels, hidden_size=inner_dim) + + self.gradient_checkpointing = False + + interpolation_scale = self.config.video_length // 5 # => 5 (= 5 our causalvideovae) has interpolation scale 1 + interpolation_scale = max(interpolation_scale, 1) + temp_pos_embed = get_1d_sincos_pos_embed(inner_dim, video_length, interpolation_scale=interpolation_scale) + self.register_buffer("temp_pos_embed", torch.from_numpy(temp_pos_embed).float().unsqueeze(0), persistent=False) + + def forward( + self, + latte_params: LatteParams, + t_idx: torch.Tensor = 0, + ): + hidden_states = latte_params.hidden_states + timestep = latte_params.timestep + encoder_hidden_states = latte_params.encoder_hidden_states + added_cond_kwargs = latte_params.added_cond_kwargs + enable_temporal_attentions = latte_params.enable_temporal_attentions + class_labels = latte_params.class_labels + cross_attention_kwargs = latte_params.cross_attention_kwargs + attention_mask = latte_params.attention_mask + encoder_attention_mask = latte_params.encoder_attention_mask + use_image_num = latte_params.use_image_num + return_dict = latte_params.return_dict + + input_batch_size, c, frame, h, w = hidden_states.shape + frame = frame - use_image_num + hidden_states = rearrange(hidden_states, 'b c f h w -> (b f) c h w').contiguous() + # ensure attention_mask is a bias, and give it a singleton query_tokens dimension. + # we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward. + # we can tell by counting dims; if ndim == 2: it's a mask rather than a bias. + # expects mask of shape: [batch, key_tokens] + # adds singleton query_tokens dimension:[batch, 1, key_tokens] + # this helps to broadcast it as a bias over attention scores, which will be in one of the following shapes: + # [batch, heads, query_tokens, key_tokens] (e.g. torch sdp attn) + # [batch * heads, query_tokens, key_tokens] (e.g. 
xformers or classic attn) + if attention_mask is not None and attention_mask.ndim == 2: + # assume that mask is expressed as: + # (1 = keep, 0 = discard) + # convert mask into a bias that can be added to attention scores: + # (keep = +0, discard = -10000.0) + attention_mask = (1 - attention_mask.to(hidden_states.dtype)) * -10000.0 + attention_mask = attention_mask.unsqueeze(1) + attention_mask = attention_mask.to(self.dtype) + # 1 + 4, 1 -> video condition, 4 -> image condition + # convert encoder_attention_mask to a bias the same way we do for attention_mask + if encoder_attention_mask is not None and encoder_attention_mask.ndim == 2: # ndim == 2 means no image joint + encoder_attention_mask = (1 - encoder_attention_mask.to(hidden_states.dtype)) * -10000.0 + encoder_attention_mask = encoder_attention_mask.unsqueeze(1) + encoder_attention_mask = repeat(encoder_attention_mask, 'b 1 l -> (b f) 1 l', f=frame).contiguous() + encoder_attention_mask = encoder_attention_mask.to(self.dtype) + elif encoder_attention_mask is not None and encoder_attention_mask.ndim == 3: # ndim == 3 means image joint + encoder_attention_mask = (1 - encoder_attention_mask.to(hidden_states.dtype)) * -10000.0 + encoder_attention_mask_video = encoder_attention_mask[:, :1, ...] + encoder_attention_mask_video = repeat(encoder_attention_mask_video, 'b 1 l -> b (1 f) l', + f=frame).contiguous() + encoder_attention_mask_image = encoder_attention_mask[:, 1:, ...] + encoder_attention_mask = torch.cat([encoder_attention_mask_video, encoder_attention_mask_image], dim=1) + encoder_attention_mask = rearrange(encoder_attention_mask, 'b n l -> (b n) l').contiguous().unsqueeze(1) + encoder_attention_mask = encoder_attention_mask.to(self.dtype) + + # Retrieve lora scale. + lora_scale = cross_attention_kwargs.get("scale", 1.0) if cross_attention_kwargs is not None else 1.0 + + # 1. Input + if self.is_input_patches: # here + height, width = hidden_states.shape[-2] // self.patch_size, hidden_states.shape[-1] // self.patch_size + num_patches = height * width + + hidden_states = self.pos_embed(hidden_states.to(self.dtype)) # alrady add positional embeddings + + if self.adaln_single is not None: + if self.use_additional_conditions and added_cond_kwargs is None: + raise ValueError( + "`added_cond_kwargs` cannot be None when using additional conditions for `adaln_single`." + ) + batch_size = input_batch_size + timestep, embedded_timestep = self.adaln_single( + timestep, added_cond_kwargs, batch_size=batch_size, hidden_dtype=hidden_states.dtype + ) + + t_dim = frame + use_image_num + s_dim = num_patches + # shard over the sequence dim if sp is enabled + if use_sequence_parallel(): + set_temporal_pad(t_dim) + set_spatial_pad(s_dim) + hidden_states = rearrange(hidden_states, SLICE_TEMPORAL_PATTERN, T=t_dim, S=s_dim).contiguous() + hidden_states = split_sequence(hidden_states, get_sequence_parallel_group(), dim=1, pad=get_temporal_pad()) + t_dim = hidden_states.shape[1] + hidden_states = rearrange(hidden_states, 'b T S d -> (b T) S d', T=t_dim, S=s_dim).contiguous() + + # 2. Blocks + if self.caption_projection is not None: + batch_size = hidden_states.shape[0] + encoder_hidden_states = self.caption_projection(encoder_hidden_states.to(self.dtype)) # 3 120 1152 + + if use_image_num != 0 and self.training: + encoder_hidden_states_video = encoder_hidden_states[:, :1, ...] 
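+                # joint image-video training: the first caption embedding conditions all `frame` video
+                # frames (broadcast by the repeat below), while the remaining `use_image_num` embeddings
+                # condition the extra still images appended after the video frames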
+ encoder_hidden_states_video = repeat(encoder_hidden_states_video, 'b 1 t d -> b (1 f) t d', + f=frame).contiguous() + encoder_hidden_states_image = encoder_hidden_states[:, 1:, ...] + encoder_hidden_states = torch.cat([encoder_hidden_states_video, encoder_hidden_states_image], dim=1) + encoder_hidden_states_spatial = rearrange(encoder_hidden_states, 'b f t d -> (b f) t d').contiguous() + else: + encoder_hidden_states_spatial = repeat(encoder_hidden_states, 'b t d -> (b f) t d', + f=t_dim).contiguous() + + # prepare timesteps for spatial and temporal block + timestep_spatial = repeat(timestep, 'b d -> (b f) d', f=t_dim).contiguous() + timestep_temp = repeat(timestep, 'b d -> (b p) d', p=num_patches).contiguous() + + if self.training: + for i, (spatial_block, temp_block) in enumerate(zip(self.transformer_blocks, + self.temporal_transformer_blocks)): + if self.gradient_checkpointing: + hidden_states = torch.utils.checkpoint.checkpoint( + spatial_block, + hidden_states, + attention_mask, + encoder_hidden_states_spatial, + encoder_attention_mask, + timestep_spatial, + cross_attention_kwargs, + class_labels, + use_reentrant=False, + ) + + if enable_temporal_attentions: + hidden_states = rearrange(hidden_states, + '(b f) t d -> (b t) f d', + b=input_batch_size).contiguous() + + if use_image_num != 0: # image-video join training + hidden_states_video = hidden_states[:, :frame, ...] + hidden_states_image = hidden_states[:, frame:, ...] + + if i == 0: + hidden_states_video = hidden_states_video + self.temp_pos_embed + + hidden_states_video = torch.utils.checkpoint.checkpoint( + temp_block, + hidden_states_video, + None, # attention_mask + None, # encoder_hidden_states + None, # encoder_attention_mask + timestep_temp, + cross_attention_kwargs, + class_labels, + use_reentrant=False, + ) + + hidden_states = torch.cat([hidden_states_video, hidden_states_image], dim=1) + hidden_states = rearrange(hidden_states, CHANGE_TF_PATTERN, + b=input_batch_size).contiguous() + + else: + if i == 0: + hidden_states = hidden_states + self.temp_pos_embed + + hidden_states = torch.utils.checkpoint.checkpoint( + temp_block, + hidden_states, + None, # attention_mask + None, # encoder_hidden_states + None, # encoder_attention_mask + timestep_temp, + cross_attention_kwargs, + class_labels, + use_reentrant=False, + ) + + hidden_states = rearrange(hidden_states, CHANGE_TF_PATTERN, + b=input_batch_size).contiguous() + else: + block_list = [self.transformer_blocks, self.temporal_transformer_blocks] + self.cache_manager.temp_pos_embed = self.temp_pos_embed + hidden_states = self.cache_manager(t_idx, block_list, hidden_states, + attention_mask=attention_mask, + encoder_hidden_states_spatial=encoder_hidden_states_spatial, + encoder_attention_mask=encoder_attention_mask, + timestep_spatial=timestep_spatial, + timestep_temp=timestep_temp, + cross_attention_kwargs=cross_attention_kwargs, + class_labels=class_labels, + input_batch_size=input_batch_size, + enable_temporal_attentions=enable_temporal_attentions, + t_dim=t_dim, + s_dim=s_dim, + timestep=timestep) + + if use_sequence_parallel(): + hidden_states = rearrange(hidden_states, "(B T) S C -> B T S C", B=input_batch_size, T=t_dim, S=s_dim) + hidden_states = gather_sequence(hidden_states, get_sequence_parallel_group(), dim=1, pad=get_temporal_pad()) + t_dim, s_dim = hidden_states.shape[1], hidden_states.shape[2] + hidden_states = rearrange(hidden_states, "B T S C -> (B T) S C", T=t_dim, S=s_dim) + + if self.is_input_patches: + if self.config.norm_type != ADA_NORM_SINGLE: + 
conditioning = self.transformer_blocks[0].norm1.emb( + timestep, class_labels, hidden_dtype=hidden_states.dtype + ) + shift, scale = self.proj_out_1(F.silu(conditioning)).chunk(2, dim=1) + hidden_states = self.norm_out(hidden_states) * (1 + scale[:, None]) + shift[:, None] + hidden_states = self.proj_out_2(hidden_states) + elif self.config.norm_type == ADA_NORM_SINGLE: + embedded_timestep = repeat(embedded_timestep, 'b d -> (b f) d', f=frame + use_image_num).contiguous() + shift, scale = (self.scale_shift_table[None] + embedded_timestep[:, None]).chunk(2, dim=1) + hidden_states = self.norm_out(hidden_states) + # Modulation + hidden_states = hidden_states * (1 + scale) + shift + hidden_states = self.proj_out(hidden_states) + + # unpatchify + if self.adaln_single is None: + height = width = int(hidden_states.shape[1] ** 0.5) + hidden_states = hidden_states.reshape( + shape=(-1, height, width, self.patch_size, self.patch_size, self.out_channels) + ) + hidden_states = torch.einsum("nhwpqc->nchpwq", hidden_states) + output = hidden_states.reshape( + shape=(-1, self.out_channels, height * self.patch_size, width * self.patch_size) + ) + output = rearrange(output, '(b f) c h w -> b c f h w', b=input_batch_size).contiguous() + + if not return_dict: + return (output,) + + return Transformer3DModelOutput(sample=output) + + def _dynamic_switch(self, x, s, t, temporal_to_spatial: bool): + if temporal_to_spatial: + scatter_dim, gather_dim = 2, 1 + scatter_pad = get_spatial_pad() + gather_pad = get_temporal_pad() + else: + scatter_dim, gather_dim = 1, 2 + scatter_pad = get_temporal_pad() + gather_pad = get_spatial_pad() + + x = all_to_all_with_pad( + x, + get_sequence_parallel_group(), + scatter_dim=scatter_dim, + gather_dim=gather_dim, + scatter_pad=scatter_pad, + gather_pad=gather_pad, + ) + new_s, new_t = x.shape[2], x.shape[1] + x = rearrange(x, "b t s d -> (b t) s d") + return x, new_s, new_t + + def _set_gradient_checkpointing(self, module, value=False): + self.gradient_checkpointing = value + + +def latte_t2v_8b(**kwargs): + return LatteT2V(num_layers=56, attention_head_dim=72, num_attention_heads=32, patch_size_t=1, patch_size=2, + norm_type=ADA_NORM_SINGLE, caption_channels=4096, cross_attention_dim=2304, sample_size=[64, 64], + in_channels=4, out_channels=8, **kwargs) \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/model_load_utils.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/model_load_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..8d4f191caf99b16a5e06cea23fa411e60d04357f --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/model_load_utils.py @@ -0,0 +1,30 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright(C) 2024. Huawei Technologies Co.,Ltd. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import torch
+import safetensors.torch
+
+SAFETENSORS_EXTENSION = "safetensors"
+
+
+def load_state_dict(model_path):
+    name = os.path.basename(model_path).split('.')[-1]  # extension of the weights file
+    if name == SAFETENSORS_EXTENSION:  # diffusers checkpoints share the safetensors extension
+        return safetensors.torch.load_file(model_path, device="cpu")  # first load on cpu
+    else:
+        # also covers sharded Hugging Face weight files saved with torch.save
+        return torch.load(model_path, map_location="cpu")  # first load on cpu
diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/parallel_mgr.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/parallel_mgr.py
new file mode 100644
index 0000000000000000000000000000000000000000..f70bed20ab6b47a3dc8672426b06ff0ff3c1928b
--- /dev/null
+++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/models/parallel_mgr.py
@@ -0,0 +1,64 @@
+#!/usr/bin/env python
+# coding=utf-8
+# Copyright 2024 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .compile_pipe import compile_pipe \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/pipeline/compile_pipe.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/pipeline/compile_pipe.py new file mode 100644 index 0000000000000000000000000000000000000000..576e965ab946eae6733c6799dd6710bf53a3ee49 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/pipeline/compile_pipe.py @@ -0,0 +1,47 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright(C) 2024. Huawei Technologies Co.,Ltd. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from ..utils import is_npu_available +from ..acceleration.dit_cache_common import DiTCacheManager + +CFG_MAX_STEP = 10000 + + +def compile_pipe(pipe, cache_manager: DiTCacheManager = None, + cfg_last_step: int = CFG_MAX_STEP): + if not isinstance(cfg_last_step, int): + raise TypeError(f"Expected int for cfg_last_step, but got {type(cfg_last_step).__name__}") + + if is_npu_available(): + device = 'npu' + if hasattr(pipe, "text_encoder"): + pipe.text_encoder.to(device) + else: + raise TypeError("Please input valid pipeline") + if hasattr(pipe, "transformer"): + pipe.transformer.to(device) + if hasattr(pipe, "vae"): + pipe.vae.to(device) + + if cache_manager is not None: + if not hasattr(cache_manager, "use_cache"): + raise TypeError("Please input valid cache_manager") + pipe.transformer.cache_manager = cache_manager + if cfg_last_step != CFG_MAX_STEP: + pipe.cfg_last_step = cfg_last_step + return pipe + else: + raise RuntimeError("NPU is not available.") diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/pipeline/open_sora_plan_pipeline.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/pipeline/open_sora_plan_pipeline.py new file mode 100644 index 0000000000000000000000000000000000000000..3710da219ab138d8e913aeb6ffe3fc35937a747c --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/pipeline/open_sora_plan_pipeline.py @@ -0,0 +1,744 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Callable, List, Optional, Tuple, Union +import re +from dataclasses import dataclass +import html +import inspect +import urllib.parse as ul + +import torch +import torch_npu +from transformers import T5EncoderModel, T5Tokenizer + +from diffusers.models import AutoencoderKL, Transformer2DModel +from diffusers.schedulers import DPMSolverMultistepScheduler +from diffusers.utils import ( + BACKENDS_MAPPING, + is_bs4_available, + is_ftfy_available, + replace_example_docstring, +) +from diffusers.utils.torch_utils import randn_tensor +from diffusers.utils import BaseOutput +from opensoraplan.utils.log import logger +from opensoraplan import LatteParams +from .pipeline_utils import OpenSoraPlanPipelineBase + +TENSOR_TYPE_PT = "pt" +SUPPORT_VIDEO_LEN = [5, 17] +SUPPORT_IMAGE_SIZE = [256, 512] + + +if is_bs4_available(): + from bs4 import BeautifulSoup + +if is_ftfy_available(): + import ftfy + +EXAMPLE_DOC_STRING = """ + Examples: + ```py + >>> import torch + >>> from diffusers import PixArtAlphaPipeline + + >>> # You can replace the checkpoint id with "PixArt-alpha/PixArt-XL-2-512x512" too. + >>> pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16) + >>> # Enable memory optimizations. + >>> pipe.enable_model_cpu_offload() + + >>> prompt = "A small cactus with a happy face in the Sahara desert." + >>> image = pipe(prompt).images[0] + ``` +""" + + +@dataclass +class LatentsParams: + batch_size: int + num_channels_latents: int + video_length: int + height: int + width: int + dtype: torch.dtype + device: torch.device + + +@dataclass +class InputParams: + prompt: str + height: int + width: int + negative_prompt: str + callback_steps: int + + +@dataclass +class VideoPipelineOutput(BaseOutput): + video: torch.Tensor + + +class OpenSoraPlanPipeline(OpenSoraPlanPipelineBase): + r""" + pipeline for text-to-image generation using PixArt-Alpha. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the + library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + + Args: + vae ([`AutoencoderKL`]): + Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + text_encoder ([`T5EncoderModel`]): + Frozen text-encoder. PixArt-Alpha uses + [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel), specifically the + [t5-v1_1-xxl](https://huggingface.co/PixArt-alpha/PixArt-alpha/tree/main/t5-v1_1-xxl) variant. + tokenizer (`T5Tokenizer`): + Tokenizer of class + [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer). + transformer ([`Transformer2DModel`]): + A text conditioned `Transformer2DModel` to denoise the encoded image latents. + scheduler ([`SchedulerMixin`]): + A scheduler to be used in combination with `transformer` to denoise the encoded image latents. 
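+
+    Examples:
+        A minimal usage sketch. The checkpoint directory path is hypothetical and diffusers-style
+        `from_pretrained` loading is assumed to be provided by `OpenSoraPlanPipelineBase`; only
+        `compile_pipe` and the `video` field of the returned `VideoPipelineOutput` are defined in this
+        package.
+
+        ```py
+        >>> import torch
+        >>> from opensoraplan.pipeline import compile_pipe
+        >>> from opensoraplan.pipeline.open_sora_plan_pipeline import OpenSoraPlanPipeline
+
+        >>> # hypothetical local checkpoint directory containing the sub-model folders
+        >>> pipe = OpenSoraPlanPipeline.from_pretrained("./open-sora-plan-weights", torch_dtype=torch.float16)
+        >>> pipe = compile_pipe(pipe)  # moves text_encoder / transformer / vae onto the NPU
+        >>> video = pipe(prompt="A small cactus with a happy face in the Sahara desert.").video
+        ```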
+ """ + bad_punct_regex = re.compile( + r"[" + "#®•©™&@·º½¾¿¡§~" + "\)" + "\(" + "\]" + "\[" + "\}" + "\{" + "\|" + "\\" + "\/" + "\*" + r"]{1,}" + ) # noqa + + _optional_components = ["tokenizer", "text_encoder"] + model_cpu_offload_seq = "text_encoder->transformer->vae" + + def __init__( + self, + tokenizer: T5Tokenizer, + text_encoder: T5EncoderModel, + vae: AutoencoderKL, + transformer: Transformer2DModel, + scheduler: DPMSolverMultistepScheduler, + video_length: int = 17, + image_size: int = 256 + ): + super().__init__() + if video_length not in SUPPORT_VIDEO_LEN: + raise ValueError("Input video_length is not supported.") + + if image_size not in SUPPORT_IMAGE_SIZE: + raise ValueError("Input image_size is not supported.") + + torch.set_grad_enabled(False) + + self.text_encoder = text_encoder + self.tokenizer = tokenizer + self.transformer = transformer + self.vae = vae + self.scheduler = scheduler + self.video_length = video_length + self.image_size = image_size + self.cfg_last_step = 10000 + + @torch.no_grad() + @replace_example_docstring(EXAMPLE_DOC_STRING) + def __call__( + self, + prompt: Union[str, List[str]] = None, + num_inference_steps: int = 20, + guidance_scale: float = 4.5, + num_images_per_prompt: Optional[int] = 1, + enable_temporal_attentions: bool = True, + ) -> Union[VideoPipelineOutput, Tuple]: + """ + Function invoked when calling the pipeline for generation. + + Args: + prompt (`str` or `List[str]`, *optional*): + The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. + instead. + num_inference_steps (`int`, *optional*, defaults to 100): + The number of denoising steps. More denoising steps usually lead to a higher quality image at the + expense of slower inference. + guidance_scale (`float`, *optional*, defaults to 7.0): + Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). + `guidance_scale` is defined as `w` of equation 2. of [Imagen + Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > + 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, + usually at the expense of lower image quality. + num_images_per_prompt (`int`, *optional*, defaults to 1): + The number of images to generate per prompt. + enable_temporal_attentions (`bool`, *optional*, defaults to True): + Whether to enable temporal attentions, if force images, the value should be set False. + Examples: + + Returns: + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images + """ + # 1. Check inputs. 
Raise error if not correct + negative_prompt = "" + eta = 0.0 + generator = None + latents = None + prompt_embeds = None + negative_prompt_embeds = None + output_type = "pil" + return_dict = True + callback = None + callback_steps = 1 + clean_caption = True + mask_feature = True + + height = width = self.image_size + input_params = InputParams(prompt, height, width, negative_prompt, callback_steps) + self._check_inputs(input_params, prompt_embeds, negative_prompt_embeds) + if num_inference_steps < 4 or num_inference_steps > 300: + raise ValueError("num_inference_steps should be in the range of [4, 300].") + if self.transformer.cache_manager.start_step < 0 or \ + self.transformer.cache_manager.start_step > (num_inference_steps - 1): + raise ValueError("start_step should be in the range of [0, num_inference_steps-1]") + if self.transformer.cache_manager.step_interval < 1 or \ + self.transformer.cache_manager.step_interval > (num_inference_steps - 2): + raise ValueError("step_interval should be in the range of [1, num_inference_steps-2]") + if num_images_per_prompt < 1 or num_images_per_prompt > 100: + raise ValueError("num_images_per_prompt should be in the range of [1, 100].") + if self.cfg_last_step < 0: + raise ValueError("cfg_last_step should be not less than 0.") + + # 2. Default height and width to transformer + if prompt is not None and isinstance(prompt, str): + batch_size = 1 + elif prompt is not None and isinstance(prompt, list): + batch_size = len(prompt) + else: + batch_size = prompt_embeds.shape[0] + + device = self.text_encoder.device or self._execution_device + + # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) + # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` + # corresponds to doing no classifier free guidance. + do_classifier_free_guidance = guidance_scale > 1.0 + + # 3. Encode input prompt + prompt_embeds, negative_prompt_embeds = self._encode_prompt( + prompt, + do_classifier_free_guidance, + negative_prompt=negative_prompt, + num_images_per_prompt=num_images_per_prompt, + device=device, + prompt_embeds=prompt_embeds, + negative_prompt_embeds=negative_prompt_embeds, + clean_caption=clean_caption, + mask_feature=mask_feature, + ) + torch.npu.empty_cache() + + if do_classifier_free_guidance: + prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0) + + # 4. Prepare timesteps + self.scheduler.set_timesteps(num_inference_steps, device=device) + timesteps = self.scheduler.timesteps + + # 5. Prepare latents. + latent_channels = self.transformer.config.in_channels + latents_params = LatentsParams(batch_size * num_images_per_prompt, latent_channels, self.video_length, + height, width, prompt_embeds.dtype, device) + latents = self._prepare_latents(latents_params, generator, latents) + + # 6. Prepare extra step kwargs. + extra_step_kwargs = self._prepare_extra_step_kwargs(generator, eta) + + # 6.1 Prepare micro-conditions. + added_cond_kwargs = {"resolution": None, "aspect_ratio": None} + + # 7. 
Denoising loop + num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0) + + with self.progress_bar(total=num_inference_steps) as progress_bar: + for i, t in enumerate(timesteps): + if i == self.cfg_last_step: + prompt_embeds = prompt_embeds[1:2] + if i >= self.cfg_last_step: + do_classifier_free_guidance = False + + latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents + latent_model_input = self.scheduler.scale_model_input(latent_model_input, t) + + current_timestep = t + if not torch.is_tensor(current_timestep): + # This would be a good case for the `match` statement (Python 3.10+) + is_mps = latent_model_input.device.type == "mps" + if isinstance(current_timestep, float): + dtype = torch.float32 if is_mps else torch.float64 + else: + dtype = torch.int32 if is_mps else torch.int64 + current_timestep = torch.tensor([current_timestep], dtype=dtype, device=latent_model_input.device) + elif len(current_timestep.shape) == 0: + current_timestep = current_timestep[None].to(latent_model_input.device) + # broadcast to batch dimension in a way that's compatible with ONNX/Core ML + current_timestep = current_timestep.expand(latent_model_input.shape[0]) + + latte_params = LatteParams( + hidden_states=latent_model_input, + encoder_hidden_states=prompt_embeds, + timestep=current_timestep, + added_cond_kwargs=added_cond_kwargs, + enable_temporal_attentions=enable_temporal_attentions, + ) + # predict noise model_output + noise_pred = self.transformer( + latte_params, + t_idx=i, + )[0] + + # perform guidance + if do_classifier_free_guidance: + noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) + noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) + + # learned sigma + if self.transformer.config.out_channels // 2 == latent_channels: + noise_pred = noise_pred.chunk(2, dim=1)[0] + else: + noise_pred = noise_pred + + # compute previous image: x_t -> x_t-1 + latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0] + + # call the callback, if provided + if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0): + progress_bar.update() + if callback is not None and i % callback_steps == 0: + step_idx = i // getattr(self.scheduler, "order", 1) + callback(step_idx, t, latents) + + if not output_type == 'latents': + video = self._decode_latents(latents) + else: + video = latents + return VideoPipelineOutput(video=video) + + if not return_dict: + return (video,) + + return VideoPipelineOutput(video=video) + + # Adapted from https://github.com/PixArt-alpha/PixArt-alpha/blob/master/diffusion/model/utils.py + def _mask_text_embeddings(self, emb, mask): + if emb.shape[0] == 1: + keep_index = mask.sum().item() + return emb[:, :, :keep_index, :], keep_index # 1, 120, 4096 -> 1 7 4096 + else: + masked_feature = emb * mask[:, None, :, None] # 1 120 4096 + return masked_feature, emb.shape[2] + + # Adapted from diffusers.pipelines.deepfloyd_if.pipeline_if.encode_prompt + def _encode_prompt( + self, + prompt: Union[str, List[str]], + do_classifier_free_guidance: bool = True, + negative_prompt: str = "", + num_images_per_prompt: int = 1, + device: Optional[torch.device] = None, + prompt_embeds: Optional[torch.FloatTensor] = None, + negative_prompt_embeds: Optional[torch.FloatTensor] = None, + clean_caption: bool = False, + mask_feature: bool = True, + ): + r""" + Encodes the prompt into text encoder hidden states. 
+ + Args: + prompt (`str` or `List[str]`, *optional*): + prompt to be encoded + negative_prompt (`str` or `List[str]`, *optional*): + The prompt not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` + instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`). For + PixArt-Alpha, this should be "". + do_classifier_free_guidance (`bool`, *optional*, defaults to `True`): + whether to use classifier free guidance or not + num_images_per_prompt (`int`, *optional*, defaults to 1): + number of images that should be generated per prompt + device: (`torch.device`, *optional*): + torch device to place the resulting embeddings on + prompt_embeds (`torch.FloatTensor`, *optional*): + Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not + provided, text embeddings will be generated from `prompt` input argument. + negative_prompt_embeds (`torch.FloatTensor`, *optional*): + Pre-generated negative text embeddings. For PixArt-Alpha, it's should be the embeddings of the "" + string. + clean_caption (bool, defaults to `False`): + If `True`, the function will preprocess and clean the provided caption before encoding. + mask_feature: (bool, defaults to `True`): + If `True`, the function will mask the text embeddings. + """ + embeds_initially_provided = prompt_embeds is not None and negative_prompt_embeds is not None + + if device is None: + device = self.text_encoder.device or self._execution_device + + if prompt is not None and isinstance(prompt, str): + batch_size = 1 + elif prompt is not None and isinstance(prompt, list): + batch_size = len(prompt) + else: + batch_size = prompt_embeds.shape[0] + + # See Section 3.1. of the paper. + max_length = 300 + + if prompt_embeds is None: + prompt = self._text_preprocessing(prompt, clean_caption=clean_caption) + text_inputs = self.tokenizer( + prompt, + padding="max_length", + max_length=max_length, + truncation=True, + return_attention_mask=True, + add_special_tokens=True, + return_tensors=TENSOR_TYPE_PT, + ) + text_input_ids = text_inputs.input_ids + untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors=TENSOR_TYPE_PT).input_ids + + if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal( + text_input_ids, untruncated_ids + ): + removed_text = self.tokenizer.batch_decode(untruncated_ids[:, max_length - 1: -1]) + logger.warning( + "The following part of your input was truncated because the model can only handle sequences up to" + f" {max_length} tokens: {removed_text}" + ) + + attention_mask = text_inputs.attention_mask.to(device) + prompt_embeds_attention_mask = attention_mask + + prompt_embeds = self.text_encoder(text_input_ids.to(device), attention_mask=attention_mask) + prompt_embeds = prompt_embeds[0] + else: + prompt_embeds_attention_mask = torch.ones_like(prompt_embeds) + + if self.text_encoder is not None: + dtype = self.text_encoder.dtype + elif self.transformer is not None: + dtype = self.transformer.dtype + else: + dtype = None + + prompt_embeds = prompt_embeds.to(dtype=dtype, device=device) + + bs_embed, seq_len, _ = prompt_embeds.shape + # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method + prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1) + prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1) + prompt_embeds_attention_mask = prompt_embeds_attention_mask.view(bs_embed, -1) + 
prompt_embeds_attention_mask = prompt_embeds_attention_mask.repeat(num_images_per_prompt, 1) + + # get unconditional embeddings for classifier free guidance + if do_classifier_free_guidance and negative_prompt_embeds is None: + uncond_tokens = [negative_prompt] * batch_size + uncond_tokens = self._text_preprocessing(uncond_tokens, clean_caption=clean_caption) + max_length = prompt_embeds.shape[1] + uncond_input = self.tokenizer( + uncond_tokens, + padding="max_length", + max_length=max_length, + truncation=True, + return_attention_mask=True, + add_special_tokens=True, + return_tensors=TENSOR_TYPE_PT, + ) + attention_mask = uncond_input.attention_mask.to(device) + + negative_prompt_embeds = self.text_encoder( + uncond_input.input_ids.to(device), + attention_mask=attention_mask, + ) + negative_prompt_embeds = negative_prompt_embeds[0] + + if do_classifier_free_guidance: + # duplicate unconditional embeddings for each generation per prompt, using mps friendly method + seq_len = negative_prompt_embeds.shape[1] + + negative_prompt_embeds = negative_prompt_embeds.to(dtype=dtype, device=device) + + negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_images_per_prompt, 1) + negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1) + + # For classifier free guidance, we need to do two forward passes. + # Here we concatenate the unconditional and text embeddings into a single batch + # to avoid doing two forward passes + else: + negative_prompt_embeds = None + + # Perform additional masking. + if mask_feature and not embeds_initially_provided: + prompt_embeds = prompt_embeds.unsqueeze(1) + masked_prompt_embeds, keep_indices = self._mask_text_embeddings(prompt_embeds, prompt_embeds_attention_mask) + masked_prompt_embeds = masked_prompt_embeds.squeeze(1) + masked_negative_prompt_embeds = ( + negative_prompt_embeds[:, :keep_indices, :] if negative_prompt_embeds is not None else None + ) + + return masked_prompt_embeds, masked_negative_prompt_embeds + + return prompt_embeds, negative_prompt_embeds + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline. + # prepare_extra_step_kwargs + def _prepare_extra_step_kwargs(self, generator, eta): + # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature + # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers. 
+ # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502 + # and should be between [0, 1] + + accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys()) + extra_step_kwargs = {} + if accepts_eta: + extra_step_kwargs["eta"] = eta + + # check if the scheduler accepts generator + accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys()) + if accepts_generator: + extra_step_kwargs["generator"] = generator + return extra_step_kwargs + + def _check_inputs( + self, + inpt_params, + prompt_embeds=None, + negative_prompt_embeds=None, + ): + if inpt_params.height % 8 != 0 or inpt_params.width % 8 != 0: + raise ValueError(f"`height` and `width` have to be divisible by 8 but are " + f"{inpt_params.height} and {inpt_params.width}.") + + callback_not_none = ((inpt_params.callback_steps is not None) and + (not isinstance(inpt_params.callback_steps, int) or inpt_params.callback_steps <= 0)) + if (inpt_params.callback_steps is None) or callback_not_none: + raise ValueError( + f"`callback_steps` has to be a positive integer but is {inpt_params.callback_steps} of type" + f" {type(inpt_params.callback_steps)}." + ) + + if inpt_params.prompt is not None and prompt_embeds is not None: + raise ValueError( + f"Cannot forward both `prompt`: {inpt_params.prompt} and `prompt_embeds`: {prompt_embeds}. " + f"Please make sure to only forward one of the two." + ) + elif inpt_params.prompt is None and prompt_embeds is None: + raise ValueError( + "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined." + ) + elif inpt_params.prompt is not None and (not isinstance(inpt_params.prompt, str) and + not isinstance(inpt_params.prompt, list)): + raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(inpt_params.prompt)}") + + if inpt_params.prompt is not None and negative_prompt_embeds is not None: + raise ValueError( + f"Cannot forward both `prompt`: {inpt_params.prompt} and `negative_prompt_embeds`:" + f" {negative_prompt_embeds}. Please make sure to only forward one of the two." + ) + + if inpt_params.negative_prompt is not None and negative_prompt_embeds is not None: + raise ValueError( + f"Cannot forward both `negative_prompt`: {inpt_params.negative_prompt} and `negative_prompt_embeds`:" + f" {negative_prompt_embeds}. Please make sure to only forward one of the two." + ) + + if prompt_embeds is not None and negative_prompt_embeds is not None: + if prompt_embeds.shape != negative_prompt_embeds.shape: + raise ValueError( + "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but" + f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`" + f" {negative_prompt_embeds.shape}." + ) + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents + def _prepare_latents(self, latents_params, generator, latents=None): + shape = ( + latents_params.batch_size, + latents_params.num_channels_latents, + latents_params.video_length, + self.vae.latent_size[0], + self.vae.latent_size[1] + ) + if isinstance(generator, list) and len(generator) != latents_params.batch_size: + raise ValueError( + f"You have passed a list of generators of length {len(generator)}, but requested an effective batch" + f" size of {latents_params.batch_size}. Make sure the batch size matches the length of the generators." 
+ ) + + if latents is None: + latents = randn_tensor(shape, generator=generator, device=latents_params.device, dtype=latents_params.dtype) + else: + latents = latents.to(latents_params.device) + + # scale the initial noise by the standard deviation required by the scheduler + latents = latents * self.scheduler.init_noise_sigma + return latents + + def _decode_latents(self, latents): + video = self.vae.decode(latents) + video = ((video / 2.0 + 0.5).clamp(0, 1) * 255).to(dtype=torch.uint8).cpu().permute(0, 1, 3, 4, 2).contiguous() + # we always cast to float32 as this does not cause significant overhead and is compatible with bfloa16 + return video + + # Copied from diffusers.pipelines.deepfloyd_if.pipeline_if.IFPipeline._text_preprocessing + def _text_preprocessing(self, text, clean_caption=False): + if clean_caption and not is_bs4_available(): + logger.warning(BACKENDS_MAPPING["bs4"][-1].format("Setting `clean_caption=True`")) + logger.warning("Setting `clean_caption` to False...") + clean_caption = False + + if clean_caption and not is_ftfy_available(): + logger.warning(BACKENDS_MAPPING["ftfy"][-1].format("Setting `clean_caption=True`")) + logger.warning("Setting `clean_caption` to False...") + clean_caption = False + + if not isinstance(text, (tuple, list)): + text = [text] + + def process(text: str): + if clean_caption: + text = self._clean_caption(text) + text = self._clean_caption(text) + else: + text = text.lower().strip() + return text + + return [process(t) for t in text] + + # Copied from diffusers.pipelines.deepfloyd_if.pipeline_if.IFPipeline._clean_caption + def _clean_caption(self, caption): + caption = str(caption) + caption = ul.unquote_plus(caption) + caption = caption.strip().lower() + caption = re.sub("", "person", caption) + # urls: + caption = re.sub( + r"\b((?:https?:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.]" + r"(?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))", + # noqa + "", + caption, + ) # regex for urls + caption = re.sub( + r"\b((?:www:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)" + r"[\w/-]*\b\/?(?!@)))", + # noqa + "", + caption, + ) # regex for urls + # html: + caption = BeautifulSoup(caption, features="html.parser").text + + # @ + caption = re.sub(r"@[\w\d]+\b", "", caption) + + # 31C0—31EF CJK Strokes + # 31F0—31FF Katakana Phonetic Extensions + # 3200—32FF Enclosed CJK Letters and Months + # 3300—33FF CJK Compatibility + # 3400—4DBF CJK Unified Ideographs Extension A + # 4DC0—4DFF Yijing Hexagram Symbols + # 4E00—9FFF CJK Unified Ideographs + caption = re.sub(r"[\u31c0-\u31ef]+", "", caption) + caption = re.sub(r"[\u31f0-\u31ff]+", "", caption) + caption = re.sub(r"[\u3200-\u32ff]+", "", caption) + caption = re.sub(r"[\u3300-\u33ff]+", "", caption) + caption = re.sub(r"[\u3400-\u4dbf]+", "", caption) + caption = re.sub(r"[\u4dc0-\u4dff]+", "", caption) + caption = re.sub(r"[\u4e00-\u9fff]+", "", caption) + ####################################################### + + # все виды тире / all types of dash --> "-" + caption = re.sub( + r"[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030" + r"\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]+", + # noqa + "-", + caption, + ) + + # кавычки к одному стандарту + caption = re.sub(r"[`´«»“”¨]", '"', caption) + caption = re.sub(r"[‘’]", "'", caption) + + # " + caption = re.sub(r""?", "", caption) + # & + caption = re.sub(r"&", "", caption) + + # ip adresses: + caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption) + + # article ids: + caption = 
re.sub(r"\d:\d\d\s+$", "", caption) + + # \n + caption = re.sub(r"\\n", " ", caption) + + # "#123" + caption = re.sub(r"#\d{1,3}\b", "", caption) + # "#12345.." + caption = re.sub(r"#\d{5,}\b", "", caption) + # "123456.." + caption = re.sub(r"\b\d{6,}\b", "", caption) + # filenames: + caption = re.sub(r"[\S]+\.(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)", "", caption) + + # + caption = re.sub(r"[\"\']{2,}", r'"', caption) # """AUSVERKAUFT""" + caption = re.sub(r"[\.]{2,}", r" ", caption) # """AUSVERKAUFT""" + + caption = re.sub(self.bad_punct_regex, r" ", caption) # ***AUSVERKAUFT***, #AUSVERKAUFT + caption = re.sub(r"\s+\.\s+", r" ", caption) # " . " + + # this-is-my-cute-cat / this_is_my_cute_cat + regex2 = re.compile(r"(?:\-|\_)") + if len(re.findall(regex2, caption)) > 3: + caption = re.sub(regex2, " ", caption) + + caption = ftfy.fix_text(caption) + caption = html.unescape(html.unescape(caption)) + + caption = re.sub(r"\b[a-zA-Z]{1,3}\d{3,15}\b", "", caption) # jc6640 + caption = re.sub(r"\b[a-zA-Z]+\d+[a-zA-Z]+\b", "", caption) # jc6640vc + caption = re.sub(r"\b\d+[a-zA-Z]+\d+\b", "", caption) # 6640vc231 + + caption = re.sub(r"(worldwide\s+)?(free\s+)?shipping", "", caption) + caption = re.sub(r"(free\s)?download(\sfree)?", "", caption) + caption = re.sub(r"\bclick\b\s(?:for|on)\s\w+", "", caption) + caption = re.sub(r"\b(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)(\simage[s]?)?", "", caption) + caption = re.sub(r"\bpage\s+\d+\b", "", caption) + + caption = re.sub(r"\b\d*[a-zA-Z]+\d+[a-zA-Z]+\d+[a-zA-Z\d]*\b", r" ", caption) # j2d1a2a... + + caption = re.sub(r"\b\d+\.?\d*[xх×]\d+\.?\d*\b", "", caption) + + caption = re.sub(r"\b\s+\:\s+", r": ", caption) + caption = re.sub(r"(\D[,\./])\b", r"\1 ", caption) + caption = re.sub(r"\s+", " ", caption) + + caption.strip() + + caption = re.sub(r"^[\"\']([\w\W]+)[\"\']$", r"\1", caption) + caption = re.sub(r"^[\'\_,\-\:;]", r"", caption) + caption = re.sub(r"[\'\_,\-\:\-\+]$", r"", caption) + caption = re.sub(r"^\.\S+$", "", caption) + + return caption.strip() \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/pipeline/pipeline_utils.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/pipeline/pipeline_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..cb9e6c5553a3f7990116900ad59b111c48615821 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/pipeline/pipeline_utils.py @@ -0,0 +1,170 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import os +import inspect +import logging +import importlib +from dataclasses import dataclass + +import torch +from torch import Tensor +from tqdm import tqdm +from diffusers.schedulers import PNDMScheduler + +from opensoraplan.utils.utils import path_check + +from mindiesd import ConfigMixin + +PIPELINE_CONFIG_NAME = "model_index.json" +VAE = 'vae' +TEXT_ENCODER = 'text_encoder' +TOKENIZER = 'tokenizer' +TRANSFORMER = 'transformer' +SCHEDULER = 'scheduler' + +IMAGE_SIZE = 'image_size' +ENABLE_SEQUENCE_PARALLELISM = 'enable_sequence_parallelism' +FPS = 'fps' +DTYPE = 'dtype' +SET_PATCH_PARALLEL = 'set_patch_parallel' +FROM_PRETRAINED = 'from_pretrained' +NUM_SAMPLING_STEPS = 'num_sampling_steps' +REFINE_SERVER_IP = 'refine_server_ip' +REFINE_SERVER_PORT = 'refine_server_port' +MODEL_TYPE = 'model_type' +logger = logging.getLogger(__name__) # init python log + +OPEN_SORA_PLAN_DEFAULT_IMAGE_SIZE = 512 +OPEN_SORA_PLAN_DEFAULT_VIDEO_LENGTH = 17 +OPEN_SORA_PLAN_DEFAULT_CACHE_DIR = "cache_dir" +OPEN_SORA_PLAN_DEFAULT_VAE_STRIDE = 8 +OPEN_SORA_PLAN_DEFAULT_SCHEDULER = PNDMScheduler +OPEN_SORA_PLAN_DEFAULT_DTYPE = torch.float16 + +CACHE_DIR = "cache_dir" +VAE_STRIDE = "vae_stride" +VIDEO_LENGTH = "video_length" + + +class OpenSoraPlanPipelineBase(ConfigMixin): + config_name = PIPELINE_CONFIG_NAME + + def __init__(self): + super().__init__() + if not hasattr(self, "_progress_bar_config"): + self._progress_bar_config = {} + elif not isinstance(self._progress_bar_config, dict): + raise ValueError( + f"`self._progress_bar_config` should be of type `dict`, but is {type(self._progress_bar_config)}." + ) + + @classmethod + def from_pretrained(cls, model_path, **kwargs): + initializers = { + TEXT_ENCODER: init_text_encoder_plan, + VAE: init_vae_plan, + TOKENIZER: init_default_plan + } + + image_size = kwargs.pop(IMAGE_SIZE, OPEN_SORA_PLAN_DEFAULT_IMAGE_SIZE) + dtype = kwargs.pop(DTYPE, OPEN_SORA_PLAN_DEFAULT_DTYPE) + cache_dir = kwargs.pop(CACHE_DIR, OPEN_SORA_PLAN_DEFAULT_CACHE_DIR) + vae_stride = kwargs.pop(VAE_STRIDE, OPEN_SORA_PLAN_DEFAULT_VAE_STRIDE) + if vae_stride != 8: + raise ValueError("Unsupported vae_stride.") + scheduler = kwargs.pop(SCHEDULER, OPEN_SORA_PLAN_DEFAULT_SCHEDULER) + + real_path = path_check(model_path) + init_dict, config_dict = cls.load_config(real_path, **kwargs) + + init_list = [VAE, TEXT_ENCODER, TOKENIZER, TRANSFORMER, SCHEDULER] + pipe_init_dict = {} + model_init_dict = {} + + all_parameters = inspect.signature(cls.__init__).parameters + + required_param = {k: v for k, v in all_parameters.items() if v.default == inspect.Parameter.empty} + expected_modules = set(required_param.keys()) - {"self"} + # init the module from kwargs + passed_module = {k: kwargs.pop(k) for k in expected_modules if k in kwargs} + pipe_init_dict[IMAGE_SIZE] = image_size + model_init_dict[IMAGE_SIZE] = image_size + model_init_dict[DTYPE] = dtype + model_init_dict[CACHE_DIR] = cache_dir + model_init_dict[VAE_STRIDE] = vae_stride + + for key in tqdm(init_list, desc="Loading open-sora-plan-pipeline compenents"): + if key not in init_dict: + raise ValueError(f"Get {key} from init config failed!") + if key in passed_module: + pipe_init_dict[key] = passed_module.pop(key) + else: + modules, cls_name = init_dict[key] + if modules == "mindiesd": + library = importlib.import_module("opensoraplan") + else: + library = importlib.import_module(modules) + class_obj = getattr(library, cls_name) + sub_folder = os.path.join(real_path, key) + + if key == TRANSFORMER: + if pipe_init_dict.get(VAE) is None: + raise 
ValueError("Cannot get module 'vae' in init list!") + + if pipe_init_dict.get(TEXT_ENCODER) is None: + raise ValueError("Cannot get module 'text_encoder' in init list!") + + pipe_init_dict[key] = class_obj.from_pretrained(sub_folder, cache_dir=model_init_dict[CACHE_DIR], + torch_dtype=model_init_dict[DTYPE], **kwargs) + elif key == SCHEDULER: + pipe_init_dict[key] = scheduler + else: + initializer = initializers.get(key, init_default_plan) + pipe_init_dict[key] = initializer(class_obj, sub_folder, model_init_dict, kwargs) + + if pipe_init_dict.get(TRANSFORMER) is None: + raise ValueError("Cannot get module 'transformer' in init list!") + video_length = pipe_init_dict.get(TRANSFORMER).config.video_length + pipe_init_dict[VIDEO_LENGTH] = video_length + + return cls(**pipe_init_dict) + + def progress_bar(self, iterable=None, total=None): + if iterable is not None: + return tqdm(iterable, **self._progress_bar_config) + elif total is not None: + return tqdm(total=total, **self._progress_bar_config) + else: + raise ValueError("Either `total` or `iterable` has to be defined.") + + +def init_text_encoder_plan(class_obj, sub_folder, model_init_dict, kwargs): + return class_obj.from_pretrained(sub_folder, cache_dir=model_init_dict[CACHE_DIR], + torch_dtype=model_init_dict[DTYPE]) + + +def init_vae_plan(class_obj, sub_folder, model_init_dict, kwargs): + height = width = model_init_dict[IMAGE_SIZE] // model_init_dict[VAE_STRIDE] + latent_size = (height, width) + vae = class_obj.from_pretrained(sub_folder, latent_size, cache_dir=model_init_dict[CACHE_DIR], + **kwargs).to(dtype=model_init_dict[DTYPE]) + return vae + + +def init_default_plan(class_obj, sub_folder, model_init_dict, kwargs): + return class_obj.from_pretrained(sub_folder, **kwargs) \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/schedulers/.gitkeep b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/schedulers/.gitkeep new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/schedulers/__init__.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/schedulers/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..30af36f396102c760fe85408124eba70c9c93bb7 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/schedulers/__init__.py @@ -0,0 +1,15 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/schedulers/scheduler_optimizer.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/schedulers/scheduler_optimizer.py new file mode 100644 index 0000000000000000000000000000000000000000..9050ce2ceff1f278c86485043bc111d211eba262 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/schedulers/scheduler_optimizer.py @@ -0,0 +1,49 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from diffusers.schedulers import (DDIMScheduler, DDPMScheduler, PNDMScheduler, + EulerDiscreteScheduler, DPMSolverMultistepScheduler, + HeunDiscreteScheduler, EulerAncestralDiscreteScheduler, + DEISMultistepScheduler, KDPM2AncestralDiscreteScheduler) +from diffusers.schedulers.scheduling_dpmsolver_singlestep import DPMSolverSinglestepScheduler + + +def get_scheduler(sample_method): + if sample_method == 'DDIM': + scheduler = DDIMScheduler() + elif sample_method == 'EulerDiscrete': + scheduler = EulerDiscreteScheduler() + elif sample_method == 'DDPM': + scheduler = DDPMScheduler() + elif sample_method == 'DPMSolverMultistep': + scheduler = DPMSolverMultistepScheduler() + elif sample_method == 'DPMSolverSinglestep': + scheduler = DPMSolverSinglestepScheduler() + elif sample_method == 'PNDM': + scheduler = PNDMScheduler() + elif sample_method == 'HeunDiscrete': + scheduler = HeunDiscreteScheduler() + elif sample_method == 'EulerAncestralDiscrete': + scheduler = EulerAncestralDiscreteScheduler() + elif sample_method == 'DEISMultistep': + scheduler = DEISMultistepScheduler() + elif sample_method == 'KDPM2AncestralDiscrete': + scheduler = KDPM2AncestralDiscreteScheduler() + else: + raise ValueError('ERROR: wrong sample_method given !!!') + return scheduler + + diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/.gitkeep b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/.gitkeep new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/__init__.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..226b35a40990bc60c5f8772219aea4a5970c4802 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/__init__.py @@ -0,0 +1,19 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + + +from .utils import ( + set_random_seed, is_npu_available) diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/log.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/log.py new file mode 100644 index 0000000000000000000000000000000000000000..eeda409158151f269d726148117324a062bc07e4 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/log.py @@ -0,0 +1,20 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging + +logger = logging.getLogger() +logger.setLevel(logging.INFO) \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/utils.py b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..813c1c4190f431d60c007a261d87375b04351076 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/opensoraplan/utils/utils.py @@ -0,0 +1,150 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import logging +import multiprocessing +import random +import os +import importlib +import time +from dataclasses import dataclass +from multiprocessing import Manager, shared_memory +from threading import Timer + +import numpy as np +import torch +import torchvision.io as io +import torch.distributed as dist + +import requests + + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +MASK_DEFAULT = ["0", "0", "0", "0", "1", "0"] +MAX_SHM_SIZE = 10**9 +OPENAI_CLIENT = None +REFINE_PROMPTS = None +REFINE_PROMPTS_TEMPLATE = """ +You need to refine user's input prompt. The user's input prompt is used for video generation task. You need to refine the user's prompt to make it more suitable for the task. Here are some examples of refined prompts: +{} + +The refined prompt should pay attention to all objects in the video. The description should be useful for AI to re-generate the video. The description should be no more than six sentences. The refined prompt should be in English. +""" +RANDOM_PROMPTS = None +RANDOM_PROMPTS_TEMPLATE = """ +You need to generate one input prompt for video generation task. The prompt should be suitable for the task. 
Here are some examples of refined prompts: +{} + +The prompt should pay attention to all objects in the video. The description should be useful for AI to re-generate the video. The description should be no more than six sentences. The prompt should be in English. +""" +REFINE_EXAMPLE = [ + "a close - up shot of a woman standing in a room with a white wall and a plant on the left side." + "the woman has curly hair and is wearing a green tank top." + "she is looking to the side with a neutral expression on her face." + "the lighting in the room is soft and appears to be natural, coming from the left side of the frame." + "the focus is on the woman, with the background being out of focus." + "there are no texts or other objects in the video.the style of the video is a simple," + " candid portrait with a shallow depth of field.", + "a serene scene of a pond filled with water lilies.the water is a deep blue, " + "providing a striking contrast to the pink and white flowers that float on its surface." + "the flowers, in full bloom, are the main focus of the video." + "they are scattered across the pond, with some closer to the camera and others further away, " + "creating a sense of depth.the pond is surrounded by lush greenery, adding a touch of nature to the scene." + "the video is taken from a low angle, looking up at the flowers, " + "which gives a unique perspective and emphasizes their beauty." + "the overall composition of the video suggests a peaceful and tranquil setting, likely a garden or a park.", + "a professional setting where a woman is presenting a slide from a presentation." + "she is standing in front of a projector screen, which displays a bar chart." + "the chart is colorful, with bars of different heights, indicating some sort of data comparison." + "the woman is holding a pointer, which she uses to highlight specific parts of the chart." + "she is dressed in a white blouse and black pants, and her hair is styled in a bun." + "the room has a modern design, with a sleek black floor and a white ceiling." + "the lighting is bright, illuminating the woman and the projector screen." + "the focus of the image is on the woman and the projector screen, with the background being out of focus." + "there are no texts visible in the image." + "the relative positions of the objects suggest that the woman is the main subject of the image, " + "and the projector screen is the object of her attention." + "the image does not provide any information about the content of the presentation or the context of the meeting." +] +MAX_NEW_TOKENS = 512 +TEMPERATURE = 1.1 +TOP_P = 0.95 +TOP_K = 100 +SEED = 10 +REPETITION_PENALTY = 1.03 + +TIMEOUT_T = 600 + + +def is_npu_available(): + "Checks if `torch_npu` is installed and potentially if a NPU is in the environment" + + if importlib.util.find_spec("torch") is None or importlib.util.find_spec("torch_npu") is None: + return False + + import torch_npu + + try: + # Will raise a RuntimeError if no NPU is found + _ = torch.npu.device_count() + return torch.npu.is_available() + except RuntimeError: + return False + + +def set_random_seed(seed): + """Set random seed. + + Args: + seed (int, optional): Seed to be used. 
+ + """ + + random.seed(seed) + np.random.seed(seed) + torch.manual_seed(seed) + return seed + + +def path_check(path: str): + """ + check path + param: path + return: data real path after check + """ + if os.path.islink(path) or path is None: + raise RuntimeError("The path should not be None or a symbolic link file.") + path = os.path.realpath(path) + if not check_owner(path): + raise RuntimeError("The path is not owned by current user or root.") + if not os.path.exists(path): + raise RuntimeError("The path does not exist.") + return path + + +def check_owner(path: str): + """ + check the path owner + param: the input path + return: whether the path owner is current user or not + """ + path_stat = os.stat(path) + path_owner, path_gid = path_stat.st_uid, path_stat.st_gid + user_check = path_owner == os.getuid() and path_owner == os.geteuid() + return path_owner == 0 or path_gid in os.getgroups() or user_check diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/requirements.txt b/MindIE/MultiModal/OpenSoraPlan-1.0/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..ff5c2e6585467220b25a08047d4a926bd27fb04d --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/requirements.txt @@ -0,0 +1,47 @@ +torch==2.1.0 +torchvision==0.16.0 +transformers==4.39.1 +accelerate==0.28.0 +albumentations==1.4.0 +av==11.0.0 +einops==0.7.0 +fastapi==0.110.0 +gdown==5.1.0 +h5py==3.10.0 +idna==3.6 +imageio==2.34.0 +matplotlib==3.7.5 +numpy==1.24.4 +omegaconf==2.1.1 +opencv-python==4.9.0.80 +opencv-python-headless==4.9.0.80 +pandas==2.0.3 +pillow==10.2.0 +pydub==0.25.1 +pytorch-lightning==2.2.1 +pytorchvideo==0.1.5 +PyYAML==6.0.1 +regex==2023.12.25 +requests==2.31.0 +scikit-learn==1.3.2 +scipy==1.10.1 +six==1.16.0 +test-tube==0.7.5 +timm==0.9.16 +torchdiffeq==0.2.3 +torchmetrics==1.3.2 +tqdm==4.66.2 +urllib3==2.2.1 +uvicorn==0.27.1 +diffusers==0.27.2 +scikit-video==1.1.11 +imageio-ffmpeg==0.4.9 +sentencepiece==0.1.99 +beautifulsoup4==4.12.3 +ftfy==6.1.3 +moviepy==1.0.3 +wandb==0.16.3 +tensorboard==2.14.0 +pydantic==2.6.4 +gradio==4.0.0 +huggingface-hub==0.25.1 \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/configmixin.json b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/configmixin.json new file mode 100644 index 0000000000000000000000000000000000000000..c36d39a62dfd3efad741a75fb40bec078c99f7b5 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/configmixin.json @@ -0,0 +1,4 @@ +{ + "used_key": "used_key", + "noused_key": "noused_key" +} \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/invalid.json b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/invalid.json new file mode 100644 index 0000000000000000000000000000000000000000..ec50e78b0d4583552d4140d44dc8e42cf48d325d --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/invalid.json @@ -0,0 +1 @@ +12312312312312354, 123123123123213 \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/scheduler_config_invalid_test.json b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/scheduler_config_invalid_test.json new file mode 100644 index 0000000000000000000000000000000000000000..4ea7ff8666befde67c08b991249aa1a4583750cf --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/scheduler_config_invalid_test.json @@ -0,0 +1,5 @@ +{ + "num_timesteps": 30, + "num_sampling_steps": 1000, + "sample_method": "UNIFORM_CONSTANT" +} \ No newline at end of file diff --git 
a/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/scheduler_config_test.json b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/scheduler_config_test.json new file mode 100644 index 0000000000000000000000000000000000000000..ee0c5aae5b3b5e4c172bb3a01e053ab0b6d2a769 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/configs/scheduler_config_test.json @@ -0,0 +1,6 @@ +{ + "num_timesteps": 30, + "num_sampling_steps": 1000, + "sample_method": "UNIFORM_CONSTANT", + "loc": 0.0 +} \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/tests/models/test_causalvae.py b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/models/test_causalvae.py new file mode 100644 index 0000000000000000000000000000000000000000..a31ffadba8001f7c1152ea7b0e4f5ab081d095b8 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/models/test_causalvae.py @@ -0,0 +1,121 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest +import os +import sys +import torch +import torch_npu + +import colossalai + +sys.path.append(os.path.split(sys.path[0])[0]) + +from opensoraplan.models.causalvae.modeling_causalvae import DiagonalGaussianDistribution, CausalVAEModel + +SEED = 5464 +MASTER_PORT = '42043' +PROMPT = ["A cat playing with a ball"] + + +class TestDiagonalGaussianDistribution(unittest.TestCase): + def setUp(self): + self.device = torch.device("npu" if torch.npu.is_available() else "cpu") + self.parameters = torch.randn([10, 20, 10, 20]).to(self.device) + self.diagonal_gaussian = DiagonalGaussianDistribution(self.parameters) + + def test_sample(self): + sample = self.diagonal_gaussian.sample() + self.assertEqual(sample.shape, self.parameters[:, :10].shape) + + def test_kl(self): + kl = self.diagonal_gaussian.kl() + self.assertEqual(kl.shape, torch.Size([10])) + + def test_kl_with_other(self): + other_parameters = torch.randn([10, 20, 10, 20]).to(self.device) + other_diagonal_gaussian = DiagonalGaussianDistribution(other_parameters) + kl = self.diagonal_gaussian.kl(other_diagonal_gaussian) + self.assertEqual(kl.shape, torch.Size([10])) + + def test_kl_with_deterministic(self): + other_parameters = torch.randn([10, 20, 10, 20]).to(self.device) + other_diagonal_gaussian = DiagonalGaussianDistribution(other_parameters, deterministic=True) + kl = other_diagonal_gaussian.kl() + self.assertEqual(kl.shape, torch.Size([1])) + + def test_nll(self): + sample = self.diagonal_gaussian.sample() + nll = self.diagonal_gaussian.nll(sample) + self.assertEqual(nll.shape, torch.Size([10])) + + def test_mode(self): + mode = self.diagonal_gaussian.mode() + self.assertEqual(mode.shape, self.parameters[:, :10].shape) + self.assertTrue(torch.allclose(mode, self.diagonal_gaussian.mean, atol=1e-1, rtol=1e-1)) + + +class TestCausalVAEModel(unittest.TestCase): + def setUp(self): + self.device = torch.device("npu" if torch.npu.is_available() else "cpu") + self.model = CausalVAEModel(attn_resolutions=(8,)).to(self.device) + self.path = 
'test_checkpoint.pth' + torch.save(self.model.state_dict(), self.path) + + def test_init(self): + self.assertIsInstance(self.model, CausalVAEModel) + + def test_decode(self): + z = torch.randn(1, 4, 4, 8, 8).to(self.device) + dec = self.model.decode(z) + self.assertIsInstance(dec, torch.Tensor) + + def test_blend_v(self): + a = torch.randn(1, 4, 4, 8, 8).to(self.device) + b = torch.randn(1, 4, 4, 8, 8).to(self.device) + blend = self.model.blend_v(a, b, 8) + self.assertIsInstance(blend, torch.Tensor) + + def test_blend_h(self): + a = torch.randn(1, 4, 4, 8, 8).to(self.device) + b = torch.randn(1, 4, 4, 8, 8).to(self.device) + blend = self.model.blend_h(a, b, 8) + self.assertIsInstance(blend, torch.Tensor) + + def test_tiled_decode2d(self): + z = torch.randn(1, 4, 4, 8, 8).to(self.device) + dec = self.model.decode(z) + self.assertIsInstance(dec, torch.Tensor) + + def test_enable_tiling(self): + self.model.enable_tiling() + self.assertTrue(self.model.use_tiling) + + def test_disable_tiling(self): + self.model.disable_tiling() + self.assertFalse(self.model.use_tiling) + + def test_init_from_ckpt(self): + new_model = CausalVAEModel(attn_resolutions=(16,)) + new_model.init_from_ckpt(self.path, ['loss']) + self.assertIsInstance(new_model, CausalVAEModel) + + def tearDown(self): + os.remove(self.path) + + +if __name__ == "__main__": + unittest.main(argv=['first-arg-is-ignored']) diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/tests/pipeline/spiece.model b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/pipeline/spiece.model new file mode 100644 index 0000000000000000000000000000000000000000..4e28ff6ebdf584f5372d9de68867399142435d9a Binary files /dev/null and b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/pipeline/spiece.model differ diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/tests/pipeline/test_opensora_plan_pipeline.py b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/pipeline/test_opensora_plan_pipeline.py new file mode 100644 index 0000000000000000000000000000000000000000..fb1c5e6fe5d88847b2a9fa7836dfd051a4652657 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/pipeline/test_opensora_plan_pipeline.py @@ -0,0 +1,136 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
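The tiling switches covered by `TestCausalVAEModel` above can also be toggled directly on a model instance. A short sketch under the same assumptions as those tests (same constructor arguments and latent shape; device placement omitted):

```python
import torch

from opensoraplan.models.causalvae.modeling_causalvae import CausalVAEModel

model = CausalVAEModel(attn_resolutions=(8,)).eval()
z = torch.randn(1, 4, 4, 8, 8)   # latent shape used in the tests above

model.enable_tiling()            # sets use_tiling = True
dec_tiled = model.decode(z)      # decode through the tiled path

model.disable_tiling()           # sets use_tiling = False
dec = model.decode(z)            # plain decode, as in test_decode
```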
+ +import unittest +import os +import sys +import torch +import torch_npu + +import colossalai +from transformers import T5EncoderModel, T5Tokenizer, T5Config + +sys.path.append(os.path.split(sys.path[0])[0]) + +from opensoraplan import OpenSoraPlanPipeline, CausalVAEModelWrapper, LatteT2V +from opensoraplan import compile_pipe, get_scheduler, set_parallel_manager +from opensoraplan import CacheConfig, OpenSoraPlanDiTCacheManager +from opensoraplan.models.causalvae.modeling_causalvae import CausalVAEModel + +SEED = 5464 +MASTER_PORT = '42043' +PROMPT = ["A cat playing with a ball"] + + +class TestOpenSoraPlanPipeline(unittest.TestCase): + @classmethod + def setUpClass(cls): + sp_size = int(os.getenv('SP_SIZE', '1')) + os.environ['WORLD_SIZE'] = f'{sp_size}' + if sp_size == 1: + os.environ['RANK'] = '0' + os.environ['LOCAL_RANK'] = '0' + os.environ['MASTER_ADDR'] = 'localhost' + os.environ['MASTER_PORT'] = MASTER_PORT + + colossalai.launch_from_torch({}, seed=SEED) + set_parallel_manager(sp_size=sp_size, sp_axis=0) + + def setUp(self): + torch.manual_seed(SEED) + torch.npu.manual_seed(SEED) + torch.npu.manual_seed_all(SEED) + torch.set_grad_enabled(False) + self.device_id = 0 + self.device = "npu" if torch.npu.is_available() else "cpu" + self.num_frames = None + self.images = None + self.model_path = None + + def test_pipeline_pndm(self): + pipeline = self._init("PNDM") + pipeline = compile_pipe(pipeline) + result = pipeline(prompt=PROMPT, num_inference_steps=5, guidance_scale=7.5).video + self.assertEqual(result.shape, torch.Size([1, 17, 256, 256, 3])) + + def test_pipeline_patch_compress_ddpm(self): + pipeline = self._init("DDPM") + cache_manager = OpenSoraPlanDiTCacheManager(CacheConfig(1, 3, 1, 2, True)) + # compile pipeline and set the cache_manager and cfg_last_step + pipeline = compile_pipe(pipeline, cache_manager, 3) + result = pipeline(prompt=PROMPT, num_inference_steps=5, guidance_scale=7.5).video + + ratio = ( + pipeline.transformer.cache_manager.all_block_num / pipeline.transformer.cache_manager.cal_block_num + ) + self.assertGreater(ratio, 1.1) + self.assertEqual(result.shape, torch.Size([1, 17, 256, 256, 3])) + + def _init(self, scheduler_type="PNDM"): + latent_size = (256 // 8, 256 // 8) + causal_vae_model = CausalVAEModel(attn_resolutions=[]) + vae = CausalVAEModelWrapper(causal_vae_model, latent_size).eval() + t5_config = T5Config( + d_model=4096, + d_ff=10240, + num_layers=5, + num_decoder_layers=5, + num_heads=64, + feed_forward_proj="gated-gelu", + decoder_start_token_id=0, + dense_act_fn="gelu_new", + is_gated_act=True, + model_type="t5", + output_past=True, + tie_word_embeddings=False, + ) + text_encoder = T5EncoderModel(t5_config).eval() + vocab_file_path = os.path.join(os.path.dirname(__file__), "spiece.model") + tokenizer = T5Tokenizer(vocab_file_path) + transformer = LatteT2V( + activation_fn="gelu-approximate", + attention_bias=True, + attention_head_dim=72, + attention_mode="xformers", + caption_channels=4096, + cross_attention_dim=1152, + in_channels=4, + norm_elementwise_affine=False, + norm_eps=1e-6, + norm_type="ada_norm_single", + num_embeds_ada_norm=1000, + num_layers=4, + out_channels=8, + patch_size=2, + sample_size=latent_size, + video_length=5, + ).to(self.device).eval() + schedular = get_scheduler(scheduler_type) + + pipeline = OpenSoraPlanPipeline( + text_encoder=text_encoder, + tokenizer=tokenizer, + transformer=transformer, + vae=vae, + scheduler=schedular, + video_length=5, + image_size=256 + ) + return pipeline + + +if __name__ == "__main__": + 
unittest.main(argv=['first-arg-is-ignored']) diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/tests/run_test.sh b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/run_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..0390fe5d8498f3104a58a2748780106e41128316 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/run_test.sh @@ -0,0 +1,33 @@ +#!/bin/bash +# Copyright(C) 2024. Huawei Technologies Co.,Ltd. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +set -e + +if command -v python3 &> /dev/null; then + python_command=python3 +else + python_command=python +fi + +pip install coverage +pip install colossalai==0.4.4 --no-deps +pip install pytest +pip install pytest-cov + +current_directory=$(dirname "$(readlink -f "$0")") +export PYTHONPATH=${current_directory}/../:$PYTHONPATH + +pytest -k "test_ and not _test" --cov=../opensoraplan --cov-branch --cov-report xml --cov-report html \ +--junit-xml=${current_directory}/final.xml \ +--continue-on-collection-errors \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/tests/t2v_sora.txt b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/t2v_sora.txt new file mode 100644 index 0000000000000000000000000000000000000000..6b73dea0fc15fa26c976faddbbe31e9fe410906e --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/t2v_sora.txt @@ -0,0 +1,48 @@ +A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about. +Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field. +A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors. +Drone view of waves crashing against the rugged cliffs along Big Sur’s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff’s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff’s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway. +Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. 
The art style is 3D and realistic, with a focus on lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image. +A gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures. +This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage and red chest. Its crest is made of delicate, lacy feathers, while its eye is a striking red color. The bird’s head is tilted slightly to the side, giving the impression of it looking regal and majestic. The background is blurred, drawing attention to the bird’s striking appearance. +Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee. +A young man at his 20s is sitting on a piece of cloud in the sky, reading a book. +Historical footage of California during the gold rush. +A close up view of a glass sphere that has a zen garden within it. There is a small dwarf in the sphere who is raking the zen garden and creating patterns in the sand. +Extreme close up of a 24 year old woman’s eye blinking, standing in Marrakech during magic hour, cinematic film shot in 70mm, depth of field, vivid colors, cinematic +A cartoon kangaroo disco dances. +A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera. +A petri dish with a bamboo forest growing within it that has tiny red pandas running around. +The camera rotates around a large stack of vintage televisions all showing different programs — 1950s sci-fi movies, horror movies, news, static, a 1970s sitcom, etc, set inside a large New York museum gallery. +3D animation of a small, round, fluffy creature with big, expressive eyes explores a vibrant, enchanted forest. The creature, a whimsical blend of a rabbit and a squirrel, has soft blue fur and a bushy, striped tail. It hops along a sparkling stream, its eyes wide with wonder. The forest is alive with magical elements: flowers that glow and change colors, trees with leaves in shades of purple and silver, and small floating lights that resemble fireflies. The creature stops to interact playfully with a group of tiny, fairy-like beings dancing around a mushroom ring. The creature looks up in awe at a large, glowing tree that seems to be the heart of the forest. +The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from it’s tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds. +Reflections in the window of a train traveling through the Tokyo suburbs. 
+A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi Coast, the view showcases historic and magnificent architectural details and tiered pathways and patios, waves are seen crashing against the rocks below as the view overlooks the horizon of the coastal waters and hilly landscapes of the Amalfi Coast Italy, several distant people are seen walking and enjoying vistas on patios of the dramatic ocean views, the warm glow of the afternoon sun creates a magical and romantic feeling to the scene, the view is stunning captured with beautiful photography. +A large orange octopus is seen resting on the bottom of the ocean floor, blending in with the sandy and rocky terrain. Its tentacles are spread out around its body, and its eyes are closed. The octopus is unaware of a king crab that is crawling towards it from behind a rock, its claws raised and ready to attack. The crab is brown and spiny, with long legs and antennae. The scene is captured from a wide angle, showing the vastness and depth of the ocean. The water is clear and blue, with rays of sunlight filtering through. The shot is sharp and crisp, with a high dynamic range. The octopus and the crab are in focus, while the background is slightly blurred, creating a depth of field effect. +A flock of paper airplanes flutters through a dense jungle, weaving around trees as if they were migrating birds. +A cat waking up its sleeping owner demanding breakfast. The owner tries to ignore the cat, but the cat tries new tactics and finally the owner pulls out a secret stash of treats from under the pillow to hold the cat off a little longer. +Borneo wildlife on the Kinabatangan River +A Chinese Lunar New Year celebration video with Chinese Dragon. +Tour of an art gallery with many beautiful works of art in different styles. +Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes. +A stop motion animation of a flower growing out of the windowsill of a suburban house. +The story of a robot’s life in a cyberpunk setting. +An extreme close-up of an gray-haired man with a beard in his 60s, he is deep in thought pondering the history of the universe as he sits at a cafe in Paris, his eyes focus on people offscreen as they walk as he sits mostly motionless, he is dressed in a wool coat suit coat with a button-down shirt , he wears a brown beret and glasses and has a very professorial appearance, and the end he offers a subtle closed-mouth smile as if he found the answer to the mystery of life, the lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film. +A beautiful silhouette animation shows a wolf howling at the moon, feeling lonely, until it finds its pack. +New York City submerged like Atlantis. Fish, whales, sea turtles and sharks swim through the streets of New York. +A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in. +Step-printing scene of a person running, cinematic film shot in 35mm. +Five gray wolf pups frolicking and chasing each other around a remote gravel road, surrounded by grass. The pups run and leap, chasing each other, and nipping at each other, playing. +Basketball through hoop then explodes. 
+Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care. +A grandmother with neatly combed grey hair stands behind a colorful birthday cake with numerous candles at a wood dining room table, expression is one of pure joy and happiness, with a happy glow in her eye. She leans forward and blows out the candles with a gentle puff, the cake has pink frosting and sprinkles and the candles cease to flicker, the grandmother wears a light blue blouse adorned with floral patterns, several happy friends and family sitting at the table can be seen celebrating, out of focus. The scene is beautifully captured, cinematic, showing a 3/4 view of the grandmother and the dining room. Warm color tones and soft lighting enhance the mood. +The camera directly faces colorful buildings in Burano Italy. An adorable dalmation looks through a window on a building on the ground floor. Many people are walking and cycling along the canal streets in front of the buildings. +An adorable happy otter confidently stands on a surfboard wearing a yellow lifejacket, riding along turquoise tropical waters near lush tropical islands, 3D digital render art style. +This close-up shot of a chameleon showcases its striking color changing capabilities. The background is blurred, drawing attention to the animal’s striking appearance. +A corgi vlogging itself in tropical Maui. +A white and orange tabby cat is seen happily darting through a dense garden, as if chasing something. Its eyes are wide and happy as it jogs forward, scanning the branches, flowers, and leaves as it walks. The path is narrow as it makes its way between all the plants. the scene is captured from a ground-level angle, following the cat closely, giving a low and intimate perspective. The image is cinematic with warm tones and a grainy texture. The scattered daylight between the leaves and plants above creates a warm contrast, accentuating the cat’s orange fur. The shot is clear and sharp, with a shallow depth of field. +Aerial view of Santorini during the blue hour, showcasing the stunning architecture of white Cycladic buildings with blue domes. The caldera views are breathtaking, and the lighting creates a beautiful, serene atmosphere. +Tiltshift of a construction site filled with workers, equipment, and heavy machinery. +A giant, towering cloud in the shape of a man looms over the earth. The cloud man shoots lighting bolts down to the earth. +A Samoyed and a Golden Retriever dog are playfully romping through a futuristic neon city at night. The neon lights emitted from the nearby buildings glistens off of their fur. +The Glenfinnan Viaduct is a historic railway bridge in Scotland, UK, that crosses over the west highland line between the towns of Mallaig and Fort William. It is a stunning sight as a steam train leaves the bridge, traveling over the arch-covered viaduct. The landscape is dotted with lush greenery and rocky mountains, creating a picturesque backdrop for the train journey. The sky is blue and the sun is shining, making for a beautiful day to explore this majestic spot. \ No newline at end of file diff --git a/MindIE/MultiModal/OpenSoraPlan-1.0/tests/test_config_utils.py b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/test_config_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d4a53798251341a7fbc4278cb84b7a9f50ac0059 --- /dev/null +++ b/MindIE/MultiModal/OpenSoraPlan-1.0/tests/test_config_utils.py @@ -0,0 +1,91 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright(C) 2024. 
Huawei Technologies Co.,Ltd. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License + +import unittest +import logging +import sys +import os +import json +sys.path.append('../') +from mindiesd.config_utils import ConfigMixin + +logger = logging.getLogger() +logger.setLevel(logging.INFO) + + +USED_KEY = "used_key" +NOUSED_KEY = "noused_key" +CONFIG_NAME = "./configs" + + +class ModelConfig(ConfigMixin): + config_name = "configmixin.json" + + def __init__(self, used_key): + self.used_key = used_key + + +class InvalidModelConfig(ConfigMixin): + config_name = "invalid.json" + + def __init__(self, used_key): + self.used_key = used_key + + +class TestConfigMixin(unittest.TestCase): + + def test_load_config(self): + init_dict, config_dict = ModelConfig.load_config(CONFIG_NAME) + # used_key will in init_dict + self.assertIn(USED_KEY, init_dict) + self.assertEqual(init_dict.get(USED_KEY), USED_KEY) + self.assertNotIn(USED_KEY, config_dict) + + # noused_key will in config_dict + self.assertIn(NOUSED_KEY, config_dict) + self.assertEqual(config_dict.get(NOUSED_KEY), NOUSED_KEY) + self.assertNotIn(NOUSED_KEY, init_dict) + + def test_config_path_invalid(self): + try: + init_dict, config_dict = ModelConfig.load_config("./no_used_path") + except Exception as e: + logger.error(e) + init_dict, config_dict = None, None + self.assertIsNone(init_dict) + self.assertIsNone(config_dict) + + def test_config_path_none(self): + try: + init_dict, config_dict = ConfigMixin.load_config(CONFIG_NAME) + except Exception as e: + logger.error(e) + init_dict, config_dict = None, None + self.assertIsNone(init_dict) + self.assertIsNone(config_dict) + + def test_config_json_invalid(self): + try: + init_dict, config_dict = InvalidModelConfig.load_config(CONFIG_NAME) + except Exception as e: + logger.error(e) + init_dict, config_dict = None, None + self.assertIsNone(init_dict) + self.assertIsNone(config_dict) + + +if __name__ == '__main__': + unittest.main() \ No newline at end of file
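For reference, the split that `test_load_config` asserts (constructor keys go to `init_dict`, everything else to `config_dict`) can be sketched with plain `inspect`; this is a conceptual illustration of the expected behaviour, not the actual `mindiesd` implementation:

```python
import inspect
import json
import os


def split_config(cls, config_dir):
    # Conceptual sketch: keys matching cls.__init__ land in init_dict, the rest in config_dict.
    with open(os.path.join(config_dir, cls.config_name), "r") as f:
        raw = json.load(f)
    accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}
    init_dict = {k: v for k, v in raw.items() if k in accepted}
    config_dict = {k: v for k, v in raw.items() if k not in accepted}
    return init_dict, config_dict


# With the configmixin.json fixture above, split_config(ModelConfig, "./configs") yields
# init_dict == {"used_key": "used_key"} and config_dict == {"noused_key": "noused_key"},
# matching the assertions in test_load_config.
```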