ModelLink code cleanup
This commit is contained in: parent 931d0c258a, commit 6991ef1950

@@ -1,329 +0,0 @@
# Chinese-LLaMA-Alpaca
|
||||
<p align="left">
|
||||
<b>简体中文</b> |
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/alpaca/README_en.md">English</a> </b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
|
||||
|
||||
这里介绍如何快速复现 AscendSpeed 中 [Chinese-LLaMA-Alpaca](https://arxiv.org/abs/2304.08177) 的测试结果
|
||||
|
||||
> Cui, Yang, and Yao, et al. "Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca." arXiv preprint arXiv:2304.08177 (2023).
|
||||
|
||||
这里有Llama-alpaca的[实战教程](https://gitee.com/ascend/AscendSpeed/wikis/AscendSpeed%20Guide/AscendSpeed%20%E4%B8%AD%E6%96%87%E6%95%99%E7%A8%8B)
|
||||
|
||||
# 目录
|
||||
|
||||
- [模型权重](#模型权重)
|
||||
|
||||
- [合并权重](#合并权重)
|
||||
|
||||
- [微调](#微调)
|
||||
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
|
||||
- [推理](#推理)
|
||||
- [脚本](#脚本)
|
||||
|
||||
- [评估](#评估)
|
||||
|
||||
- [举例](#举例)
|
||||
|
||||
- [引用](#引用)
|
||||
|
||||
|
||||
|
||||
# 模型权重
|
||||
|
||||
|
||||
下载 [原始 LLaMA 模型](https://github.com/facebookresearch/llama) 权重,并下载 [Chinese-LLaMA-Alpaca 模型的 LoRA](https://github.com/ymcui/Chinese-LLaMA-Alpaca) 权重,这里的 LoRA 权重可以理解为 LLaMA 权重的一个“补丁”
|
||||
|
||||
# 合并权重
|
||||
在合并权重前,请先确认机器是否有足够的内存加载模型权重,比如 7B的模型就要求13~15G 内存,同时基于[SHA256](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/SHA256.md)检查权重的完整性以确保合并成功。
|
||||
原始 LLaMA 文件包括: tokenizer.model, tokenizer_checklist.chk, consolidated.*.pth, params.json等
|
||||
|
||||
#### 步骤 1: [将原始 LLaMA 模型转化为 huggingface 的格式](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E6%89%8B%E5%8A%A8%E6%A8%A1%E5%9E%8B%E5%90%88%E5%B9%B6%E4%B8%8E%E8%BD%AC%E6%8D%A2#step-1-%E5%B0%86%E5%8E%9F%E7%89%88llama%E6%A8%A1%E5%9E%8B%E8%BD%AC%E6%8D%A2%E4%B8%BAhf%E6%A0%BC%E5%BC%8F)
|
||||
请使用 Transformers 提供的 [convert_llama_weights_to_hf.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py) 脚本将 LLAMA 模型权重转化为 `huggingface` 的格式
|
||||
|
||||
```
|
||||
python convert_llama_weights_to_hf.py \
|
||||
--input_dir path_to_original_llama_root_dir \
|
||||
--model_size 7B \
|
||||
--output_dir path_to_original_llama_hf_dir
|
||||
```
|
||||
|
||||
新的 huggingface 模型文件生成在 `--output_dir` 目录下,如下:
|
||||
|
||||
```
|
||||
config.json
|
||||
generation_config.json
|
||||
pytorch_model-00001-of-00002.bin
|
||||
pytorch_model-00002-of-00002.bin
|
||||
pytorch_model.bin.index.json
|
||||
special_tokens_map.json
|
||||
tokenizer_config.json
|
||||
tokenizer.json
|
||||
tokenizer.model
|
||||
```
|
||||
|
||||
#### 步骤 2: [结合 LoRA 权重生成完整模型权重](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E6%89%8B%E5%8A%A8%E6%A8%A1%E5%9E%8B%E5%90%88%E5%B9%B6%E4%B8%8E%E8%BD%AC%E6%8D%A2#step-2-%E5%90%88%E5%B9%B6lora%E6%9D%83%E9%87%8D%E7%94%9F%E6%88%90%E5%85%A8%E9%87%8F%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D)
|
||||
|
||||
**单个 LoRA 权重合并** (可应用于 Chinese-LLaMA, Chinese-LLaMA-Plus, Chinese-Alpaca).
|
||||
|
||||
下载脚本 [merge_llama_with_chinese_lora.py](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_llama_with_chinese_lora.py), 并执行:
|
||||
```
|
||||
python merge_llama_with_chinese_lora.py \
|
||||
--base_model path_to_original_llama_hf_dir \
|
||||
--lora_model path_to_chinese_llama_or_alpaca_lora \
|
||||
--output_type huggingface \
|
||||
--output_dir path_to_merged_hf_dir
|
||||
```
|
||||
参数说明:
|
||||
|
||||
- `--base_model`: 存放 HF格式 LLaMA 模型和配置文件的目录 (步骤 1 中生成).
|
||||
- `--lora_model`: 存放 Chinese LLAMA/Alpaca LoRA 解压文件的目录
|
||||
- `--output_type`: 明确输出格式,可以是 `pth` or `huggingface`,默认为 `pth`.
|
||||
- `--output_dir`:明确输出文件保存目录,默认为 `./`.
|
||||
|
||||
**多 LoRA 权重合并** (可应用于 Chinese-Alpaca-Plus 和 Chinese-Alpaca-Pro).
|
||||
|
||||
下载脚本 [merge_llama_with_chinese_lora.py](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_llama_with_chinese_lora.py), 并执行:
|
||||
```
|
||||
python merge_llama_with_chinese_lora.py \
|
||||
--base_model path_to_original_llama_hf_dir \
|
||||
--lora_model path_to_chinese_llama_plus_lora,path_to_chinese_alpaca_plus_lora \
|
||||
--output_type huggingface \
|
||||
--output_dir path_to_merged_hf_dir
|
||||
```
|
||||
|
||||
#### 步骤 3: 合并后检查 SHA256
|
||||
|
||||
权重合并后请检查 [SHA256](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/SHA256.md),由于 HF格式的 SHA256 经常发生变化,一般推荐先转化为 `pth` 格式,在确认 SHA256 正确以后,再根据需要转换为 HF 格式。
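下面是一个仅作示意的校验流程(路径为占位符,输出文件名以合并脚本实际输出为准):先用同一合并脚本输出 `pth` 格式,再用 `sha256sum` 计算校验和并与 SHA256.md 中的条目人工比对:

```shell
# 先生成 pth 格式的合并权重(路径为占位符)
python merge_llama_with_chinese_lora.py \
    --base_model path_to_original_llama_hf_dir \
    --lora_model path_to_chinese_llama_or_alpaca_lora \
    --output_type pth \
    --output_dir path_to_merged_pth_dir

# 再计算校验和,人工比对 SHA256.md 中对应的值
sha256sum path_to_merged_pth_dir/*.pth
```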
|
||||
|
||||
#### 步骤 4: 将HF的权重格式转化为AscendSpeed格式
|
||||
|
||||
如果要使用AscendSpeed的张量和流水并行策略,执行:
|
||||
|
||||
```
|
||||
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py \
|
||||
--input-model-dir path_to_merged_hf_dir \
|
||||
--output-model-dir path_to_merged_ascendspeed_dir \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--type 7B
|
||||
```
|
||||
如果要使用AscendSpeed中DeepSpeed的并行策略,执行:
|
||||
|
||||
```
|
||||
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py \
|
||||
--input-model-dir path_to_merged_hf_dir \
|
||||
--output-model-dir path_to_merged_ascendspeed_dir \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--deepspeed
|
||||
```
|
||||
|
||||
|
||||
# 微调
|
||||
## 训练
|
||||
Chinese LLaMA Alpaca-13B 微调的硬件配置:
|
||||
|
||||
| 硬件 | 配置 |
|
||||
|:---:|:---------------:|
|
||||
| CPU | 4x KunPeng 920@3.0GHz, 64 cores per socket, 256 CPUs |
|
||||
| RAM | 32x64 GB DDR4 |
|
||||
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
|
||||
## 脚本
|
||||
|
||||
|
||||
1. 拷贝代码仓到本地服务器
|
||||
|
||||
```bash
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. 搭建环境
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# 安装 torch 和 torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.1.0rc1.post_XXXXXX-cp38-cp38-linux_aarch64.whl
|
||||
|
||||
# 安装 megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
|
||||
# 安装其他包
|
||||
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
|
||||
```
|
||||
3. 准备数据集
|
||||
```bash
|
||||
# 对于llama,下载 alpaca 数据集并将其放入 $DATA_PATH, 比如
|
||||
wget https://github.com/tatsu-lab/stanford_alpaca/raw/main/alpaca_data.json
|
||||
|
||||
# 下载 tokenizer 配置
|
||||
# https://huggingface.co/yahma/llama-7b-hf/tree/main
|
||||
# 将 tokenizer_config.json 文件中的 "LLaMATokenizer" 修改为 "LlamaTokenizer" (这是 huggingface 的一个bug)
|
||||
# 将 tokenizer 文件放在 $TOKENIZER_PATH
|
||||
mkdir dataset
|
||||
python tools/preprocess_data.py --input alpaca_data.json \
|
||||
--output-prefix $DATA_PATH \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--handler-name GeneralInstructionHandler
|
||||
```
|
||||
|
||||
4. 配置 Chinese-LLaMA-Alpaca 微调脚本
|
||||
|
||||
通过设置 `$MODEL_PATH` 变量区分 7B/13B/33B 参数,比如,当 `$MODEL_PATH` 入参的字符串可以匹配为 `*7b*` 时,脚本便会使用 7B的参数
|
||||
|
||||
* 基于torch拉起任务的启动脚本为 : [Chinese-LLaMA-Alpaca-7B/13B/33B](finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh)
|
||||
|
||||
```bash
|
||||
bash examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh
|
||||
```
|
||||
|
||||
* 基于deepspeed拉起任务的启动脚本为 : [Chinese-LLaMA-Alpaca-7B/13B/33B](finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh)
|
||||
|
||||
```bash
|
||||
bash examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh
|
||||
```
|
||||
|
||||
## 性能
|
||||
|
||||
### 吞吐
|
||||
|
||||
以下是 Chinese LLaMA Alpaca-13B 在昇腾芯片和参考芯片上的吞吐对比,参数配置参考:finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh:
|
||||
|
||||
| 芯片 | 模型 | 迭代次数 | 样本吞吐 (samples/s/p) | token吞吐 (tokens/s/p) | 浮点计算次数 (TFLOPs/s) |
|
||||
|:----:|:------------------------:|:----:|:------------------:|:--------------------:|:-----------------:|
|
||||
| GPUs | Chinese LLaMA Alpaca-13B | 3000 | 5.83 | 1493.73 | 153.91 |
|
||||
| NPUs | Chinese LLaMA Alpaca-13B | 3000 | 6.59 | 1687.04 | 183.81 |
|
||||
|
||||
|
||||
|
||||
### 精度
|
||||
|
||||
NPU vs GPU loss.
|
||||
|
||||

|
||||
|
||||
NPU vs GPU loss 相对误差.
|
||||
|
||||
|
||||

|
||||
|
||||
|
||||
## 推理
|
||||
|
||||
AscendSpeed 当前支持 Chinese LLaMA Alpaca-13B 的文本生成推理
|
||||
|
||||
### 脚本
|
||||
|
||||
推理脚本中配置路径参数:[examples/alpaca/generate_alpaca_13B_deepspeed.sh](examples/alpaca/generate_alpaca_13B_deepspeed.sh)
|
||||
|
||||
```shell
|
||||
# 修改模型权重和tokenizer词表路径
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
```shell
|
||||
bash examples/alpaca/generate_alpaca_13B_deepspeed.sh
|
||||
```
|
||||
|
||||
## 评估
|
||||
|
||||
CEVAL评估举例,数据集[下载](https://cevalbenchmark.com/)。
|
||||
|
||||
```shell
|
||||
CHECKPOINT=./ckpt/
|
||||
VOCAB_FILE=./alpaca-plus-13b/
|
||||
# 配置任务和数据集路径
|
||||
DATA_PATH="./ceval/data/test/"
|
||||
TASK="ceval"
|
||||
# configure generation parameters
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
|
||||
--task-data-path $DATA_PATH \
|
||||
--task $TASK \
|
||||
--seq-length 2048 \
|
||||
--max-new-tokens 2 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13824 \
|
||||
--load ${CHECKPOINT} \
|
||||
--num-attention-heads 40 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path $VOCAB_FILE \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--seed 42 | tee logs/eval_alpaca-13b.log
|
||||
```
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Task</th>
|
||||
<th>Subset</th>
|
||||
<th>Model</th>
|
||||
<th>NPU</th>
|
||||
<th>OpenSource</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://cevalbenchmark.com/">Ceval</a></td>
|
||||
<td>Val</td>
|
||||
<th>alpaca plus 13B</th>
|
||||
<td>0.408</td>
|
||||
<td><a href="https://github.com/ymcui/Chinese-LLaMA-Alpaca">0.415</a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
|
||||
## 举例
|
||||
Chinese LLaMA Alpaca-13B:
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
# 引用
|
||||
|
||||
```
|
||||
@article{chinese-llama-alpaca,
|
||||
title={Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca},
|
||||
author={Cui, Yiming and Yang, Ziqing and Yao, Xin},
|
||||
journal={arXiv preprint arXiv:2304.08177},
|
||||
url={https://arxiv.org/abs/2304.08177},
|
||||
year={2023}
|
||||
}
|
||||
```
|
|
@@ -1,347 +0,0 @@
# Chinese-LLaMA-Alpaca
|
||||
<p align="left">
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/alpaca/README.md">简体中文</a></b> |
|
||||
<b>English</b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
|
||||
This directory contains scripts used to produce the results of Chinese-LLaMA-Alpaca in AscendSpeed.
|
||||
|
||||
Chinese-LLaMA-Alpaca model is from: [Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca](https://arxiv.org/abs/2304.08177)
|
||||
|
||||
> Cui, Yang, and Yao, et al. "Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca." arXiv preprint arXiv:2304.08177 (2023).
|
||||
|
||||
Here is the Chinese-LLaMA-Alpaca hands-on [guide](https://gitee.com/ascend/AscendSpeed/wikis/AscendSpeed%20Guide/AscendSpeed%20%E4%B8%AD%E6%96%87%E6%95%99%E7%A8%8B)
|
||||
|
||||
|
||||
# Contents
|
||||
|
||||
- [Contents](#contents)
|
||||
|
||||
- [Model Weights](#model-weights)
|
||||
|
||||
- [Merge Weights](#merge-weights)
|
||||
|
||||
- [Fine-tune](#fine-tune)
|
||||
|
||||
- [Training](#training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
|
||||
- [Machine performance](#machine-performance)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss)
|
||||
|
||||
- [Inference](#inference)
|
||||
- [Script](#script)
|
||||
|
||||
- [Evaluation](#evaluation)
|
||||
|
||||
- [Example](#example)
|
||||
|
||||
- [Citation](#citation)
|
||||
|
||||
|
||||
|
||||
# Model Weights
|
||||
|
||||
|
||||
First download the [original LLaMA model](https://github.com/facebookresearch/llama) weights, then download the [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) LoRA weights, which can be understood as a "patch" on the original LLaMA model. Then merge the LoRA weights into the original LLaMA model to obtain the complete weights.
|
||||
|
||||
# Merge Weights
|
||||
|
||||
Before merging weights, make sure the machine has enough memory to load the complete model weights (for example, the 7B model requires about 13-15 GB). Also verify the integrity of the base model and the downloaded LoRA model against the values listed in SHA256.md; otherwise the merge cannot be performed. The original LLaMA files include: tokenizer.model, tokenizer_checklist.chk, consolidated.*.pth, and params.json.
|
||||
|
||||
#### Step 1: [Convert the original LLaMA model to HF format.](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E6%89%8B%E5%8A%A8%E6%A8%A1%E5%9E%8B%E5%90%88%E5%B9%B6%E4%B8%8E%E8%BD%AC%E6%8D%A2#step-1-%E5%B0%86%E5%8E%9F%E7%89%88llama%E6%A8%A1%E5%9E%8B%E8%BD%AC%E6%8D%A2%E4%B8%BAhf%E6%A0%BC%E5%BC%8F)
|
||||
Please use the [convert_llama_weights_to_hf.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py) script provided by Transformers to convert the original LLaMA model to HuggingFace format.
|
||||
```
|
||||
python convert_llama_weights_to_hf.py \
|
||||
--input_dir path_to_original_llama_root_dir \
|
||||
--model_size 7B \
|
||||
--output_dir path_to_original_llama_hf_dir
|
||||
```
|
||||
|
||||
Model files in HF format will be generated in the `--output_dir` directory, such as:
|
||||
|
||||
```
|
||||
config.json
|
||||
generation_config.json
|
||||
pytorch_model-00001-of-00002.bin
|
||||
pytorch_model-00002-of-00002.bin
|
||||
pytorch_model.bin.index.json
|
||||
special_tokens_map.json
|
||||
tokenizer_config.json
|
||||
tokenizer.json
|
||||
tokenizer.model
|
||||
```
|
||||
|
||||
#### Step 2: [Combine LoRA weights to generate full model weights.](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E6%89%8B%E5%8A%A8%E6%A8%A1%E5%9E%8B%E5%90%88%E5%B9%B6%E4%B8%8E%E8%BD%AC%E6%8D%A2#step-2-%E5%90%88%E5%B9%B6lora%E6%9D%83%E9%87%8D%E7%94%9F%E6%88%90%E5%85%A8%E9%87%8F%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D)
|
||||
|
||||
This step expands the Chinese vocabulary of the original LLaMA model (HF format), merges the LoRA weights, and generates the full model weights. You can choose to output either PyTorch-format weights (.pth files) or HuggingFace-format weights (.bin files). It is recommended to produce the pth files first, compare the SHA256 of the merged model, and then convert to HF format as needed.
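For that pth pass, the same merge script can be invoked with `--output_type pth`; this is only a sketch, with placeholder paths:

```shell
python merge_llama_with_chinese_lora.py \
    --base_model path_to_original_llama_hf_dir \
    --lora_model path_to_chinese_llama_or_alpaca_lora \
    --output_type pth \
    --output_dir path_to_merged_pth_dir
```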
|
||||
|
||||
**Single LoRA weight merging** (applicable to Chinese-LLaMA, Chinese-LLaMA-Plus, Chinese-Alpaca).
|
||||
Download the script [merge_llama_with_chinese_lora.py](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_llama_with_chinese_lora.py), and execute the following command:
|
||||
```
|
||||
python merge_llama_with_chinese_lora.py \
|
||||
--base_model path_to_original_llama_hf_dir \
|
||||
--lora_model path_to_chinese_llama_or_alpaca_lora \
|
||||
--output_type huggingface \
|
||||
--output_dir path_to_merged_hf_dir
|
||||
```
|
||||
Parameter Description:
|
||||
|
||||
- `--base_model`: Directory containing the LLaMA model weights and configuration files in HF format (generated in Step 1).
- `--lora_model`: Directory containing the decompressed Chinese LLaMA/Alpaca LoRA files.
- `--output_type`: Output format, either `pth` or `huggingface`. Defaults to `pth` if not specified.
- `--output_dir`: Directory in which to save the full model weights. Defaults to `./`.
- (Optional) `--offload_dir` (only valid for the old script `scripts/merge_llama_with_chinese_lora.py`): for low-memory users, specify an offload cache path.
- (Optional) `--verbose` (only valid for the new script `scripts/merge_llama_with_chinese_lora_low_mem.py`): show detailed information during the merge.
|
||||
|
||||
|
||||
**Multi-LoRA weight merging** (applicable to Chinese-Alpaca-Plus and Chinese-Alpaca-Pro).
|
||||
Download the script [merge_llama_with_chinese_lora.py](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_llama_with_chinese_lora.py), and execute the following command:
|
||||
```
|
||||
python merge_llama_with_chinese_lora.py \
|
||||
--base_model path_to_original_llama_hf_dir \
|
||||
--lora_model path_to_chinese_llama_plus_lora,path_to_chinese_alpaca_plus_lora \
|
||||
--output_type huggingface \
|
||||
--output_dir path_to_merged_hf_dir
|
||||
```
|
||||
|
||||
#### Step 3: Check SHA256 after merge.
|
||||
|
||||
Be sure to check [SHA256](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/SHA256.md) after the merge is complete. It is recommended to convert to pth format first and, once the SHA256 matches, convert to HF format if necessary, because the SHA256 of HF-format models changes frequently (the metadata changes).
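For example, the pth output from Step 2 can be spot-checked as follows (a sketch; the exact file names depend on what the merge script writes). Compare the printed digests manually against the entries in SHA256.md:

```shell
sha256sum path_to_merged_pth_dir/*.pth
```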
|
||||
|
||||
#### Step 4: Convert ckpt from huggingface format to model parallel format.
|
||||
|
||||
Based on megatron launcher, execute the following command:
|
||||
|
||||
```
|
||||
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py \
|
||||
--input-model-dir path_to_merged_hf_dir \
|
||||
--output-model-dir path_to_merged_ascendspeed_dir \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--type 7B
|
||||
```
|
||||
Based on deepspeed launcher, execute the following command:
|
||||
|
||||
```
|
||||
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py \
|
||||
--input-model-dir path_to_merged_hf_dir \
|
||||
--output-model-dir path_to_merged_ascendspeed_dir \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--deepspeed
|
||||
```
|
||||
|
||||
|
||||
# Fine-tune
|
||||
## Training
|
||||
Here's a hardware summary of fine-tuning Chinese LLaMA Alpaca-13B:
|
||||
|
||||
| Hardware | Value |
|
||||
| :------: | :---------------------------------------------: |
|
||||
| CPU      | 4x KunPeng 920@3.0GHz, 64 cores per socket, 256 CPUs |
|
||||
| RAM | 32x64 GB DDR4 |
|
||||
| NPU | 8 x Ascend910 64G |
|
||||
|
||||
|
||||
|
||||
## Script
|
||||
|
||||
|
||||
1. Clone the repository to your local server
|
||||
|
||||
```bash
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. Build environment
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# install torch and torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.1.0rc1.post_XXXXXX-cp38-cp38-linux_aarch64.whl
|
||||
|
||||
# install megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
|
||||
# install other packages
|
||||
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
|
||||
```
|
||||
3. Prepare dataset
|
||||
```bash
|
||||
# for llama, download alpaca dataset and save it into $DATA_PATH, like
|
||||
wget https://github.com/tatsu-lab/stanford_alpaca/raw/main/alpaca_data.json
|
||||
|
||||
# download tokenizer configs and (selective) weights from
|
||||
# https://huggingface.co/yahma/llama-7b-hf/tree/main
|
||||
# change "LLaMATokenizer" to "LlamaTokenizer" in tokenizer_config.json (this works around a HuggingFace bug)
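# One non-interactive way to apply the rename (a sketch; back up the file first):
# sed -i 's/LLaMATokenizer/LlamaTokenizer/' $TOKENIZER_PATH/tokenizer_config.json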
|
||||
# save the downloaded tokenizer into $TOKENIZER_PATH
|
||||
mkdir dataset
|
||||
python tools/preprocess_data.py --input alpaca_data.json \
|
||||
--output-prefix $DATA_PATH \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--handler-name GeneralInstructionHandler
|
||||
```
|
||||
|
||||
4. Config Chinese-LLaMA-Alpaca fine-tune script
|
||||
|
||||
Parameters for 7B/13B/33B are selected through `$MODEL_PATH`: for example, if `$MODEL_PATH` matches `*7b*`, the 7B parameters are used.
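The selection is plain shell pattern matching; a minimal sketch of the logic used in the scripts (13B and 33B are matched explicitly, everything else falls back to 7B):

```bash
# Minimal sketch of the size selection done in the fine-tune scripts
if [[ "$MODEL_PATH" == *13[Bb]* ]]; then
    echo "using 13B parameters"
elif [[ "$MODEL_PATH" == *33[Bb]* ]]; then
    echo "using 33B parameters"
else
    echo "using 7B parameters"
fi
```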
|
||||
|
||||
* Based on PyTorch's built-in distributed launcher : [Chinese-LLaMA-Alpaca-7B/13B/33B](finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh)
|
||||
|
||||
```bash
|
||||
bash examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh
|
||||
```
|
||||
|
||||
* Based on Deepspeed launcher : [Chinese-LLaMA-Alpaca-7B/13B/33B](finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh)
|
||||
|
||||
```bash
|
||||
bash examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh
|
||||
```
|
||||
|
||||
## Performance
|
||||
|
||||
### Machine performance
|
||||
|
||||
The performance of the Chinese LLaMA Alpaca-13B in **Ascend910 NPUs** and **A100 GPUs**:
|
||||
Parameter Configuration Reference:finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh
|
||||
|
||||
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | floating point operation (TFLOPs/s) |
|
||||
| :------: | :-------: | :--------------: |:-----------------------------:|:----------------------------:|:-----------------------------------:|
|
||||
| GPUs | Chinese LLaMA Alpaca-13B | 3000 | 5.83 | 1493.73 | 153.91 |
|
||||
| NPUs | Chinese LLaMA Alpaca-13B | 3000 | 6.59 | 1687.04 | 183.81 |
|
||||
|
||||
|
||||
|
||||
### Accuracy of the loss
|
||||
|
||||
NPU vs GPU loss.
|
||||
|
||||
The NPU run is stable: resource usage stays steady, no errors are raised during training, the loss decreases steadily, and the convergence speed is as expected.
|
||||
|
||||

|
||||
|
||||
NPU vs GPU loss relative error.
|
||||
|
||||
The relative error between NPU and GPU Loss is less than 0.02 throughout, as expected.
|
||||
|
||||

|
||||
|
||||
|
||||
## Inference
|
||||
|
||||
We support AscendSpeed Inference for text generation with Chinese LLaMA Alpaca-13B.
|
||||
|
||||
### Script
|
||||
|
||||
We generate text samples using the `generate_alpaca` script. Inference differs from pre-training in several ways; for example, we need to load the pre-trained checkpoint and set the length of the output samples:
|
||||
|
||||
Config Chinese LLaMA Alpaca-13B inference script: [examples/alpaca/generate_alpaca_13B_deepspeed.sh](examples/alpaca/generate_alpaca_13B_deepspeed.sh)
|
||||
|
||||
```shell
|
||||
# modify the model weight path and tokenizer path
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
```shell
|
||||
bash examples/alpaca/generate_alpaca_13B_deepspeed.sh
|
||||
```
|
||||
|
||||
## Evaluation
|
||||
|
||||
C-Eval (CEVAL) evaluation example; download the dataset from [here](https://cevalbenchmark.com/).
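The evaluation command below expects the test split under `./ceval/data/test/`; a minimal layout sketch (how you unpack the download is up to you):

```shell
# Lay the downloaded data out to match DATA_PATH below
mkdir -p ceval/data/test
# copy or unzip the per-subject C-Eval test files into ./ceval/data/test/
```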
|
||||
|
||||
```shell
|
||||
CHECKPOINT=./ckpt/
|
||||
VOCAB_FILE=./alpaca-plus-13b/
|
||||
# Configure the task and data set path.
|
||||
DATA_PATH="./ceval/data/test/"
|
||||
TASK="ceval"
|
||||
# configure generation parameters
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
|
||||
--task-data-path $DATA_PATH \
|
||||
--task $TASK \
|
||||
--seq-length 2048 \
|
||||
--max-new-tokens 2 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13824 \
|
||||
--load ${CHECKPOINT} \
|
||||
--num-attention-heads 40 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path $VOCAB_FILE \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--seed 42 | tee logs/eval_alpaca-13b.log
|
||||
```
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Task</th>
<th>Subset</th>
<th>Model</th>
<th>NPU</th>
<th>OpenSource</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://cevalbenchmark.com/">Ceval</a></td>
|
||||
<td>Val</td>
|
||||
<th>alpaca plus 13B</th>
|
||||
<td>0.408</td>
|
||||
<td><a href="https://github.com/ymcui/Chinese-LLaMA-Alpaca">0.415</a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
## Example
|
||||
Chinese LLaMA Alpaca-13B:
|
||||
|
||||

|
||||
|
||||
All the provided scripts are tested on 910 64GB NPUs for Chinese LLaMA Alpaca-13B(fp16). These scripts might not work for other models or a different number of NPUs.
|
||||
|
||||
> Note: Sometimes NPU memory is not freed when an inference deployment crashes. You can free this memory by running `killall python` in the terminal.
|
||||
|
||||
|
||||
# Citation
|
||||
|
||||
You may also consider citing the original work:
|
||||
|
||||
```
|
||||
@article{chinese-llama-alpaca,
|
||||
title={Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca},
|
||||
author={Cui, Yiming and Yang, Ziqing and Yao, Xin},
|
||||
journal={arXiv preprint arXiv:2304.08177},
|
||||
url={https://arxiv.org/abs/2304.08177},
|
||||
year={2023}
|
||||
}
|
||||
```
|
|
@ -1,136 +0,0 @@
|
|||
# This script is used to run Chinese LLaMA Alpaca with 7B/13B/33B weights based on deepspeed launcher, configured with tensor model parallel size of 1, pipeline model parallel size of 1.
|
||||
# add HCCL_OP_BASE_FFTS_MODE_ENABLE
|
||||
export HCCL_OP_BASE_FFTS_MODE_ENABLE=TRUE
|
||||
|
||||
# modify the script according to your own conda and ascend-toolkit path
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/anaconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export INF_NAN_MODE_ENABLE=0
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
GPUS_PER_NODE=8
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6000
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
|
||||
|
||||
SAVE_PATH=<save-path>
|
||||
# modify the dataset path and tokenizer path below according to your own setup
|
||||
TOKENIZER_PATH=<tokenizer-path>
|
||||
DATA_PATH=<data-path>
|
||||
# your own merged model path
|
||||
MODEL_PATH=<model-path>
|
||||
|
||||
DS_CONFIG=deepspeed_config_13B.json
|
||||
ZERO_STAGE=2
|
||||
MICRO_BATCH=2
|
||||
GRADIENT_ACCUMULATION_STEP=8
|
||||
GLOBAL_BATCH=$(($MICRO_BATCH * $GRADIENT_ACCUMULATION_STEP * $WORLD_SIZE))
|
||||
TRAIN_ITERS=3000
|
||||
SAVE_INTERVAL=$(($TRAIN_ITERS / 4))
|
||||
|
||||
# 7b/13b/33b models use the following parameters respectively
|
||||
if [[ "$MODEL_PATH" == *13[Bb]* ]]; then
|
||||
num_layers=40
|
||||
hidden_size=5120
|
||||
ffn_hidden_size=13824
|
||||
num_heads=40
|
||||
elif [[ "$MODEL_PATH" == *33[Bb]* ]]; then
|
||||
num_layers=60
|
||||
hidden_size=6656
|
||||
ffn_hidden_size=17920
|
||||
num_heads=52
|
||||
else
|
||||
num_layers=32
|
||||
hidden_size=4096
|
||||
ffn_hidden_size=11008
|
||||
num_heads=32
|
||||
fi
|
||||
|
||||
# This is the configuration for deepspeed
|
||||
cat <<EOT > $DS_CONFIG
|
||||
{
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 1000,
|
||||
"initial_scale_power": 8,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1
|
||||
},
|
||||
|
||||
"optimizer": {
|
||||
"type": "Adam"
|
||||
},
|
||||
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE,
|
||||
"allgather_partitions": true,
|
||||
"allgather_bucket_size": 1e8,
|
||||
"overlap_comm": true,
|
||||
"reduce_scatter": true,
|
||||
"reduce_bucket_size": 1e8,
|
||||
"contiguous_gradients": true
|
||||
},
|
||||
|
||||
"gradient_accumulation_steps": $GRADIENT_ACCUMULATION_STEP,
|
||||
"train_batch_size": $GLOBAL_BATCH,
|
||||
"train_micro_batch_size_per_gpu":$MICRO_BATCH,
|
||||
"zero_allow_untested_optimizer": true
|
||||
}
|
||||
EOT
|
||||
|
||||
ds_args=" --deepspeed ${ds_args}"
|
||||
ds_args=" --no-pipeline-parallel ${ds_args}"
|
||||
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
|
||||
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
|
||||
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
|
||||
|
||||
deepspeed pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--is-instruction-dataset \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers $num_layers \
|
||||
--hidden-size $hidden_size \
|
||||
--ffn-hidden-size $ffn_hidden_size \
|
||||
--num-attention-heads $num_heads \
|
||||
--micro-batch-size $MICRO_BATCH \
|
||||
--global-batch-size $GLOBAL_BATCH \
|
||||
--seq-length 2048 \
|
||||
--max-position-embeddings 2048 \
|
||||
--train-iters $TRAIN_ITERS \
|
||||
--lr-decay-iters $TRAIN_ITERS \
|
||||
--save $SAVE_PATH \
|
||||
--load $MODEL_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--checkpoint-activations \
|
||||
--recompute-method block \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 1e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 0 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-iters 200 \
|
||||
--log-interval 1 \
|
||||
--save-interval $SAVE_INTERVAL \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--use-cpu-initialization \
|
||||
--use-flash-attn \
|
||||
--use-fused-rmsnorm \
|
||||
--lora-target-modules query_key_value dense gate_proj dense_h_to_4h dense_4h_to_h \
|
||||
--lora-r 64 \
|
||||
--lora-alpha 128 \
|
||||
$ds_args \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--fp16 | tee logs/train.log
|
|
@ -1,87 +0,0 @@
|
|||
# This script is used to run Chinese LLaMA Alpaca with 7B/13B/33B weights, configured with tensor model parallel size of 4, pipeline model parallel size of 2.
|
||||
# modify the script according to your own conda and ascend-toolkit path
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/anaconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export INF_NAN_MODE_ENABLE=0
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
NPUS_PER_NODE=8
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6000
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
# modify the dataset path and tokenizer path below according to your own setup
|
||||
TOKENIZER_PATH=<tokenizer-path>
|
||||
DATA_PATH=<data-path>
|
||||
# your own merged model path
|
||||
MODEL_PATH=<model-path>
|
||||
|
||||
GLOBAL_BATCH=250
|
||||
MICRO_BATCH=2
|
||||
|
||||
# Distributed setting
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# 7b/13b/33b models use the following parameters respectively
|
||||
if [[ "$MODEL_PATH" == *13[Bb]* ]]; then
|
||||
num_layers=40
|
||||
hidden_size=5120
|
||||
ffn_hidden_size=13824
|
||||
num_heads=40
|
||||
elif [[ "$MODEL_PATH" == *33[Bb]* ]]; then
|
||||
num_layers=60
|
||||
hidden_size=6656
|
||||
ffn_hidden_size=17920
|
||||
num_heads=52
|
||||
else
|
||||
num_layers=32
|
||||
hidden_size=4096
|
||||
ffn_hidden_size=11008
|
||||
num_heads=32
|
||||
fi
|
||||
|
||||
python -m torch.distributed.launch ${DISTRIBUTED_ARGS} \
|
||||
pretrain_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--DDP-impl local \
|
||||
--is-instruction-dataset \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--num-layers $num_layers \
|
||||
--hidden-size $hidden_size \
|
||||
--ffn-hidden-size $ffn_hidden_size \
|
||||
--num-attention-heads $num_heads \
|
||||
--micro-batch-size $MICRO_BATCH \
|
||||
--global-batch-size $GLOBAL_BATCH \
|
||||
--seq-length 2048 \
|
||||
--max-position-embeddings 2048 \
|
||||
--train-iters 1000 \
|
||||
--lr-decay-iters 320000 \
|
||||
--load $MODEL_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--initial-loss-scale 4096 \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 1e-4 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-5 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .01 \
|
||||
--eval-iters 10 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--use-flash-attn \
|
||||
--use-fused-rmsnorm \
|
||||
--lora-target-modules query_key_value dense gate_proj dense_h_to_4h dense_4h_to_h \
|
||||
--lora-r 32 \
|
||||
--lora-alpha 64 \
|
||||
--fp16 | tee logs/train.log
|
|
@ -1,56 +0,0 @@
|
|||
#!/bin/bash
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
NNODES=1
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
CHECKPOINT="your origin deepspeed checkpoint path (TP=1, PP=1)"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
ZERO_STAGE=2
|
||||
MICRO_BATCH_SIZE=1
|
||||
config_json="./ds_config.json"
|
||||
|
||||
cat <<EOT > $config_json
|
||||
{
|
||||
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
|
||||
"gradient_clipping": 1.0,
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE
|
||||
},
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 500,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1,
|
||||
"initial_scale_power": 12
|
||||
},
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
EOT
|
||||
|
||||
deepspeed --num_nodes $NNODES --num_gpus $NPUS_PER_NODE \
|
||||
./tasks/inference/inference_alpaca.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13824 \
|
||||
--num-attention-heads 40 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--load "${CHECKPOINT}" \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--deepspeed \
|
||||
--deepspeed_config ${config_json} \
|
||||
--no-pipeline-parallel \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm
|
|
@@ -1,61 +0,0 @@
#!/bin/bash
|
||||
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
NNODES=1
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
CHECKPOINT="your origin deepspeed checkpoint path (TP=1, PP=1)"
|
||||
LORA_CHECKPOINT="your lora checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
ZERO_STAGE=0
|
||||
MICRO_BATCH_SIZE=1
|
||||
config_json="./ds_config.json"
|
||||
|
||||
cat <<EOT > $config_json
|
||||
{
|
||||
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
|
||||
"gradient_clipping": 1.0,
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE
|
||||
},
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 500,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1,
|
||||
"initial_scale_power": 12
|
||||
},
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
EOT
|
||||
|
||||
deepspeed --num_nodes $NNODES --num_gpus $NPUS_PER_NODE \
|
||||
./tasks/inference/inference_alpaca.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13824 \
|
||||
--num-attention-heads 40 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--load "${CHECKPOINT}" \
|
||||
--lora-load "${LORA_CHECKPOINT}" \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--lora-r 16 \
|
||||
--lora-alpha 32 \
|
||||
--lora-target-modules query_key_value dense gate_proj up_proj down_proj \
|
||||
--deepspeed \
|
||||
--deepspeed_config ${config_json} \
|
||||
--no-pipeline-parallel \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm
|
|
@@ -1,214 +0,0 @@
# Aquila-7B
|
||||
<p align="left">
|
||||
<b>简体中文</b> |
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/aquila/README_en.md">English</a></b>
|
||||
</p>
|
||||
<p></p>
|
||||
|
||||
|
||||
|
||||
|
||||
- [Aquila-7B](#Aquila-7B)
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
- [推理](#推理)
|
||||
- [使用基线数据集进行评估](#使用基线数据集进行评估)
|
||||
|
||||
## 训练
|
||||
|
||||
Aquila-7B 训练的硬件配置如下:
|
||||
|
||||
|
||||
| 硬件 | 设备 | 配置 |
|
||||
|:---:|:---:|:---------------:|
|
||||
| NPU | A+K | 8 x Ascend NPUs |
|
||||
|
||||
### 脚本
|
||||
|
||||
#### 1. 克隆git仓库到个人服务器
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
#### 2. 搭建conda环境
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
# 通过互联网上提供的pip源安装 torch,可能需要尝试合适的包含这个torch==2.1.0版本的pip源
|
||||
pip install torch==2.1.0
|
||||
# 通过PTA上提供的安装包,以whl文件方式安装aarch64架构上的2.1.0版本的torch_npu
|
||||
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
|
||||
# 源码方式安装 megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
# 安装 deepspeed
|
||||
pip install deepspeed==0.9.2
|
||||
# 源码方式安装 deepspeed_npu
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
# 进入AscendSpeed主目录,安装其余依赖包
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
#### 3. 使用浏览器下载 [Aquila-7B模型的配置,tokenizer,和预训练权重](https://huggingface.co/BAAI/Aquila-7B/tree/main)
|
||||
|
||||
保存在 AscendSpeed/HF_Aquila7B_downloaded/ 目录。
|
||||
|
||||
|
||||
#### 4. 数据预处理
|
||||
|
||||
第一步,使用浏览器 [下载数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet), 保存在AscendSpeed/dataset/ 目录
|
||||
|
||||
```shell
|
||||
cd dataset/
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
```
|
||||
|
||||
第二步,使用Aquila-7B指定的tokenizer处理数据集:
|
||||
|
||||
```shell
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./HF_Aquila7B_downloaded/ \
|
||||
--output-prefix ./dataset/aquila \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--handler-name AlpacaPretrainHandler \
|
||||
--tokenizer-not-use-fast \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--pad-vocab-size-to 100008 \
|
||||
--append-eod
|
||||
```
|
||||
|
||||
#### 5. 权重转换
|
||||
|
||||
请注意,如果要在NPU上加载huggingface的预训练权重,需要修改一个deepspeed关于加载权重的bug:
|
||||
|
||||
第一步,要修改一个bug:
|
||||
```shell
|
||||
# 在 `<deepspeed-installed-path>/runtime/engine.py` 文件里的 `_load_zero_checkpoint` 函数,
|
||||
# 将 `if zero_sd_list is None` 改为 `if zero_sd_list is None or len(zero_sd_list) == 0`
|
||||
|
||||
# 原始 deepspeed/runtime/engine.py, 大概 #Lines2746-2748
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None:
|
||||
return False
|
||||
|
||||
# 修改后
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None or len(zero_sd_list) == 0:
|
||||
return False
|
||||
```
|
||||
|
||||
第二步,将模型权重文件从 huggingface 格式转化为 AscendSpeed 格式
|
||||
|
||||
```shell
|
||||
mkdir model_weights
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./HF_Aquila7B_downloaded/ \
|
||||
--output-model-dir ./model_weights/aquila \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--type 7B \
|
||||
--deepspeed
|
||||
```
|
||||
|
||||
|
||||
#### 6. 配置 Aquila-7B 预训练脚本
|
||||
|
||||
```shell
|
||||
# 设置 ascend-toolkit 路径
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
# 修改数据集路径,权重路径,词表路径等
|
||||
TOKENIZER_PATH=./HF_Aquila7B_downloaded #tokenizer 路径
|
||||
DATA=./dataset/aquila_text_document #数据集 路径
|
||||
CHECKPOINT=./model_weights/
|
||||
|
||||
# 如果不需要加载权重,就移除 `--load` 参数
|
||||
```
|
||||
|
||||
#### 7. 启动 Aquila-7B 预训练脚本
|
||||
注意:如果启动训练后出现 protoc 版本相关的报错,卸载已安装的 protobuf 并重新安装 protobuf==3.19.0 即可解决,对应命令见下方示例。
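上述规避方法对应的命令如下(仅作示意):

```shell
pip uninstall -y protobuf
pip install protobuf==3.19.0
```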
|
||||
|
||||
按以下方式启动训练:
|
||||
Aquila-7B
|
||||
```shell
|
||||
bash examples/aquila/pretrain_aquila_7B.sh
|
||||
```
|
||||
|
||||
|
||||
### 性能
|
||||
|
||||
#### 吞吐
|
||||
|
||||
Aquila-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
|
||||
|
||||
| 设备 | 硬件 | 模型 | 迭代数| 样本吞吐 (samples/p/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
|
||||
|------|---------------|------------|------|------------------|----------------------|-----------------|------------------|
|
||||
| NPU | 910b 1node*8p | Aquila-7B | 1024 | 13.260 | 3394.56 | 4.8266 | 148.41 |
|
||||
| 参考 | | Aquila-7B | | | 4078 | | |
|
||||
|
||||
|
||||
|
||||
|
||||
#### 精度
|
||||
|
||||
Aquila-7b NPU vs 参考 loss.
|
||||
|
||||

|
||||
|
||||
|
||||
## 推理
|
||||
|
||||
我们支持使用 Aquila-7B进行文本生成的推理。
|
||||
|
||||
推理与预训练不同,我们需要加载预训练权重,因此需要先完成前面叙述的转换权重的工作。然后配置Aquila-7B推理脚本`examples/aquila/generate_aquila_7B.sh`,CHECKPOINT要指向转换后的权重,而VOCAB_FILE要指向含有Aquila词表文件的目录,在我们的例子中,即下载权重后保存的位置"./HF_Aquila7B_downloaded"。在您实际操作中,请根据情况填入正确的值。
|
||||
|
||||
```shell
|
||||
# 修改模型权重路径和分词器路径
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
启动Aquila-7B推理:
|
||||
```shell
|
||||
bash ./examples/aquila/generate_aquila_7B.sh
|
||||
```
|
||||
|
||||
部分推理样本如下:
|
||||
|
||||
Aquila-7B:
|
||||
|
||||

|
||||
|
||||
|
||||
## 使用基线数据集进行评估
|
||||
|
||||
我们使用 BoolQ benchmark 来评估我们的模型。在[Benchmark下载页面](https://github.com/google-research-datasets/boolean-questions)找到[数据集](https://storage.cloud.google.com/boolq/dev.jsonl)下载后保存。例如,保存在AscendSpeed/boolq/test目录下。
|
||||
|
||||
评估与推理类似,也需要加载转换后的权重。如下使用 Aquila-7B 评估脚本 `tasks/evaluation/eval_aquila_7B.sh`:
|
||||
|
||||
```shell
|
||||
# 开始评估
|
||||
bash tasks/evaluation/eval_aquila_7B.sh
|
||||
```
|
||||
|
||||
Aquila-7B在**Ascend NPU**中的评测表现:
|
||||
|
||||
| 任务 | 模型 | 昇腾值|社区值|
|
||||
|------------------------------------------------------------------------|------------|------|------|
|
||||
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Aquila-7B | 76.9% | |
|
|
@@ -1,203 +0,0 @@
# Aquila-7B
|
||||
|
||||
<p align="left">
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/aquila/README.md">简体中文</a></b> |
|
||||
<b>English</b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
|
||||
- [Aquila-7B/13B](#aquila-7b)
|
||||
- [Training](#training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
- [Machine performance](#machine-performance)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss)
|
||||
- [Inference](#Inference)
|
||||
- [Evaluation with Benchmark](#Evaluation-with-Benchmark)
|
||||
## Training
|
||||
|
||||
Here's a hardware summary of pre-training Aquila-7B:
|
||||
|
||||
| Hardware | Device | Value |
|
||||
|:--------:|:------:|:----------------:|
|
||||
| NPU | A+K | 8 x Ascend NPUs |
|
||||
|
||||
### Script
|
||||
|
||||
#### 1. Clone the repository to your local server:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
#### 2. Build environment
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
# install torch and torch_npu
|
||||
pip install torch==2.1.0
|
||||
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
|
||||
# install megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
# enter the AscendSpeed/ directory and install other packages
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
|
||||
#### 3. Download the Aquila-7B model, config, and tokenizer from [here](https://huggingface.co/BAAI/Aquila-7B/tree/main)
|
||||
|
||||
Save the downloaded files to the AscendSpeed/HF_Aquila7B_downloaded/ directory.
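If a browser download is inconvenient, the same repository can usually be fetched with git-lfs; this is only a sketch and assumes git-lfs is installed and huggingface.co is reachable:

```shell
git lfs install
git clone https://huggingface.co/BAAI/Aquila-7B HF_Aquila7B_downloaded
```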
|
||||
|
||||
|
||||
#### 4. Prepare dataset.
|
||||
|
||||
step1: Download the datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet), save to AscendSpeed/dataset/ directory.
|
||||
|
||||
```shell
|
||||
cd dataset/
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
```
|
||||
|
||||
|
||||
step2: use Aquila-7B specified tokenizer to pre-process data:
|
||||
|
||||
|
||||
```shell
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./HF_Aquila7B_downloaded/ \
|
||||
--output-prefix ./dataset/aquila \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--handler-name AlpacaPretrainHandler \
|
||||
--tokenizer-not-use-fast \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--pad-vocab-size-to 100008 \
|
||||
--append-eod
|
||||
```
|
||||
|
||||
#### 5. Weights convert
|
||||
|
||||
Note: if you want to train with the weights from HuggingFace, first fix a DeepSpeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`.
|
||||
|
||||
step1: fix a bug:
|
||||
```python
|
||||
# original deepspeed/runtime/engine.py, about #Lines2746-2748
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None:
|
||||
return False
|
||||
|
||||
# modified
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None or len(zero_sd_list) == 0:
|
||||
return False
|
||||
```
|
||||
|
||||
step2: convert the model pre-training weights.
|
||||
|
||||
```shell
|
||||
mkdir model_weights
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./HF_Aquila7B_downloaded/ \
|
||||
--output-model-dir ./model_weights/aquila \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--type 7B \
|
||||
--deepspeed
|
||||
```
|
||||
|
||||
|
||||
#### 6. Config Aquila-7B pre-training script.
|
||||
|
||||
```shell
|
||||
# modify the script according to your own ascend-toolkit path
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
# modify script orign dataset path according to your own dataset path
|
||||
TOKENIZER_PATH=./HF_Aquila7B_downloaded #tokenizer path
|
||||
DATA=./dataset/aquila_text_document #processed dataset
|
||||
CHECKPOINT=./model_weights/
|
||||
```
|
||||
*Note that if you do not load weights for pre-training, remove the `--load` parameter from the training script*
|
||||
|
||||
#### 7. Launch Aquila-7B pre-training script.
|
||||
(Note: if you see a protoc version error, uninstall the existing protobuf and install protobuf==3.19.0 to solve it, as shown below.)
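The workaround, spelled out as commands:

```shell
pip uninstall -y protobuf
pip install protobuf==3.19.0
```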
|
||||
|
||||
start training Aquila-7B model:
|
||||
```shell
|
||||
bash examples/aquila/pretrain_aquila_7B.sh
|
||||
```
|
||||
|
||||
|
||||
### Performance
|
||||
|
||||
#### Machine performance
|
||||
|
||||
The performance of Aquila-7B in **Ascend NPU** and **Reference**:
|
||||
|
||||
| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|
||||
|------|---------------|------------|------|------------------|----------------------|-----------------|------------------|
|
||||
| NPU | 910b 1node*8p | Aquila-7B | 1024 | 13.260 | 3394.56 | 4.8266 | 148.41 |
|
||||
| Reference | | Aquila-7B | | | 4078 | | |
|
||||
|
||||
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
Aquila-7B NPU vs Reference loss.
|
||||

|
||||
|
||||
|
||||
## Inference
|
||||
|
||||
We support AscendSpeed Inference for text generation with Aquila 7B model.
|
||||
|
||||
Inference is different from pre-training because it requires loading the pre-trained model weights. Therefore, we need to complete the aforementioned model weight conversion task first, then configure the Aquila-7B Inference shell script `examples/aquila/generate_aquila_7B.sh`. "CHECKPOINT" must point to the converted weights directory, and "VOCAB_FILE" must point to the directory which contains Aquila vocabulary files -- in our example, it is "./HF_Aquila7B_downloaded". In your operation, please fill in correct value based on your actual scenario.
|
||||
|
||||
```shell
|
||||
# please change to actual values
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
Start Aquila-7B Inference:
|
||||
```shell
|
||||
bash ./examples/aquila/generate_aquila_7B.sh
|
||||
```
|
||||
|
||||
Sample results of Aquila-7B Inference:
|
||||
|
||||

|
||||
|
||||
|
||||
## Evaluation with Benchmark
|
||||
|
||||
We use the BoolQ benchmark to evaluate our model. Go to the [BoolQ benchmark page](https://github.com/google-research-datasets/boolean-questions), find the [dataset](https://storage.cloud.google.com/boolq/dev.jsonl), download it, and save it, for example, to the "AscendSpeed/boolq/test" directory.
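A minimal sketch of the expected layout (the dataset link above may require a signed-in browser session, so the file is copied into place manually here):

```shell
mkdir -p boolq/test
# place the downloaded dev.jsonl here
cp /path/to/dev.jsonl boolq/test/
```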
|
||||
|
||||
The evaluation task is similar to the inference task; it also requires loading the pre-trained model weights. You can use the Aquila-7B evaluation script `examples/aquila/eval_aquila_7B.sh` as below:
|
||||
|
||||
```shell
|
||||
# Start evaluation task
|
||||
bash examples/aquila/eval_aquila_7B.sh
|
||||
```
|
||||
|
||||
Sample Aquila-7B performance running in **Ascend NPU**:
|
||||
|
||||
| Task | Model | NPU | Benchmark |
|
||||
|------------------------------------------------------------------------|------------|------|------|
|
||||
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Aquila-7B | 76.9% | |
|
|
@@ -1,44 +0,0 @@
#!/bin/bash
|
||||
source /path/to/cann/ascend-toolkit/set_env.sh
|
||||
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=1
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT="./model_weights/aquila/"
|
||||
VOCAB_FILE="./HF_Aquila7B_downloaded/"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 32 \
|
||||
--hidden-size 4096 \
|
||||
--ffn-hidden-size 11008 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--num-attention-heads 32 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--layernorm-epsilon 1e-6 \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--use-flash-attn \
|
||||
--pad-vocab-size-to 100008
|
|
@@ -1,107 +0,0 @@
# This is an example: train llama using TD,
|
||||
# the number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export INF_NAN_MODE_ENABLE=1
|
||||
source /path/to/cann/ascend-toolkit/set_env.sh
|
||||
|
||||
GPUS_PER_NODE=8
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6000
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
|
||||
|
||||
DATA=./dataset/aquila_text_document
|
||||
CHECKPOINT=./ckpt
|
||||
TOKENIZER_PATH=./HF_Aquila7B_downloaded/
|
||||
|
||||
DS_CONFIG=deepspeed_config_7B.json
|
||||
ZERO_STAGE=2
|
||||
GLOBAL_BATCH=64
|
||||
MICRO_BATCH=2
|
||||
|
||||
cat <<EOT > $DS_CONFIG
|
||||
{
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 1000,
|
||||
"initial_scale_power": 8,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1
|
||||
},
|
||||
|
||||
"optimizer": {
|
||||
"type": "Adam"
|
||||
},
|
||||
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE,
|
||||
"allgather_partitions": true,
|
||||
"allgather_bucket_size": 5e8,
|
||||
"overlap_comm": true,
|
||||
"reduce_scatter": true,
|
||||
"reduce_bucket_size": 5e8,
|
||||
"contiguous_gradients": true
|
||||
},
|
||||
|
||||
"gradient_accumulation_steps": 4,
|
||||
"train_batch_size": $GLOBAL_BATCH,
|
||||
"train_micro_batch_size_per_gpu":$MICRO_BATCH,
|
||||
"zero_allow_untested_optimizer": true
|
||||
}
|
||||
EOT
|
||||
|
||||
ds_args=""
|
||||
ds_args=" --deepspeed ${ds_args}"
|
||||
ds_args=" --no-pipeline-parallel ${ds_args}"
|
||||
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
|
||||
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
|
||||
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
|
||||
|
||||
deepspeed pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 32 \
|
||||
--hidden-size 4096 \
|
||||
--ffn-hidden-size 11008 \
|
||||
--num-attention-heads 32 \
|
||||
--micro-batch-size $MICRO_BATCH \
|
||||
--global-batch-size $GLOBAL_BATCH \
|
||||
--seq-length 2048 \
|
||||
--max-position-embeddings 2048 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--train-iters 1024 \
|
||||
--load $CHECKPOINT \
|
||||
--data-path $DATA \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--attention-dropout 0.0 \
|
||||
--hidden-dropout 0.0 \
|
||||
--init-method-std 0.01 \
|
||||
--split 900,50,50 \
|
||||
--distributed-backend nccl \
|
||||
--lr 1.0e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--lr-warmup-fraction .01 \
|
||||
--min-lr 1.0e-6 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.95 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--no-load-optim \
|
||||
--no-load-rng \
|
||||
--no-bias-gelu-fusion \
|
||||
--use-flash-attn \
|
||||
--layernorm-epsilon 1e-6 \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--pad-vocab-size-to 100008 \
|
||||
--use-fused-rmsnorm \
|
||||
$ds_args \
|
||||
--fp16 | tee logs/train_7B.log
|
|
@@ -1,452 +0,0 @@
# BaiChuan
|
||||
<p align="left">
|
||||
<b>简体中文</b> |
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/baichuan/README_en.md">English</a> </b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
|
||||
|
||||
# 目录
|
||||
|
||||
- [Baichuan-7B](#Baichuan-7B)
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
|
||||
- [Baichuan-13B](#Baichuan-13B)
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
- [推理](#推理)
|
||||
- [lora微调](#lora微调)
|
||||
- [评估](#评估)
|
||||
|
||||
# Baichuan-7B
|
||||
|
||||
## 训练
|
||||
Baichuan-7B 训练的硬件配置如下:
|
||||
|
||||
| 硬件 | 配置 |
|
||||
|:---:|:---------------:|
|
||||
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
### 脚本
|
||||
|
||||
1. 拷贝仓库到你的个人服务器:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. 搭建环境
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# 安装 torch 和 torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
|
||||
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
|
||||
|
||||
# 安装 megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
|
||||
# 安装其余依赖库
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. (可选)准备预训练权重
|
||||
|
||||
从 [huggingface](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main) 下载预训练权重:
|
||||
|
||||
```text
|
||||
# 请注意,如果要加载huggingface的预训练权重,需要修改一个deepspeed关于加载权重的bug:
|
||||
# 在 `<deepspeed-installed-path>/runtime/engine.py` 文件里的 `_load_zero_checkpoint` 函数,
|
||||
# 将 `if zero_sd_list is None` 改为 `if zero_sd_list is None or len(zero_sd_list) == 0`
|
||||
|
||||
# 原始 deepspeed/runtime/engine.py, 大概 #Lines2746-2748
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None:
|
||||
return False
|
||||
|
||||
# 修改后
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None or len(zero_sd_list) == 0:
|
||||
return False
|
||||
```
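
如果不想手动编辑该文件,也可以参考下面的示例用 sed 完成上述修改(仅为示意,deepspeed 的安装路径与 engine.py 的具体内容请以实际环境为准):

```shell
# 定位当前环境中 deepspeed 的 engine.py(路径以实际安装为准)
ENGINE_PY=$(python -c "import deepspeed, os; print(os.path.join(os.path.dirname(deepspeed.__file__), 'runtime', 'engine.py'))")
# 先备份,再将判断条件替换为同时检查空列表(与上文修改一致)
cp "$ENGINE_PY" "$ENGINE_PY.bak"
sed -i 's/if zero_sd_list is None:/if zero_sd_list is None or len(zero_sd_list) == 0:/' "$ENGINE_PY"
```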
|
||||
|
||||
```shell
|
||||
mkdir baichuan-7B-hf
|
||||
cd ./baichuan-7B-hf
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
|
||||
cd ..
|
||||
```
|
||||
|
||||
接着将hf格式的权重转化为AscendSpeed可以加载的形式:
|
||||
```shell
|
||||
mkdir weight
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./baichuan-7B-hf \
|
||||
--output-model-dir ./weight \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--pse \
|
||||
--deepspeed \
|
||||
--use_wpack_rotray \
|
||||
--load_weight_map
|
||||
```
|
||||
|
||||
|
||||
4. 准备数据集
|
||||
|
||||
从 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet) 下载 BaiChuan-7B 的数据集:
|
||||
|
||||
```shell
|
||||
# 下载数据集
|
||||
mkdir dataset_baichuan7B
|
||||
cd ./dataset_baichuan7B
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
|
||||
# 准备数据集
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset_baichuan7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./baichuan-7B-hf \
|
||||
--output-prefix ./dataset_baichuan7B/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
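
预处理完成后,输出目录下通常会生成 `alpaca_text_document.bin` 与 `alpaca_text_document.idx` 两个文件(文件名以 `--output-prefix` 的实际取值为准),可以先简单确认一下再进入下一步:

```shell
# 检查预处理输出是否生成(仅为示意)
ls -lh ./dataset_baichuan7B/alpaca_text_document.*
```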
|
||||
|
||||
|
||||
5. 配置 Baichuan-7B 预训练脚本: examples/baichuan/pretrain_baichuan_zero_7B.sh
|
||||
|
||||
```shell
|
||||
# 修改 ascend-toolkit 路径
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# 修改数据集,权重,词表等路径
|
||||
TOKENIZER_PATH=./baichuan-7B-hf/ #tokenizer 路径
|
||||
DATA_PATH=./dataset_baichuan7B/alpaca_text_document #数据集路径
|
||||
# 如果要加载权重,可以增加参数 `--load ./weight`
|
||||
```
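
如果不想手动编辑脚本,也可以用类似下面的方式就地替换路径(仅为示意,变量名与赋值写法请以脚本实际内容为准):

```shell
SCRIPT=examples/baichuan/pretrain_baichuan_zero_7B.sh
# 就地替换 tokenizer 与数据集路径
sed -i 's|^TOKENIZER_PATH=.*|TOKENIZER_PATH=./baichuan-7B-hf/|' $SCRIPT
sed -i 's|^DATA_PATH=.*|DATA_PATH=./dataset_baichuan7B/alpaca_text_document|' $SCRIPT
```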
|
||||
|
||||
6. 启动 Baichuan-7B 预训练脚本: examples/baichuan/pretrain_baichuan_zero_7B.sh
|
||||
|
||||
```shell
|
||||
bash examples/baichuan/pretrain_baichuan_zero_7B.sh
|
||||
```
|
||||
|
||||
### 性能
|
||||
|
||||
#### 吞吐
|
||||
|
||||
Baichuan-7B 使用 **昇腾芯片** 和 **参考芯片** 的吞吐对比:
|
||||
|
||||
| 设备 | 模型 | 迭代 | 样本吞吐 (samples/p/s) | tokens吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
|
||||
|----|--------|----|--------------------|-----------------------|-----------------|------------------|
|
||||
| NPUs | Baichuan-7B | 1024 | 4.590 | 2350 | 1.74 | 144.95 |
|
||||
| 参考 | Baichuan-7B | 1024 | 3.978 | 2036 | 1.98 | 125.66 |
|
||||
|
||||
|
||||
|
||||
#### 精度
|
||||
|
||||
NPU vs 参考 loss.
|
||||
|
||||

|
||||
|
||||
NPU vs 参考 loss 相对误差.
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
# Baichuan-13B
|
||||
|
||||
## 训练
|
||||
|
||||
Baichuan-13B 训练的硬件配置如下:
|
||||
|
||||
| 硬件 | 配置 |
|
||||
|:---:|:---------------:|
|
||||
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
|
||||
|
||||
### 脚本
|
||||
1. 将仓库拷贝到你的个人服务器:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
2. 搭建环境
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# 安装 torch 和 torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
|
||||
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
|
||||
|
||||
# 安装 megatron
|
||||
git clone https://github.com/NVIDIA/Megatron-LM.git -b 23.05
|
||||
cd Megatron-LM
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# 安装其余依赖库
|
||||
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
|
||||
```
|
||||
|
||||
3. (可选的)准备预训练权重
|
||||
|
||||
从 [huggingface](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main) 下载预训练权重
|
||||
```shell
|
||||
mkdir baichuan-13B-hf
|
||||
cd ./baichuan-13B-hf
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
|
||||
cd ..
|
||||
```
|
||||
|
||||
将 BaiChuan-13B 模型权重从 huggingface 格式转换为 AscendSpeed 格式
|
||||
```shell
|
||||
mkdir weight
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./baichuan-13B-hf \
|
||||
--output-model-dir ./weight \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 13B \
|
||||
--pse
|
||||
```
|
||||
|
||||
4. 准备数据集
|
||||
|
||||
下载 Baichuan-13B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
mkdir dataset_baichuan13B
|
||||
cd ./dataset_baichuan13B
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset_baichuan13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./baichuan-13B-hf \
|
||||
--output-prefix ./dataset_baichuan13B/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
|
||||
|
||||
|
||||
5. 配置 Baichuan-13B 训练脚本: /examples/baichuan/pretrain_baichuan_ptd_13B.sh
|
||||
|
||||
|
||||
```shell
|
||||
# 修改 ascend-toolkit 路径
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# 修改词表,数据集等路径
|
||||
TOKENIZER_PATH=./baichuan-13B-hf
|
||||
DATA_PATH=./dataset_baichuan13B/alpaca_text_document
|
||||
```
|
||||
|
||||
6. 启动 Baichuan-13B 训练脚本: /examples/baichuan/pretrain_baichuan_ptd_13B.sh
|
||||
|
||||
```bash
|
||||
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
|
||||
```
|
||||
|
||||
```text
|
||||
当要开启FA时,在脚本中添加`--use-flash-attn`与`--square-alibi-mask`来使能,同时不要使用`--is-instruction-dataset`.
|
||||
```
|
||||
|
||||
### 性能
|
||||
|
||||
#### 吞吐
|
||||
|
||||
Baichuan-13B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
|
||||
|
||||
| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
|
||||
|:----:|:------------:|:----:|:------------------:|:--------------------:|:---------------:|:----------------:|
|
||||
| NPUs | Baichuan-13B | 1000 | 1.985 | 1016 | 16.121 | 88.47 |
|
||||
| 参考 | Baichuan-13B | 1000 | 1.535 | 862 | 19.852 | 72.39 |
|
||||
|
||||
|
||||
|
||||
#### 精度
|
||||
|
||||
NPU vs 参考 loss.
|
||||
|
||||
|
||||

|
||||
|
||||
NPU vs 参考 loss 相对误差.
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
### 推理
|
||||
我们支持使用 Baichuan-13B 进行文本生成的推理。
|
||||
推理与预训练不同,比如我们需要加载预训练权重和输出样本的长度:
|
||||
|
||||
配置Baichuan-13B推理脚本`examples/baichuan/generate_baichuan_13B_tp8_pp1.sh`。
|
||||
|
||||
```shell
|
||||
# 配置模型权重路径和分词器路径
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
Baichuan-13B:
|
||||
```shell
|
||||
bash ./examples/baichuan/generate_baichuan_13B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
部分推理样本如下:
|
||||

|
||||
|
||||
如果在运行脚本的过程中遇到 "'BaichuanTokenizer' object has no attribute 'sp_model'" 的问题,请参考[huggingface链接解决](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/discussions),或者更新transformers的版本.
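
排查该问题时,可以先确认当前环境中 transformers 的版本(仅为排查手段的示意):

```shell
python -c "import transformers; print(transformers.__version__)"
```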
|
||||
|
||||
|
||||
|
||||
### Lora微调
|
||||
我们支持使用 Baichuan-13B 进行lora微调。
|
||||
微调时使用`指令微调数据集`,制作过程如下,注意添加`--handler-name GeneralInstructionHandler`
|
||||
|
||||
```shell
|
||||
mkdir alpaca_preprocessed
|
||||
python tools/preprocess_data.py \
|
||||
--input ./dataset_baichuan13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--output-prefix ./alpaca_preprocessed/alpaca \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path ./baichuan-13B-hf \
|
||||
--tokenizer-not-use-fast \
|
||||
--handler-name GeneralInstructionHandler \
|
||||
--append-eod
|
||||
```
|
||||
配置 Baichuan-13B 的lora脚本`examples/baichuan/tune_baichuan_ptd_13B.sh`
|
||||
|
||||
```shell
|
||||
# 配置数据集路径、初始megatron权重路径、词表路径以及保存权重的路径
|
||||
DATA_PATH=<data-path>
|
||||
LOAD_CHECKPOINT_PATH=<origin-ckpt-path>
|
||||
SAVE_CHECKPOINT_PATH=<ckpt-path>
|
||||
TOKENIZER_PATH=<tokenizer-path>
|
||||
```
|
||||
|
||||
Baichuan-13B:
|
||||
```shell
|
||||
bash ./examples/baichuan/tune_baichuan_ptd_13B.sh
|
||||
```
|
||||
|
||||
```shell
|
||||
# 再使用微调后的权重进行推理
|
||||
CHECKPOINT=<origin-ckpt-path>
|
||||
LORA_CHECKPOINT=<tune-weight-path>
|
||||
VOCAB_FILE=<tokenizer-path>
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./examples/baichuan/generate_baichuan_lora_13B.sh
|
||||
```
|
||||
|
||||
使用lora进行微调后的推理功能:
|
||||

|
||||
|
||||
|
||||
|
||||
### 评估
|
||||
我们使用boolq基准来评估我们的模型。基准[下载](https://huggingface.co/datasets/boolq).
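
下面给出一种可选的下载方式作为示意(假设已安装 huggingface_hub;数据集 repo id 与目录结构请以评估脚本的实际要求为准):

```shell
pip install -U huggingface_hub
huggingface-cli download --repo-type dataset boolq --local-dir ./boolq
```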
|
||||
|
||||
```shell
|
||||
# 配置原始权重与词表的路径
|
||||
CHECKPOINT=<origin-ckpt-path>
|
||||
VOCAB_FILE=<tokenizer-path>
|
||||
# 配置任务以及数据路径
|
||||
DATA_PATH="./boolq/test/"
|
||||
TASK="boolq"
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./tasks/evaluation/eval_baichuan_13B.sh
|
||||
```
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>任务</th>
|
||||
<th>验证集</th>
|
||||
<th>模型</th>
|
||||
<th>昇腾值</th>
|
||||
<th>社区值</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://huggingface.co/datasets/boolq">Boolq</a></td>
|
||||
<td>Test</td>
|
||||
<th>Baichuan 13B</th>
|
||||
<td>0.747</td>
|
||||
<td><a href="https://opencompass.org.cn/dataset-detail/BoolQ">0.736</a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
|
@ -1,461 +0,0 @@
|
|||
# BaiChuan
|
||||
<p align="left">
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/baichuan/README.md">简体中文</a></b> |
|
||||
<b>English</b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
|
||||
# Contents
|
||||
|
||||
- [Baichuan-7B](#contents)
|
||||
- [Training](#pre-training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
- [Machine performance](#machine-performance)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss)
|
||||
|
||||
- [Baichuan-13B](#contents)
|
||||
- [Training](#pre-training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
- [Machine performance](#machine-performance)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss)
|
||||
- [Inference](#inference)
|
||||
- [Lora](#lora)
|
||||
- [Evaluation](#evaluation)
|
||||
|
||||
# Baichuan-7B
|
||||
|
||||
## Training
|
||||
|
||||
Here's a hardware summary of pre-training Baichuan-7B:
|
||||
|
||||
| Hardware | Value |
|
||||
| :------: | :---------------------------------------------: |
|
||||
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
### Script
|
||||
|
||||
1. Clone the repository to your local server:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. Build environment
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# install torch and torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
|
||||
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
|
||||
|
||||
# install megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
|
||||
# install other packages
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
*Note that if you want to train with weights from huggingface, you first need to fix a deepspeed checkpoint-loading bug: in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`, change `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0`.*
|
||||
|
||||
```text
|
||||
# original deepspeed/runtime/engine.py, about #Lines2746-2748
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None:
|
||||
return False
|
||||
|
||||
# modified
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None or len(zero_sd_list) == 0:
|
||||
return False
|
||||
```
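
To locate the file that needs to be edited, a small helper like the following can be used (a sketch; the resolved path depends on your environment):

```shell
# print the path of the installed deepspeed engine.py
python -c "import deepspeed, os; print(os.path.join(os.path.dirname(deepspeed.__file__), 'runtime', 'engine.py'))"
```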
|
||||
3. Prepare pretrained weights
|
||||
Download the Baichuan-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main)
|
||||
|
||||
```shell
|
||||
mkdir baichuan-7B-hf
|
||||
cd ./baichuan-7B-hf
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
|
||||
cd ..
|
||||
```
|
||||
Use the following script to convert the baichuan-7B pre-trained weights from the huggingface format to the AscendSpeed format.
|
||||
```shell
|
||||
mkdir weight
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./baichuan-7B-hf \
|
||||
--output-model-dir ./weight \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--pse \
|
||||
--deepspeed \
|
||||
--use_wpack_rotray \
|
||||
--load_weight_map
|
||||
```
|
||||
|
||||
|
||||
4. Prepare dataset
|
||||
|
||||
Download the Baichuan-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
# download datasets
|
||||
mkdir dataset_baichuan7B
|
||||
cd ./dataset_baichuan7B
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
|
||||
# process datasets
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset_baichuan7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./baichuan-7B-hf \
|
||||
--output-prefix ./dataset_baichuan7B/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
|
||||
|
||||
|
||||
5. Config Baichuan-7B pre-training script : examples/baichuan/pretrain_baichuan_zero_7B.sh
|
||||
|
||||
```shell
|
||||
# modify the script according to your own ascend-toolkit path
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# modify the original dataset path according to your own dataset path
|
||||
TOKENIZER_PATH=./baichuan-7B-hf/ #tokenizer path
|
||||
DATA_PATH=./dataset_baichuan7B/alpaca_text_document #processed dataset
|
||||
```
|
||||
|
||||
6. Launch Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_zero_7B.sh
|
||||
|
||||
```shell
|
||||
bash examples/baichuan/pretrain_baichuan_zero_7B.sh
|
||||
```
|
||||
*Note that if you want to train from the converted huggingface weights, add `--load ./weight` to the training arguments of `pretrain_baichuan_zero_7B.sh` (around lines 74-107) and rerun the script.*
|
||||
|
||||
|
||||
### Performance
|
||||
|
||||
#### Machine performance
|
||||
|
||||
The performance of Baichuan-7B in **Ascend NPU** and **Reference**:
|
||||
|
||||
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|
||||
| ------ | ----------- | ---------------- | ----------------------------- | ---------------------------- | ------------------------- | ----------------------------------- |
|
||||
| NPUs | Baichuan-7B | 1024 | 4.590 | 2350 | 1.74 | 144.95 |
|
||||
| Reference | Baichuan-7B | 1024 | 3.978 | 2036 | 1.98 | 125.66 |
|
||||
|
||||
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
NPU vs Reference loss.
|
||||
|
||||
The NPU runs smoothly, the resource usage is stable, no errors are reported in the middle of the process, the Loss is on a decreasing trend, and the convergence speed is as expected. The relative error of the average loss is 0.01093, less than 2%, the maximum relative error is 0.1243, and the maximum absolute error is 0.4859. The precision meets the requirements.
|
||||
|
||||

|
||||
|
||||
NPU vs Reference loss relative error.
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
# Baichuan-13B
|
||||
|
||||
## Training
|
||||
|
||||
Here's a hardware summary of pre-training Baichuan-13B:
|
||||
|
||||
| Hardware | Value |
|
||||
| :------: | :---------------------------------------------: |
|
||||
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
|
||||
|
||||
### Script
|
||||
1. Clone the repository to your local server:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
2. Build environment
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# install torch and torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
|
||||
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
|
||||
|
||||
#install megatron
|
||||
git clone https://github.com/NVIDIA/Megatron-LM.git -b 23.05
|
||||
cd Megatron-LM
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# install other packages
|
||||
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
|
||||
```
|
||||
|
||||
3. Prepare pretrained weights
|
||||
|
||||
|
||||
Download the Baichuan-13B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main)
|
||||
```shell
|
||||
mkdir baichuan-13B-hf
|
||||
cd ./baichuan-13B-hf
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
|
||||
cd ..
|
||||
```
|
||||
|
||||
Use the following script to convert the baichuan-13B pre-trained weights from the huggingface format to the AscendSpeed format.
|
||||
```shell
|
||||
mkdir weight
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./baichuan-13B-hf \
|
||||
--output-model-dir ./weight \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 13B \
|
||||
--pse
|
||||
```
|
||||
|
||||
4. Prepare dataset
|
||||
Download the Baichuan-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
mkdir dataset_baichuan13B
|
||||
cd ./dataset_baichuan13B
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
|
||||
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset_baichuan13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./baichuan-13B-hf \
|
||||
--output-prefix ./dataset_baichuan13B/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
|
||||
|
||||
|
||||
5. Config Baichuan-13B pre-training script: /examples/baichuan/pretrain_baichuan_ptd_13B.sh
|
||||
|
||||
|
||||
```shell
|
||||
# modify the script according to your own ascend-toolkit path
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# modify the original dataset path according to your own dataset path
|
||||
TOKENIZER_PATH=./baichuan-13B-hf
|
||||
DATA_PATH=./dataset_baichuan13B/alpaca_text_document
|
||||
```
|
||||
|
||||
6. Launch Baichuan-13B pre-training script: /examples/baichuan/pretrain_baichuan_ptd_13B.sh
|
||||
|
||||
```bash
|
||||
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
|
||||
```
|
||||
|
||||
```text
|
||||
When enabling FA, add '--use-flash-attn' and '--square-alibi-mask' to the script, and do not
|
||||
use '--is-instruction-dataset'.
|
||||
```
|
||||
|
||||
|
||||
### Performance
|
||||
|
||||
#### Machine performance
|
||||
|
||||
The performance of the Baichuan-13B in **Ascend NPU** and **Reference**:
|
||||
|
||||
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|
||||
| :----: | :----------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: | :---------------------------------: |
|
||||
| NPUs | Baichuan-13B | 1000 | 1.985 | 1016 | 16.121 | 88.47 |
|
||||
| Reference | Baichuan-13B | 1000 | 1.535 | 862 | 19.852 | 72.39 |
|
||||
|
||||
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
NPU vs Reference loss.
|
||||
|
||||
The NPU runs smoothly, the resource usage is stable, no errors are reported in the middle of the process, the Loss is on a decreasing trend, and the convergence speed is as expected. The relative error of the average loss is 0.00725, less than 2%, the maximum relative error is 0.01978, and the maximum absolute error is 0.10811. The precision meets the requirements.
|
||||
|
||||

|
||||
|
||||
NPU vs Reference loss relative error.
|
||||
|
||||
The relative error between NPU and Reference Loss is less than 0.02 throughout, as expected.
|
||||
|
||||

|
||||
\
|
||||
\
|
||||
<font size=1>If a file fails to download with 'wget', you can download it manually, making sure the source is trustworthy.</font>
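
As an alternative (a sketch, assuming `huggingface_hub` is installed), the whole model repository can also be fetched in one command:

```shell
pip install -U huggingface_hub
huggingface-cli download baichuan-inc/Baichuan-13B-Base --local-dir ./baichuan-13B-hf
```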
|
||||
|
||||
|
||||
|
||||
### Inference
|
||||
We support AscendSpeed Inference for text generation with Baichuan-13B.
|
||||
Inference differs from pre-training; for example, we need to load the pre-trained checkpoint and set the length of the output samples:
|
||||
|
||||
Config Baichuan-13B inference script `examples/baichuan/generate_baichuan_13B_tp8_pp1.sh`.
|
||||
|
||||
```shell
|
||||
# config the model weight path and tokenizer path
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
Baichuan-13B:
|
||||
```shell
|
||||
bash ./examples/baichuan/generate_baichuan_13B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
Some inference samples are as follows:
|
||||

|
||||
|
||||
If the program raises the error "'BaichuanTokenizer' object has no attribute 'sp_model'", please refer to the [huggingface discussion](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/discussions), or update the transformers version used with torch==2.1.
|
||||
|
||||
|
||||
|
||||
### Lora
|
||||
We support AscendSpeed Lora fine-tuning with Baichuan-13B.
|
||||
Fine-tuning uses an `instruction fine-tuning dataset`, which is produced as follows;
|
||||
note that `--handler-name GeneralInstructionHandler` must be added:
|
||||
|
||||
```shell
|
||||
mkdir alpaca_preprocessed
|
||||
python tools/preprocess_data.py \
|
||||
--input ./dataset_baichuan13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--output-prefix ./alpaca_preprocessed/alpaca \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path ./baichuan-13B-hf \
|
||||
--tokenizer-not-use-fast \
|
||||
--handler-name GeneralInstructionHandler \
|
||||
--append-eod
|
||||
```
|
||||
Configure Baichuan-13B lora script `examples/baichuan/tune_baichuan_ptd_13B.sh`
|
||||
|
||||
```shell
|
||||
# configure the dataset path, initial megatron weight, tokenizer path and the path to save the lora weights
|
||||
DATA_PATH=<data-path>
|
||||
LOAD_CHECKPOINT_PATH=<origin-ckpt-path>
|
||||
SAVE_CHECKPOINT_PATH=<ckpt-path>
|
||||
TOKENIZER_PATH=<tokenizer-path>
|
||||
```
|
||||
|
||||
Baichuan-13B:
|
||||
```shell
|
||||
bash ./examples/baichuan/tune_baichuan_ptd_13B.sh
|
||||
```
|
||||
|
||||
```shell
|
||||
# Then use the fine-tuned weights for inference
|
||||
CHECKPOINT=<origin-ckpt-path>
|
||||
LORA_CHECKPOINT=<tune-weight-path>
|
||||
VOCAB_FILE=<tokenizer-path>
|
||||
```
|
||||
|
||||
Baichuan-13B:
|
||||
```shell
|
||||
bash ./examples/baichuan/generate_baichuan_lora_13B.sh
|
||||
```
|
||||
|
||||
Inference after fine-tuning with LoRA:
|
||||

|
||||
|
||||
|
||||
|
||||
### Evaluation
|
||||
We use the BoolQ benchmark to evaluate our model. Download the benchmark [here](https://huggingface.co/datasets/boolq).
|
||||
|
||||
```shell
|
||||
# config origin weight and vocab file path
|
||||
CHECKPOINT=<origin-ckpt-path>
|
||||
VOCAB_FILE=<tokenizer-path>
|
||||
# config tasks and dataset path
|
||||
DATA_PATH="./boolq/test/"
|
||||
TASK="boolq"
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./tasks/evaluation/eval_baichuan_13B.sh
|
||||
```
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Task</th>
|
||||
<th>Subset</th>
|
||||
<th>Model</th>
|
||||
<th>NPU</th>
|
||||
<th>OpenSource</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://huggingface.co/datasets/boolq">Boolq</a></td>
|
||||
<td>Test</td>
|
||||
<th>Baichuan 13B</th>
|
||||
<td>0.747</td>
|
||||
<td><a href="https://opencompass.org.cn/dataset-detail/BoolQ">0.736</a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
|
@ -1,38 +0,0 @@
|
|||
#!/bin/bash
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13696 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--num-attention-heads 40 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--position-embedding-type alibi \
|
||||
--normalization RMSNorm
|
|
@ -1,44 +0,0 @@
|
|||
#!/bin/bash
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT="your origin megatron ckpt"
|
||||
LORA_CHECKPOINT="tune weight"
|
||||
VOCAB_FILE="tokenizer path"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
./tasks/inference/inference_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13696 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--lora-load "${LORA_CHECKPOINT}" \
|
||||
--num-attention-heads 40 \
|
||||
--seq-length 2048 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--lora-r 16 \
|
||||
--lora-alpha 32 \
|
||||
--lora-target-modules query_key_value dense gate_proj dense_h_to_4h dense_4h_to_h \
|
||||
--position-embedding-type alibi \
|
||||
--normalization RMSNorm
|
|
@ -1,69 +0,0 @@
|
|||
# This is an example: training baichuan using PTD.
|
||||
|
||||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=12892
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
GLOBAL_BATCH=32
|
||||
MICRO_BATCH=1
|
||||
|
||||
DATA_PATH=./data/baichuan_text
|
||||
TOKENIZER_PATH=./tokenizer
|
||||
|
||||
CHECKPOINT_PATH=./ckpt
|
||||
LOAD_PATH=./weight
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
logfile=$(date +%Y%m%d)_$(date +%H%M%S)
|
||||
rm -rf kernel_meta*
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_baichuan.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--sequence-parallel \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13696 \
|
||||
--num-attention-heads 40 \
|
||||
--micro-batch-size $MICRO_BATCH \
|
||||
--global-batch-size $GLOBAL_BATCH \
|
||||
--seq-length 4096 \
|
||||
--normalization RMSNorm \
|
||||
--max-position-embeddings 4096 \
|
||||
--train-iters 100000 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--load $LOAD_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 1e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1e-8 \
|
||||
--weight-decay 1e-1 \
|
||||
--position-embedding-type alibi \
|
||||
--clip-grad 1.0 \
|
||||
--initial-loss-scale 8188.0 \
|
||||
--seed 1234 \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.95 \
|
||||
--adam-eps 1.0e-5 \
|
||||
--log-interval 1 \
|
||||
--save-interval 1000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--fp16 | tee logs/loss_${logfile}.log
|
|
@ -1,109 +0,0 @@
|
|||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6000
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE * $NNODES))
|
||||
|
||||
DATA_PATH=dataset/llama_text_document
|
||||
CHECKPOINT_PATH=ckpt
|
||||
TOKENIZER_PATH=tokenizer
|
||||
|
||||
DS_CONFIG=ds_config.json
|
||||
ZERO_STAGE=2
|
||||
MICRO_BATCH=1
|
||||
GLOBAL_BATCH=8
|
||||
|
||||
rm -rf kernel_meta*
|
||||
|
||||
cat <<EOT >$DS_CONFIG
|
||||
{
|
||||
"gradient_accumulation_steps": 1,
|
||||
"train_batch_size": $GLOBAL_BATCH,
|
||||
"train_micro_batch_size_per_gpu":$MICRO_BATCH,
|
||||
"zero_allow_untested_optimizer": true,
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 1000,
|
||||
"initial_scale_power": 8,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1
|
||||
},
|
||||
|
||||
"optimizer": {
|
||||
"type": "AdamW",
|
||||
"params": {
|
||||
"lr": 2e-5,
|
||||
"eps": 1.0e-8,
|
||||
"betas": [
|
||||
0.9,
|
||||
0.95
|
||||
],
|
||||
"weight_decay": 0.0
|
||||
}
|
||||
},
|
||||
|
||||
"steps_per_print": 1,
|
||||
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE,
|
||||
"contiguous_gradients": false,
|
||||
"allgather_bucket_size": 1e8,
|
||||
"reduce_bucket_size": 1e8,
|
||||
"overlap_comm": true,
|
||||
"reduce_scatter": true
|
||||
}
|
||||
}
|
||||
EOT
|
||||
|
||||
ds_args=""
|
||||
ds_args=" --deepspeed ${ds_args}"
|
||||
ds_args=" --no-pipeline-parallel ${ds_args}"
|
||||
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
|
||||
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
|
||||
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
|
||||
|
||||
SEQ_LEN=4096
|
||||
|
||||
deepspeed pretrain_baichuan.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 32 \
|
||||
--hidden-size 4096 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--ffn-hidden-size 11008 \
|
||||
--num-attention-heads 32 \
|
||||
--micro-batch-size $MICRO_BATCH \
|
||||
--global-batch-size $GLOBAL_BATCH \
|
||||
--seq-length $SEQ_LEN \
|
||||
--max-position-embeddings $SEQ_LEN \
|
||||
--train-iters 1024 \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 0.00001 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-6 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .1 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 1 \
|
||||
--checkpoint-activations \
|
||||
--recompute-method uniform \
|
||||
--recompute-num-layers 30 \
|
||||
--use-flash-attn \
|
||||
$ds_args \
|
||||
--fp16 | tee logs/train.log
|
|
@ -1,83 +0,0 @@
|
|||
# This is an example: training baichuan using PTD.
|
||||
|
||||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6002
|
||||
NNODES=1
|
||||
NODE_RANK=0 #1
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
MICRO_BATCH=4
|
||||
GRADIENT_ACCUMULATION_STEP=1
|
||||
GLOBAL_BATCH=$(($MICRO_BATCH * $GRADIENT_ACCUMULATION_STEP * $WORLD_SIZE))
|
||||
EPOCH=5
|
||||
TRAIN_ITERS=$((5200 / $GLOBAL_BATCH * $EPOCH))
|
||||
echo $TRAIN_ITERS
|
||||
SAVE_INTERVAL=$(($TRAIN_ITERS / 4))
|
||||
echo $SAVE_INTERVAL
|
||||
|
||||
TP=8
|
||||
PP=1
|
||||
|
||||
DATA_PATH=<data-path>
|
||||
LOAD_CHECKPOINT_PATH=<origin-ckpt-path>
|
||||
SAVE_CHECKPOINT_PATH=<ckpt-path>
|
||||
TOKENIZER_PATH=<tokenizer-path>
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch ${DISTRIBUTED_ARGS} \
|
||||
pretrain_baichuan.py \
|
||||
--DDP-impl local \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size ${TP} \
|
||||
--pipeline-model-parallel-size ${PP} \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13696 \
|
||||
--num-attention-heads 40 \
|
||||
--micro-batch-size ${MICRO_BATCH} \
|
||||
--global-batch-size ${GLOBAL_BATCH} \
|
||||
--seq-length 2048 \
|
||||
--max-position-embeddings 2048 \
|
||||
--position-embedding-type alibi \
|
||||
--normalization RMSNorm \
|
||||
--train-iters ${TRAIN_ITERS} \
|
||||
--save ${SAVE_CHECKPOINT_PATH} \
|
||||
--load ${LOAD_CHECKPOINT_PATH} \
|
||||
--checkpoint-activations \
|
||||
--data-path ${DATA_PATH} \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path ${TOKENIZER_PATH} \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 2e-5 \
|
||||
--lr-decay-style cosine \
|
||||
--lr-decay-iters ${TRAIN_ITERS} \
|
||||
--min-lr 0 \
|
||||
--weight-decay 0. \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-iters 200 \
|
||||
--log-interval 1 \
|
||||
--save-interval ${SAVE_INTERVAL} \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--initial-loss-scale 8188.0 \
|
||||
--seed 1234 \
|
||||
--zero-stage 2 \
|
||||
--is-instruction-dataset \
|
||||
--lora-target-modules query_key_value dense gate_proj dense_h_to_4h dense_4h_to_h \
|
||||
--lora-r 16 \
|
||||
--lora-alpha 32 \
|
||||
--fp16 | tee logs/train_13B_megatron.log
|
||||
|
|
@ -1,402 +0,0 @@
|
|||
# BaiChuan2
|
||||
<p align="left">
|
||||
<b>简体中文</b> |
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/baichuan2/README_en.md">English</a> </b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
|
||||
# 目录
|
||||
- [Baichuan2-7B](#Baichuan2-7B)
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
- [Baichuan2-13B](#baichuan2-13b)
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
|
||||
# Baichuan2-7B
|
||||
|
||||
## 训练
|
||||
Baichuan2-7B 训练的硬件配置如下:
|
||||
|
||||
| 硬件 | 配置 |
|
||||
|:---:|:---------------:|
|
||||
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
### 脚本
|
||||
|
||||
1. 拷贝仓库到你的个人服务器:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. 搭建环境
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# 安装 torch 和 torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
|
||||
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
|
||||
|
||||
# 安装 megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
|
||||
# 安装其余依赖库
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. (可选)准备预训练权重
|
||||
|
||||
从 [huggingface](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/tree/main) 下载预训练权重:
|
||||
|
||||
```shell
|
||||
mkdir baichuan2-7B-hf
|
||||
cd ./baichuan2-7B-hf
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/configuration_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/generation_utils.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/modeling_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00001-of-00002.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00002-of-00002.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model.bin.index.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/quantizer.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenization_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer.model
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer_config.json
|
||||
cd ..
|
||||
```
|
||||
|
||||
接着将hf格式的权重转化为AscendSpeed可以加载的形式:
|
||||
```shell
|
||||
mkdir weight
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
# for ptd
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./baichuan2-7B-hf \
|
||||
--output-model-dir ./weight-tp8 \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--merge-mlp \
|
||||
--pse
|
||||
```
|
||||
|
||||
|
||||
4. 准备数据集
|
||||
|
||||
从 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet) 下载 Baichuan2-7B-Base 的数据集:
|
||||
|
||||
```shell
|
||||
# 下载数据集
|
||||
mkdir dataset_baichuan2-7B
|
||||
cd ./dataset_baichuan2-7B
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
|
||||
# 准备数据集
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset_baichuan2-7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./baichuan2-7B-hf \
|
||||
--output-prefix ./dataset_baichuan2-7B/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
|
||||
|
||||
|
||||
5. 配置 Baichuan2-7B 预训练脚本: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
|
||||
|
||||
```shell
|
||||
# 修改 ascend-toolkit 路径
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# 修改数据集,权重,词表等路径
|
||||
TOKENIZER_PATH=./baichuan2-7B-hf/ #tokenizer 路径
|
||||
DATA_PATH=./dataset_baichuan2-7B/alpaca_text_document #数据集路径
|
||||
# 如果要加载权重,可以增加参数 `--load ./weight`
|
||||
```
|
||||
|
||||
6. 启动 Baichuan2-7B 预训练脚本: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
|
||||
|
||||
```shell
|
||||
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
|
||||
```
|
||||
|
||||
### 性能
|
||||
|
||||
#### 吞吐
|
||||
|
||||
Baichuan2-7B 使用 **昇腾芯片** 和 **参考芯片** 的吞吐对比:
|
||||
|
||||
| 设备 | 模型 | 迭代 | 样本吞吐 (samples/s) | tokens吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
|
||||
|----|--------------|----|------------------|-----------------------|-----------------|------------------|
|
||||
| NPUs | Baichuan2-7B | 1024 | 5.125 | 2607 | 24.97 | 124 |
|
||||
| 参考 | Baichuan2-7B | 1024 | -- | 3969 | -- | -- |
|
||||
|
||||
|
||||
|
||||
#### 精度
|
||||
|
||||
NPU vs 参考 loss.
|
||||
|
||||

|
||||
|
||||
NPU vs 参考 loss 相对误差.
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
# Baichuan2-13B
|
||||
|
||||
## 训练
|
||||
Baichuan2-13B 训练的硬件配置如下:
|
||||
|
||||
| 硬件 | 配置 |
|
||||
|:---:|:----------------:|
|
||||
| NPU | 16 x Ascend NPUs |
|
||||
|
||||
### 脚本
|
||||
1. 将仓库拷贝到你的个人服务器:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
2. 搭建环境
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# 安装 torch 和 torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
|
||||
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
|
||||
|
||||
# 安装 megatron
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# 安装其余依赖库
|
||||
# 请注意 transformers==4.29.2
|
||||
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
|
||||
```
|
||||
|
||||
3. (可选的)准备预训练权重
|
||||
|
||||
从 [huggingface](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/tree/main) 下载预训练权重
|
||||
```shell
|
||||
mkdir Baichuan2-13B-Base
|
||||
cd ./Baichuan2-13B-Base
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer.model
|
||||
cd ..
|
||||
```
|
||||
|
||||
将 Baichuan2-13B 模型权重从 huggingface 格式转换为 AscendSpeed 格式
|
||||
```shell
|
||||
mkdir baichuan2-13b-merge
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./Baichuan2-13B-Base \
|
||||
--output-model-dir ./baichuan2-13b-merge \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--merge-mlp \
|
||||
--type 13B \
|
||||
--pse
|
||||
```
|
||||
|
||||
4. 准备数据集
|
||||
|
||||
下载 Baichuan2-13B [数据集](https://huggingface.co/datasets/fnlp/moss-003-sft-data)
|
||||
|
||||
```shell
|
||||
mkdir processed_data_of_moss
|
||||
cd ./processed_data_of_moss
|
||||
wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/blob/main/moss-003-sft-no-tools.jsonl.zip
|
||||
unzip moss-003-sft-no-tools.jsonl.zip
|
||||
cd ..
|
||||
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./processed_data_of_moss/moss-003-sft-no-tools.jsonl \
|
||||
--tokenizer-name-or-path ./Baichuan2-13B-Base \
|
||||
--output-prefix ./processed_data_of_moss/processed_data \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--handler-name MOSSMultiTurnHandler
|
||||
```
|
||||
|
||||
|
||||
5. 配置 Baichuan2-13B 训练脚本: /examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
|
||||
|
||||
```shell
|
||||
# 修改 ascend-toolkit 路径
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# 修改词表、数据集、权重等路径
|
||||
TOKENIZER_PATH=./Baichuan2-13B-Base
|
||||
DATA_PATH=./processed_data_of_moss/processed_data
|
||||
LOAD_PATH=./baichuan2-13b-merge
|
||||
|
||||
# 修正双机运行配置
|
||||
# MASTER_ADDR=xx.xx.x.xxx配置为主服务器ip
|
||||
# NODE_RANK主服务器脚本里设置为0,另一台服务器脚本里设置为1
|
||||
```
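
双机配置的一个示意如下(IP 为占位,请按实际环境填写):

```shell
# 服务器 A(主节点)脚本中
MASTER_ADDR=xx.xx.x.xxx   # 主服务器 IP
NODE_RANK=0

# 服务器 B(从节点)脚本中
MASTER_ADDR=xx.xx.x.xxx   # 仍填写主服务器 IP
NODE_RANK=1
```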
|
||||
|
||||
如果需要开启FA,请遵循以下配置
|
||||
```shell
|
||||
# 修改 ascend-toolkit 路径
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# 修改词表、数据集、权重等路径
|
||||
TOKENIZER_PATH=./Baichuan2-13B-Base
|
||||
DATA_PATH=./processed_data_of_moss/processed_data_packed_input_ids_document
|
||||
LOAD_PATH=./baichuan2-13b-merge
|
||||
|
||||
# 修正双机运行配置
|
||||
# MASTER_ADDR=xx.xx.x.xxx配置为主服务器ip
|
||||
# NODE_RANK主服务器脚本里设置为0,另一台服务器脚本里设置为1
|
||||
# 设置MICRO_BATCH
|
||||
MICRO_BATCH=2
|
||||
|
||||
# 增加FA
|
||||
--use-flash-attn
|
||||
# 删除参数 --fill-neg-inf
|
||||
# 删除微调数据集参数 --is-instruction-dataset
|
||||
# 删除参数 --padding-attention-mask
|
||||
# 增加选择性重计算参数
|
||||
--auto-recompute-device-size 57344
|
||||
```
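
上述删除参数的操作也可以用 sed 批量完成(仅为示意,请先确认脚本中各参数的实际写法与所在行):

```shell
SCRIPT=examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
# 删除 --fill-neg-inf、--is-instruction-dataset、--padding-attention-mask 所在的参数行
sed -i '/--fill-neg-inf/d;/--is-instruction-dataset/d;/--padding-attention-mask/d' $SCRIPT
```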
|
||||
|
||||
6. 启动 Baichuan2-13B 训练脚本: /examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
|
||||
|
||||
```bash
|
||||
# 请在双机上分别运行该命令,双机会自动同步通信,开始进程运行
|
||||
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
|
||||
```
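
如果两台服务器之间已配置免密 ssh,也可以参考下面的方式从一台机器同时拉起(仅为示意,主机名与路径为占位):

```shell
ssh nodeA "cd /path/to/AscendSpeed && bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh" &
ssh nodeB "cd /path/to/AscendSpeed && bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh" &
wait
```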
|
||||
|
||||
### 性能
|
||||
|
||||
#### 吞吐
|
||||
|
||||
不开启FA情况下,Baichuan2-13B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
|
||||
|
||||
| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
|
||||
|:----:|:-------------------------:|:----:|:------------------:|:--------------------:|:---------------:|:----------------:|
|
||||
| NPUs | Baichuan2-13B | 1000 | 3.312 | 852 | 76.89 | 75.40 |
|
||||
| 参考 | | | | 872 | | |
|
||||
|
||||
|
||||
|
||||
#### 精度
|
||||
|
||||
NPU vs 参考 loss.
|
||||
|
||||

|
||||
|
||||
NPU vs 参考 loss 相对误差.
|
||||
|
||||

|
||||
|
||||
### 推理
|
||||
我们支持使用 Baichuan2-13B 进行文本生成的推理。
|
||||
推理与预训练不同,比如我们需要加载预训练权重和输出样本的长度:
|
||||
|
||||
配置Baichuan2-13B推理脚本`examples/baichuan2/generate_baichuan2_13B_tp8_pp1.sh`。
|
||||
|
||||
```shell
|
||||
# 配置模型权重路径和分词器路径
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
Baichuan2-13B:
|
||||
```shell
|
||||
bash examples/baichuan2/generate_baichuan2_13B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
部分推理样本如下:
|
||||

|
||||
|
||||
### 评估
|
||||
我们使用boolq基准来评估我们的模型。基准[下载](https://huggingface.co/datasets/boolq).
|
||||
|
||||
```shell
|
||||
# 配置原始权重与词表的路径
|
||||
CHECKPOINT=<origin-ckpt-path>
|
||||
VOCAB_FILE=<tokenizer-path>
|
||||
# 配置任务以及数据路径
|
||||
DATA_PATH="./boolq/test/"
|
||||
TASK="boolq"
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./tasks/evaluation/eval_baichuan2_13B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>任务</th>
|
||||
<th>验证集</th>
|
||||
<th>模型</th>
|
||||
<th>昇腾值</th>
|
||||
<th>社区值</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://huggingface.co/datasets/boolq">Boolq</a></td>
|
||||
<td>Test</td>
|
||||
<th>Baichuan2 13B</th>
|
||||
<td>0.790</td>
|
||||
<td><a href="https://opencompass.org.cn/dataset-detail/BoolQ">0.670</a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
|
@ -1,420 +0,0 @@
|
|||
# BaiChuan2
|
||||
<p align="left">
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/baichuan2/README.md">简体中文</a></b> |
|
||||
<b>English</b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
|
||||
# Contents
|
||||
- [Baichuan2-7B](#contents)
|
||||
- [Training](#pre-training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
- [Machine performance](#machine-performance)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss)
|
||||
- [Baichuan2-13B](#baichuan2-13b)
|
||||
- [Training](#training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
- [Machine performance](#machine-performance)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss)
|
||||
|
||||
# Baichuan2-7B
|
||||
|
||||
## Training
|
||||
|
||||
Here's a hardware summary of pre-training Baichuan2-7B:
|
||||
|
||||
| Hardware | Value |
|
||||
| :------: | :---------------------------------------------: |
|
||||
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
|
||||
### Script
|
||||
|
||||
1. Clone the repository to your local server:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. Build environment
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# install torch and torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
|
||||
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
|
||||
|
||||
# install megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
|
||||
# install other packages
|
||||
pip install -r requirements.txt
|
||||
```
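
Before moving on, it can help to confirm that the NPU build of PyTorch imports cleanly. A minimal check (assuming torch_npu exposes the usual `torch.npu` namespace after import; not part of the official setup):

```python
# Minimal sanity check for the torch / torch_npu installation above.
import torch
import torch_npu  # noqa: F401  (importing it registers the NPU backend)

print("torch version:", torch.__version__)
print("npu available:", torch.npu.is_available())
print("npu device count:", torch.npu.device_count())
```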
|
||||
|
||||
|
||||
3. Prepare pretrained weights
|
||||
Download the Baichuan2-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/tree/main)
|
||||
|
||||
|
||||
|
||||
```shell
|
||||
mkdir baichuan2-7B-hf
|
||||
cd ./baichuan2-7B-hf
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/configuration_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/generation_utils.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/modeling_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00001-of-00002.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00002-of-00002.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model.bin.index.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/quantizer.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenization_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer.model
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer_config.json
|
||||
cd ..
|
||||
```
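
If you prefer not to fetch the files one by one, an alternative sketch is to mirror the whole repository with `huggingface_hub` (assuming a recent version that supports `local_dir`):

```python
# Optional alternative to the wget commands above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="baichuan-inc/Baichuan2-7B-Base",
    local_dir="./baichuan2-7B-hf",       # same target directory as above
    local_dir_use_symlinks=False,        # copy real files instead of symlinks
)
```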
|
||||
|
||||
To adapt the weights to AscendSpeed's Baichuan2-7B model, convert the pre-trained HuggingFace weights with the following script.
|
||||
```shell
|
||||
mkdir weight
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
# for ptd
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./baichuan2-7B-hf \
|
||||
--output-model-dir ./weight-tp8 \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--merge-mlp \
|
||||
--pse
|
||||
```
|
||||
|
||||
|
||||
4. Prepare dataset
|
||||
|
||||
Download the Baichuan2-7B-Base datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
# download datasets
|
||||
mkdir dataset_baichuan2-7B
|
||||
cd ./dataset_baichuan2-7B
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
|
||||
# process datasets
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset_baichuan2-7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./baichuan2-7B-hf \
|
||||
--output-prefix ./dataset_baichuan2-7B/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
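
Before preprocessing, you can optionally peek at the raw parquet to confirm it downloaded correctly (a small sketch, assuming `pandas` and `pyarrow` are installed; the exact column names depend on the tatsu-lab/alpaca dump):

```python
import pandas as pd

df = pd.read_parquet(
    "./dataset_baichuan2-7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet"
)
print("rows:", len(df))
print("columns:", df.columns.tolist())
# Preview the first record, whatever the columns are.
print(df.iloc[0].to_dict())
```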
|
||||
|
||||
|
||||
5. Config Baichuan2-7B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
|
||||
|
||||
```shell
|
||||
# modify the script according to your own ascend-toolkit path
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# modify the original dataset path according to your own dataset path
|
||||
TOKENIZER_PATH=./baichuan2-7B-hf/ #tokenizer path
|
||||
DATA_PATH=./dataset_baichuan2-7B/alpaca_text_document #processed dataset
|
||||
```
|
||||
*Note: if you want to train with the converted HuggingFace weights, add the `--load ./weight` argument to the script `pretrain_baichuan2_ptd_7B.sh`.*
|
||||
|
||||
6. Launch Baichuan2-7B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
|
||||
|
||||
```shell
|
||||
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
|
||||
```
|
||||
|
||||
|
||||
### Performance
|
||||
|
||||
#### Machine performance
|
||||
|
||||
The performance of Baichuan2-7B in **Ascend NPU** and **Reference**:
|
||||
|
||||
| Device | Model | total Iterations | throughput rate (samples/s) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|
||||
| ------ |--------------| ---------------- | --------------------------- |------------------------------| ------------------------- | ----------------------------------- |
|
||||
| NPUs | Baichuan2-7B | 1024 | 5.125 | 2607 | 24.97 | 124 |
|
||||
| Reference | Baichuan2-7B | 1024 | -- | 3969 | -- | -- |
|
||||
|
||||
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
NPU vs Reference loss.
|
||||
|
||||

|
||||
|
||||
NPU vs Reference loss relative error.
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
|
||||
# Baichuan2-13B
|
||||
|
||||
## Training
|
||||
|
||||
Here's a hardware summary of pre-training Baichuan2-13B:
|
||||
|
||||
| Hardware | Value |
|
||||
| :------: |:----------------:|
|
||||
| NPU | 16 x Ascend NPUs |
|
||||
|
||||
|
||||
|
||||
### Script
|
||||
1. Clone the repository to your local server:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
2. Build environment
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# install torch and torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
|
||||
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
|
||||
|
||||
#install megatron
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# install other packages
|
||||
# set transformers==4.29.2
|
||||
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
|
||||
```
|
||||
|
||||
3. Prepare pretrained weights
|
||||
|
||||
|
||||
Download the Baichuan2-13B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/tree/main)
|
||||
```shell
|
||||
mkdir Baichuan2-13B-Base
|
||||
cd ./Baichuan2-13B-Base
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/configuration_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/generation_config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/modeling_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model.bin.index.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/quantizer.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenization_baichuan.py
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer_config.json
|
||||
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer.model
|
||||
cd ..
|
||||
```
|
||||
|
||||
To adapt the weights to AscendSpeed's Baichuan2-13B model, convert the pre-trained HuggingFace weights with the following script.
|
||||
```shell
|
||||
mkdir baichuan2-13b-merge
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./Baichuan2-13B-Base \
|
||||
--output-model-dir ./baichuan2-13b-merge \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--merge-mlp \
|
||||
--type 13B \
|
||||
--pse
|
||||
```
|
||||
|
||||
4. Prepare dataset
|
||||
Download the Baichuan2-13B datasets from [here](https://huggingface.co/datasets/fnlp/moss-003-sft-data)
|
||||
|
||||
```shell
|
||||
mkdir processed_data_of_moss
|
||||
cd ./processed_data_of_moss
|
||||
wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/resolve/main/moss-003-sft-no-tools.jsonl.zip
|
||||
unzip moss-003-sft-no-tools.jsonl.zip
|
||||
cd ..
|
||||
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./processed_data_of_moss/moss-003-sft-no-tools.jsonl \
|
||||
--tokenizer-name-or-path ./Baichuan2-13B-Base \
|
||||
--output-prefix ./processed_data_of_moss/processed_data \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--handler-name MOSSMultiTurnHandler
|
||||
```
|
||||
|
||||
|
||||
5. Config Baichuan2-13B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
|
||||
|
||||
```shell
|
||||
# modify the script according to your own ascend-toolkit path
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# modify the original dataset path according to your own dataset path
|
||||
TOKENIZER_PATH=./Baichuan2-13B-Base
|
||||
DATA_PATH=./processed_data_of_moss/processed_data
|
||||
LOAD_PATH=./baichuan2-13b-merge
|
||||
|
||||
# set config for two-node parallelism
|
||||
# modify MASTER_ADDR=xx.xx.x.xxx to master node IP
|
||||
# NODE_RANK is set to 0 in the master node script and to 1 in another.
|
||||
```
|
||||
|
||||
If you need to turn on FA (flash attention), make the following changes:
|
||||
|
||||
```shell
|
||||
# modify the script according to your own ascend-toolkit path
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# modify the original dataset path according to your own dataset path
|
||||
TOKENIZER_PATH=./Baichuan2-13B-Base
|
||||
DATA_PATH=./processed_data_of_moss/processed_data
|
||||
LOAD_PATH=./baichuan2-13b-merge
|
||||
|
||||
# set config for two-node parallelism
|
||||
# modify MASTER_ADDR=xx.xx.x.xxx to master node IP
|
||||
# NODE_RANK is set to 0 in the master node script and to 1 in another.
|
||||
# set MICRO_BATCH
|
||||
MICRO_BATCH=2
|
||||
|
||||
# add FA argument
|
||||
--use-flash-attn
|
||||
# remove argument --fill-neg-inf
|
||||
# remove argument --is-instruction-dataset
|
||||
# remove argument --padding-attention-mask
|
||||
# add argument for auto selective recompute strategy
|
||||
--auto-recompute-device-size 57344
|
||||
```
|
||||
|
||||
6. Launch Baichuan2-13B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
|
||||
|
||||
```bash
|
||||
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
|
||||
```
|
||||
|
||||
An hourly pulse-check script runs alongside training and verifies that the job is either running or scheduled.
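
A minimal sketch of what such a pulse check could look like (the log path and the one-hour threshold are assumptions, not the actual script shipped with the repository):

```python
import os
import sys
import time

LOG_FILE = "logs/train.log"      # assumed location of the training log
MAX_SILENCE_SECONDS = 3600       # alert if nothing was written for an hour

if not os.path.exists(LOG_FILE):
    sys.exit(f"pulse check failed: {LOG_FILE} does not exist")

silence = time.time() - os.path.getmtime(LOG_FILE)
if silence > MAX_SILENCE_SECONDS:
    sys.exit(f"pulse check failed: no log update for {silence:.0f} s")

print(f"pulse check ok: last update {silence:.0f} s ago")
```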
|
||||
|
||||
|
||||
### Performance
|
||||
|
||||
#### Machine performance
|
||||
|
||||
With the FA operator off, the performance of Baichuan2-13B on **Ascend NPU** and **Reference** hardware:
|
||||
|
||||
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|
||||
|:----:|:-------------------------:|:----:|:------------------:|:--------------------:|:---------------:|:----------------:|
|
||||
| NPUs | Baichuan2-13B | 1000 | 3.312 | 852 | 76.89 | 75.40 |
|
||||
| Reference | | | | 872 | | |
|
||||
|
||||
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
NPU vs Reference loss.
|
||||
|
||||
The NPU runs smoothly and resource usage is stable; no errors are reported during the run, the loss shows a decreasing trend, and the convergence speed is as expected. The maximum relative error is 0.0266 and the maximum absolute error is 0.0228; the precision meets the requirements.
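
The error figures above can be reproduced from the two loss curves with a few lines of NumPy (a sketch; the loss values are assumed to be exported as plain text files, one value per line):

```python
import numpy as np

npu_loss = np.loadtxt("npu_loss.txt")        # assumed export of the NPU loss curve
ref_loss = np.loadtxt("reference_loss.txt")  # assumed export of the reference curve

abs_err = np.abs(npu_loss - ref_loss)
rel_err = abs_err / np.abs(ref_loss)

print("max absolute error:", abs_err.max())
print("max relative error:", rel_err.max())
print("mean relative error:", rel_err.mean())
```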
|
||||
|
||||

|
||||
|
||||
NPU vs Reference loss relative error.
|
||||
|
||||

|
||||
|
||||
The relative error between NPU and Reference Loss is less than 0.02 throughout, as expected.
|
||||
|
||||
|
||||
### Inference
|
||||
We support AscendSpeed Inference for text generation with Baichuan2-13B.
|
||||
Inference differs from pre-training; for example, we need to load a pre-trained checkpoint and set the length of the output samples:
|
||||
|
||||
Config Baichuan2-13B inference script `examples/baichuan2/generate_baichuan2_13B_tp8_pp1.sh`.
|
||||
|
||||
```shell
|
||||
# config the model weight path and tokenizer path
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
Baichuan2-13B:
|
||||
```shell
|
||||
bash examples/baichuan2/generate_baichuan2_13B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
Some inference samples are as follows:
|
||||

|
||||
|
||||
### Evaluation
|
||||
We use the BoolQ benchmark to evaluate our model. Benchmark [download](https://huggingface.co/datasets/boolq).
|
||||
|
||||
```shell
|
||||
# config origin weight and vocab file path
|
||||
CHECKPOINT=<origin-ckpt-path>
|
||||
VOCAB_FILE=<tokenizer-path>
|
||||
# config tasks and dataset path
|
||||
DATA_PATH="./boolq/test/"
|
||||
TASK="boolq"
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./tasks/evaluation/eval_baichuan2_13B_tp8_pp1.sh
|
||||
```
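
If the evaluation aborts on the dataset, it usually helps to check that `DATA_PATH` really contains the BoolQ records. A small inspection sketch (the file names and field names follow the public BoolQ release and may differ in your local copy):

```python
import glob
import json

files = sorted(glob.glob("./boolq/test/*.jsonl"))
print("found files:", files)

with open(files[0], encoding="utf-8") as f:
    first = json.loads(f.readline())

# Public BoolQ records carry "question", "passage" and a boolean "answer".
print({k: str(v)[:60] for k, v in first.items()})
```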
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Task</th>
|
||||
<th>Subset</th>
|
||||
<th>Model</th>
|
||||
<th>NPU</th>
|
||||
<th>OpenSource</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://huggingface.co/datasets/boolq">Boolq</a></td>
|
||||
<td>Test</td>
|
||||
<th>Baichuan2 13B</th>
|
||||
<td>0.790</td>
|
||||
<td><a href="https://opencompass.org.cn/dataset-detail/BoolQ">0.670</a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
|
@ -1,47 +0,0 @@
|
|||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6011
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT="Your ckpt file path"
|
||||
VOCAB_FILE="Your vocab file path"
|
||||
DATA_PATH="../dataset/boolq/test"
|
||||
TASK="boolq"
|
||||
|
||||
# Different tasks need different max_new_tokens values; please follow the instructions in the README.
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS tasks/evaluation/evaluation_llama.py \
|
||||
--task-data-path $DATA_PATH \
|
||||
--task $TASK \
|
||||
--seq-length 2048 \
|
||||
--max-new-tokens 2 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13696 \
|
||||
--load ${CHECKPOINT} \
|
||||
--num-attention-heads 40 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path $VOCAB_FILE \
|
||||
--tokenizer-not-use-fast \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--square-alibi-mask \
|
||||
--lm-norm-weight \
|
||||
--fill-neg-inf \
|
||||
--bf16 \
|
||||
--micro-batch-size 1 \
|
||||
--position-embedding-type alibi \
|
||||
--normalization RMSNorm \
|
||||
--seed 42 | tee logs/train.log
|
|
@ -1,42 +0,0 @@
|
|||
#!/bin/bash
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13696 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--num-attention-heads 40 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--bf16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--position-embedding-type alibi \
|
||||
--normalization RMSNorm \
|
||||
--square-alibi-mask \
|
||||
--lm-norm-weight \
|
||||
--fill-neg-inf
|
||||
|
|
@ -1,82 +0,0 @@
|
|||
# This is an example: training Baichuan2 using PTD
|
||||
|
||||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=xx.xx.x.xx
|
||||
MASTER_PORT=12892
|
||||
NNODES=2
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
GLOBAL_BATCH=256
|
||||
MICRO_BATCH=1
|
||||
|
||||
DATA_PATH=./data/baichuan2_txt
|
||||
TOKENIZER_PATH=./tokenizer
|
||||
|
||||
CHECKPOINT_PATH=./ckpt
|
||||
LOAD_PATH=./weight
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
logfile=$(date +%Y%m%d)_$(date +%H%M%S)
|
||||
rm -rf kernel_meta*
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_baichuan.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--sequence-parallel \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13696 \
|
||||
--num-attention-heads 40 \
|
||||
--micro-batch-size $MICRO_BATCH \
|
||||
--global-batch-size $GLOBAL_BATCH \
|
||||
--seq-length 4096 \
|
||||
--normalization RMSNorm \
|
||||
--use-fused-rmsnorm \
|
||||
--max-position-embeddings 4096 \
|
||||
--train-iters 1000 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--load $LOAD_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--make-vocab-size-divisible-by 8 \
|
||||
--distributed-backend nccl \
|
||||
--lr 2e-5 \
|
||||
--lr-decay-style constant \
|
||||
--min-lr 1e-8 \
|
||||
--weight-decay 1e-4 \
|
||||
--position-embedding-type alibi \
|
||||
--clip-grad 1.0 \
|
||||
--layernorm-epsilon 1e-6 \
|
||||
--initial-loss-scale 8188.0 \
|
||||
--z-loss-weight 0 \
|
||||
--lm-norm-weight \
|
||||
--keep-last-token \
|
||||
--is-instruction-dataset \
|
||||
--square-alibi-mask \
|
||||
--fill-neg-inf \
|
||||
--padding-attention-mask \
|
||||
--release-fp32-grad \
|
||||
--mlp-layer-fusion \
|
||||
--use-distributed-optimizer \
|
||||
--seed 1234 \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.98 \
|
||||
--adam-eps 1e-8 \
|
||||
--log-interval 1 \
|
||||
--save-interval 1000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--bf16 | tee logs/loss_${logfile}.log
|
||||
|
|
@ -1,71 +0,0 @@
|
|||
# This is an example: training Baichuan2 using PTD
|
||||
|
||||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=12892
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE * $NNODES))
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
DATA_PATH=./dataset_baichuan2-7B/alpaca_text_document
|
||||
TOKENIZER_PATH=./baichuan2-7B-hf/
|
||||
CHECKPOINT_PATH=./ckpt
|
||||
LOAD_PATH=./weight
|
||||
|
||||
MICRO_BATCH=4
|
||||
GLOBAL_BATCH=128
|
||||
|
||||
logfile=$(date +%Y%m%d)_$(date +%H%M%S)
|
||||
rm -rf kernel_meta*
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_baichuan.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--sequence-parallel \
|
||||
--num-layers 32 \
|
||||
--hidden-size 4096 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--ffn-hidden-size 11008 \
|
||||
--num-attention-heads 32 \
|
||||
--micro-batch-size $MICRO_BATCH \
|
||||
--global-batch-size $GLOBAL_BATCH \
|
||||
--seq-length 4096 \
|
||||
--max-position-embeddings 4096 \
|
||||
--train-iters 100000 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--load $LOAD_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 1e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1e-8 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--seed 1234 \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.95 \
|
||||
--log-interval 1 \
|
||||
--save-interval 1000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--mlp-layer-fusion \
|
||||
--use-flash-attn \
|
||||
--use-fused-rotary-pos-emb \
|
||||
--use-fused-rmsnorm \
|
||||
--lm-norm-weight \
|
||||
--bf16 | tee logs/loss_${logfile}.log
|
|
@ -1,562 +0,0 @@
|
|||
# Bloom
|
||||
<p align="left">
|
||||
<b>简体中文</b> |
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/bloom/README_en.md">English</a> </b>
|
||||
</p>
|
||||
|
||||
|
||||
# 目录
|
||||
|
||||
- [Bloom-7B](#Bloom-7B)
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
- [微调](#微调)
|
||||
- [全参微调](#全参微调)
|
||||
- [低参微调](#低参微调)
|
||||
- [推理](#推理)
|
||||
- [deepspeed_pipeline](#deepspeed_pipeline)
|
||||
- [megatron](#megatron)
|
||||
- [评估](#评估)
|
||||
- [Bloom-176B](#Bloom-176B)
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
- [推理](#推理)
|
||||
- [deepspeed_pipeline](#deepspeed_pipeline)
|
||||
- [megatron](#megatron)
|
||||
- [评估](#评估)
|
||||
- [举例](#举例)
|
||||
|
||||
# Bloom-7B
|
||||
|
||||
## 训练
|
||||
|
||||
Bloom-7B 训练的硬件配置如下:
|
||||
|
||||
| **硬件** | **配置** |
|
||||
|--------|----------------|
|
||||
| NPU | 1x8 Ascend 910 |
|
||||
|
||||
## 脚本
|
||||
|
||||
1. 拷贝仓库到本地服务器:
|
||||
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
```
|
||||
|
||||
2. 搭建环境
|
||||
|
||||
```shell
|
||||
# python3.8
|
||||
conda create -n bloom7b python=3.8
|
||||
conda activate bloom7b
|
||||
|
||||
# 安装 torch 和 torch_npu
|
||||
pip install torch-2.0.1-cp38-cp38-manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.0.1rc1.postxxxxxxxx-cp38-cp38-linux_aarch64.whl
|
||||
|
||||
|
||||
# 安装 megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# 安装其余包
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. 准备预训练权重
|
||||
|
||||
首先下载 Bloom-7B 的 [权重](https://huggingface.co/bigscience/bloom-7b1/tree/main)
|
||||
|
||||
```shell
|
||||
mkdir tokenizer
|
||||
cd tokenizer
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
|
||||
...
|
||||
cd ..
|
||||
```
|
||||
|
||||
将权重从 huggingface 格式转化为 ascendspeed 可以加载的格式
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/bloom/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir "your huggingface checkpoint path" \
|
||||
--output-model-dir "your ascendspeed checkpoint path" \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--deepspeed
|
||||
```
|
||||
|
||||
4. 准备数据集
|
||||
|
||||
下载 Bloom-7B 的 [enwiki数据集](https://huggingface.co/datasets/teven/enwiki_100k).
|
||||
|
||||
```shell
|
||||
# 下载数据集
|
||||
mkdir enwiki_100k_datasets
|
||||
cd enwiki_100k_datasets
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00000-of-00006-67bcc7d401923db0.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00001-of-00006-6b8562cbb05789a4.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00002-of-00006-62d2b426a93b0912.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00003-of-00006-36c3d6da04c724b6.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00004-of-00006-48bdf99256dcfa5d.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00005-of-00006-bcb3b3af8d7a4140.parquet
|
||||
cd ..
|
||||
|
||||
# 预处理数据
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./enwiki_100k_datasets/ \
|
||||
--tokenizer-name-or-path ./tokenizer \
|
||||
--output-prefix ./enwiki_100k_datasets/enwiki-100k \
|
||||
--worker 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
|
||||
|
||||
5. 配置 Bloom-7B 预训练脚本: examples/bloom/pretrain_bloom_7b1.sh
|
||||
|
||||
```shell
|
||||
# 修改数据集和词表路径
|
||||
TOKENIZER_NAME_OR_PATH=/home/bloom_data/vocab_file/
|
||||
DATA_PATH=/home/bloom_data/enwiki_100k/enwiki-100k_text_document
|
||||
```
|
||||
|
||||
6. 启动 Bloom-7B 预训练脚本: examples/bloom/pretrain_bloom_7b1.sh
|
||||
|
||||
```shell
|
||||
bash examples/bloom/pretrain_bloom_7b1.sh
|
||||
```
|
||||
|
||||
## 微调
|
||||
|
||||
### 全参微调
|
||||
执行流程与预训练一致,配置训练权重路径如下:
|
||||
```shell
|
||||
# 修改预训练权重路径
|
||||
CHECKPOINT_PATH='./ckpt'
|
||||
```
|
||||
|
||||
### 低参微调
|
||||
启动 Bloom-7B 低参微调脚本: examples/bloom/tune_bloom_7b1.sh
|
||||
|
||||
```shell
|
||||
# 修改预训练权重路径
|
||||
CHECKPOINT_PATH='./ckpt'
|
||||
# 修改数据集和词表路径
|
||||
TOKENIZER_NAME_OR_PATH=/home/bloom_data/vocab_file/
|
||||
DATA_PATH=/home/bloom_data/alpaca/alpaca
|
||||
```
|
||||
|
||||
```shell
|
||||
bash examples/bloom/tune_bloom_7b1.sh
|
||||
```
|
||||
|
||||
## 性能
|
||||
|
||||
### 吞吐
|
||||
|
||||
Bloom-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
|
||||
|
||||
| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | tokens吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
|
||||
|-----|----------|-----|--------------------|-----------------------|-----------------|------------------|
|
||||
| NPUs | Bloom-7B | 1000 | 9.779 | 2503 | 19.63 | 109.85 |
|
||||
| 参考 | Bloom-7B | 1000 | 9.894 | 2525 | 19.40 | 111.19 |
|
||||
|
||||
|
||||
|
||||
### 精度
|
||||
|
||||
NPU vs 参考 loss
|
||||
|
||||
|
||||

|
||||
|
||||
NPU vs 参考 loss 相对误差
|
||||
|
||||

|
||||
|
||||
## 推理
|
||||
|
||||
AscendSpeed 支持 BLOOM 7B 的文本生成推理.
|
||||
|
||||
### deepspeed_pipeline
|
||||
```text
|
||||
# 请注意,评估时需要修改一个deepspeed的bug:
|
||||
# 将 `<deepspeed-installed-path>/runtime/pipe/engine.py` 文件里的第671行注释掉:
|
||||
# self.total_loss += self.loss.detach()
|
||||
```
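
如果不想手工编辑安装目录下的文件,也可以用下面的示意 Python 片段自动注释掉该行(仅为便捷示例,假设 deepspeed 版本为 0.9.2 且该行尚未被注释,执行前请自行确认):

```python
import pathlib

import deepspeed

# 定位已安装的 deepspeed/runtime/pipe/engine.py
engine_py = pathlib.Path(deepspeed.__file__).parent / "runtime" / "pipe" / "engine.py"
target = "self.total_loss += self.loss.detach()"

lines = engine_py.read_text().splitlines(keepends=True)
for i, line in enumerate(lines):
    if target in line and not line.lstrip().startswith("#"):
        indent = line[: len(line) - len(line.lstrip())]
        lines[i] = f"{indent}# {line.lstrip()}"
engine_py.write_text("".join(lines))
print("patched:", engine_py)
```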
|
||||
```shell
|
||||
# 修改 model weight 路径和 tokenizer 路径
|
||||
CHECKPOINT=/home/model/bloom_7B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./examples/bloom/generate_bloom_7b_deepspeed_pipeline.sh
|
||||
```
|
||||
|
||||
|
||||
### megatron
|
||||
|
||||
使用 [convert_weights_from_gptmodelpipe_to_gptmodel.sh](../../tools/ckpt_convert/bloom/convert_weights_from_gptmodelpipe_to_gptmodel.sh) 将bloom-7B的权重转换为推理格式
|
||||
|
||||
```bash
|
||||
SCRIPT_PATH=./tools/ckpt_convert/bloom/convert_weights_from_gptmodelpipe_to_gptmodel_v2.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ${INPUT_PATH} \
|
||||
--output-model-dir ${OUTPUT_PATH} \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B
|
||||
```
|
||||
|
||||
配置 Bloom-7B 推理脚本: examples/bloom/generate_bloom_7B_tp8_pp1.sh
|
||||
|
||||
```shell
|
||||
# 修改 model weight 路径和 tokenizer 路径
|
||||
CHECKPOINT=/home/model/bloom_7B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./examples/bloom/generate_bloom_7B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
## 评估
|
||||
配置 Bloom-7B 评估脚本: tasks/evaluation/evaluate_bloom_7b1.sh
|
||||
|
||||
```shell
|
||||
# 修改 model weight 路径和 tokenizer 路径和数据集任务路径
|
||||
CHECKPOINT=/home/model/bloom_7B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
DATA_PATH="/dataset/boolq/test"
|
||||
TASK="boolq"
|
||||
```
|
||||
|
||||
```text
|
||||
# 请注意,评估时需要修改一个deepspeed的bug:
|
||||
# 将 `<deepspeed-installed-path>/runtime/pipe/engine.py` 文件里的第671行注释掉:
|
||||
# self.total_loss += self.loss.detach()
|
||||
```
|
||||
|
||||
```shell
|
||||
bash tasks/evaluation/evaluate_bloom_7b1.sh
|
||||
```
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>任务</th>
|
||||
<th>验证集</th>
|
||||
<th>模型</th>
|
||||
<th>昇腾值</th>
|
||||
<th>社区值</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://huggingface.co/datasets/boolq">Boolq</a></td>
|
||||
<td>Test</td>
|
||||
<th>bloom 7b</th>
|
||||
<td>0.614</td>
|
||||
<td>--</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://huggingface.co/datasets/cais/mmlu">mmlu</a></td>
|
||||
<td>Test</td>
|
||||
<th>bloom 7b</th>
|
||||
<td>0.251</td>
|
||||
<td><a href="https://www.hayo.com/article/648ace24409528db3186ef1c">0.254</a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
|
||||
# Bloom-176B
|
||||
|
||||
## 训练
|
||||
|
||||
BLOOM 模型源于: [A 176B-Parameter Open-Access Multilingual Language Model](https://arxiv.org/abs/2211.05100).
|
||||
|
||||
> Scao, Teven Le, et al. "Bloom: A 176b-parameter open-access multilingual language model." arXiv preprint arXiv:2211.05100 (2022).
|
||||
|
||||
Bloom-176B 训练的硬件配置:
|
||||
|
||||
| **硬件** | **配置** |
|
||||
|--------|-----------------|
|
||||
| NPU | 12x8 Ascend 910 |
|
||||
|
||||
## 脚本
|
||||
|
||||
1. 将仓库拉取到个人服务器
|
||||
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
```
|
||||
|
||||
2. 搭建环境
|
||||
|
||||
```shell
|
||||
# python3.8
|
||||
conda create -n bloom176b python=3.8
|
||||
conda activate bloom176b
|
||||
|
||||
# 安装 torch 和 torch_npu
|
||||
pip install torch-2.0.1-cp38-cp38-manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.0.1rc1.postxxxxxxxx-cp38-cp38-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend_xxxxxxxx-cp38-cp38-linux_aarch64.whl
|
||||
|
||||
# 安装 megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# 安装其余依赖
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. 准备预训练权重
|
||||
|
||||
下载 Bloom-176B [权重](https://huggingface.co/bigscience/bloom/tree/main)
|
||||
|
||||
```shell
|
||||
mkdir tokenizer
|
||||
cd tokenizer
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
|
||||
...
|
||||
cd ..
|
||||
```
|
||||
|
||||
将权重格式从 huggingface 格式转换为 AscendSpeed 格式:
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/bloom/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir "your huggingface checkpoint path" \
|
||||
--output-model-dir "your ascendspeed checkpoint path" \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 12 \
|
||||
--type 176B \
|
||||
--deepspeed \
|
||||
--partition-layers 6,6,6,6,6,6,6,6,6,6,6,4
|
||||
# partition-layers 指定的是PP当中每个stage的层数,总和需要等于70
|
||||
```
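
转换前可以用下面的小脚本自检 `--partition-layers` 的取值(示意代码:70 为 Bloom-176B 的 transformer 层数,12 为上文使用的流水线并行数):

```python
# 校验 partition-layers:元素个数应等于流水线并行数,总和应等于模型层数
partition_layers = [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4]
pipeline_model_parallel_size = 12
num_layers = 70

assert len(partition_layers) == pipeline_model_parallel_size, "stage 数量不匹配"
assert sum(partition_layers) == num_layers, "各 stage 层数之和应等于 70"
print("partition-layers 配置合法")
```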
|
||||
4. 准备数据集
|
||||
|
||||
下载 Bloom-176B 的 [数据集](https://huggingface.co/datasets/teven/enwiki_100k).
|
||||
|
||||
```shell
|
||||
# 下载数据集
|
||||
mkdir enwiki_100k_datasets
|
||||
cd enwiki_100k_datasets
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00000-of-00006-67bcc7d401923db0.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00001-of-00006-6b8562cbb05789a4.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00002-of-00006-62d2b426a93b0912.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00003-of-00006-36c3d6da04c724b6.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00004-of-00006-48bdf99256dcfa5d.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00005-of-00006-bcb3b3af8d7a4140.parquet
|
||||
cd ..
|
||||
|
||||
# 处理数据集
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./enwiki_100k_datasets/ \
|
||||
--tokenizer-name-or-path ./tokenizer \
|
||||
--output-prefix ./enwiki_100k_datasets/enwiki-100k \
|
||||
--worker 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
|
||||
|
||||
5. 配置 Bloom-176B 预训练脚本: examples/bloom/pretrain_bloom_176b.sh
|
||||
|
||||
```shell
|
||||
# 修改 MASTER_ADDR 为主节点 IP,比如, 90.90.2.166
|
||||
MASTER_ADDR=localhost
|
||||
|
||||
# 修改每个节点的节点序号,主节点序号为 0, 其余节点的序号依次增长到集群节点数量-1
|
||||
NODE_RANK=0
|
||||
|
||||
# 修改数据集路径和词表路径
|
||||
TOKENIZER_NAME_OR_PATH=/home/bloom_data/vocab_file/
|
||||
DATA_PATH=/home/bloom_data/enwiki_100k/enwiki-100k_text_document
|
||||
```
|
||||
|
||||
6. 启动 Bloom-176B 预训练脚本: examples/bloom/pretrain_bloom_176b.sh
|
||||
|
||||
在集群中的每个节点上启动 examples/bloom/pretrain_bloom_176b.sh 脚本
|
||||
|
||||
```shell
|
||||
bash examples/bloom/pretrain_bloom_176b.sh
|
||||
```
|
||||
|
||||
```text
|
||||
当要开启FA时,在脚本中添加`--use-flash-attn`与`--square-alibi-mask`来使能,同时不要使用`--is-instruction-dataset`.
|
||||
```
|
||||
|
||||
## 性能
|
||||
|
||||
### 吞吐
|
||||
|
||||
Bloom-176B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
|
||||
|
||||
| 设备 | 模型 | 总迭代数 | tokens吞吐 (tokens/p/s) |
|
||||
|----|------------|------|-----------------------|
|
||||
| NPUs | Bloom-176B | 1000 | 108 |
|
||||
| 参考 | Bloom-176B | NA | 107 |
|
||||
|
||||
### 精度
|
||||
|
||||
NPU vs 参考 loss
|
||||
|
||||

|
||||
|
||||
单节点loss对比
|
||||
|
||||

|
||||
|
||||
## 推理
|
||||
|
||||
AscendSpeed 支持 BLOOM 176B 的在线文本生成推理(deepspeed 或 megatron)。
|
||||
|
||||
|
||||
### deepspeed_pipeline
|
||||
```text
|
||||
# 请注意,评估时需要修改一个deepspeed的bug:
|
||||
# 将 `<deepspeed-installed-path>/runtime/pipe/engine.py` 文件里的第671行注释掉:
|
||||
# self.total_loss += self.loss.detach()
|
||||
```
|
||||
```shell
|
||||
# 修改 model weight 路径和 tokenizer 路径
|
||||
CHECKPOINT=/home/model/bloom_176B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./examples/bloom/generate_bloom_176b_deepspeed_pipeline.sh
|
||||
```
|
||||
|
||||
### megatron
|
||||
|
||||
使用 [convert_weights_from_gptmodelpipe_to_gptmodel.sh](../../tools/ckpt_convert/bloom/convert_weights_from_gptmodelpipe_to_gptmodel.sh) 脚本将权重转化为推理格式。
|
||||
推理需要两节点运行,需要我们手工将权重同步到两节点下,0号节点需要 1-37 层权重,1号节点需要 38-74 层权重,执行脚本如下:
|
||||
```bash
|
||||
SCRIPT_PATH=./tools/ckpt_convert/bloom/convert_weights_from_gptmodelpipe_to_gptmodel_v2.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ${INPUT_PATH} \
|
||||
--output-model-dir ${OUTPUT_PATH} \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--type 176B
|
||||
```
|
||||
### 脚本
|
||||
|
||||
配置 Bloom-176B 推理脚本: examples/bloom/generate_bloom_176b_2nodes.sh
|
||||
|
||||
```shell
|
||||
# 修改 MASTER_ADDR 为主节点 IP,比如, 90.90.2.166
|
||||
MASTER_ADDR=localhost
|
||||
|
||||
# 修改每个节点的节点序号,主节点序号为 0, 其余节点的序号依次增长到集群节点数量-1
|
||||
NODE_RANK=0
|
||||
|
||||
# 修改数据集路径和词表路径
|
||||
CHECKPOINT=/home/model/bloom_176B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./examples/bloom/generate_bloom_176b_2nodes.sh
|
||||
```
|
||||
|
||||
|
||||
## 评估
|
||||
|
||||
配置 Bloom-176B 评估脚本: tasks/evaluation/evaluate_bloom_176b.sh
|
||||
|
||||
```shell
|
||||
# 修改 model weight 路径和 tokenizer 路径和数据集任务路径
|
||||
CHECKPOINT=/home/model/bloom_176B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
DATA_PATH="/dataset/boolq/test"
|
||||
TASK="boolq"
|
||||
```
|
||||
|
||||
```text
|
||||
# 请注意,评估时需要修改一个deepspeed的bug:
|
||||
# 将 `<deepspeed-installed-path>/runtime/pipe/engine.py` 文件里的第671行注释掉:
|
||||
# self.total_loss += self.loss.detach()
|
||||
```
|
||||
|
||||
```shell
|
||||
bash tasks/evaluation/evaluate_bloom_176b.sh
|
||||
```
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>任务</th>
|
||||
<th>验证集</th>
|
||||
<th>模型</th>
|
||||
<th>昇腾值</th>
|
||||
<th>社区值</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://huggingface.co/datasets/boolq">Boolq</a></td>
|
||||
<td>Test</td>
|
||||
<th>bloom 176b</th>
|
||||
<td>0.657</td>
|
||||
<td>--</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
# 举例
|
||||
1. bloom 7b
|
||||
|
||||

|
||||
2. bloom 176b
|
||||
|
||||

|
||||
|
||||
# 引用
|
||||
|
||||
```
|
||||
@article{scao2022bloom,
|
||||
title={Bloom: A 176b-parameter open-access multilingual language model},
|
||||
author={Scao, Teven Le and Fan, Angela and Akiki, Christopher and Pavlick, Ellie and Ili{\'c}, Suzana and Hesslow, Daniel and Castagn{\'e}, Roman and Luccioni, Alexandra Sasha and Yvon, Fran{\c{c}}ois and Gall{\'e}, Matthias and others},
|
||||
journal={arXiv preprint arXiv:2211.05100},
|
||||
year={2022}
|
||||
}
|
||||
```
|
|
@ -1,583 +0,0 @@
|
|||
# Bloom
|
||||
<p align="left">
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/bloom/README.md">简体中文</a></b> |
|
||||
<b>English</b>
|
||||
</p>
|
||||
|
||||
|
||||
|
||||
# Contents
|
||||
|
||||
- [Bloom-7B](#contents)
|
||||
- [Training](#training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
- [Machine performance](#Machine-performance)
|
||||
- [Accuracy of the loss](#Accuracy-of-the-loss)
|
||||
- [Fine-tune](#fine-tune)
|
||||
- [Full parameter fine-tuning](#Full-parameter-fine-tuning)
|
||||
- [LORA fine-tuning](#LORA-fine-tuning)
|
||||
- [Inference](#inference)
|
||||
- [deepspeed pipeline](#deepspeed-pipeline)
|
||||
- [megatron](#megatron)
|
||||
- [Evaluation](#evaluation)
|
||||
|
||||
- [Bloom-176B](#contents)
|
||||
- [Training](#training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
- [Machine performance](#machine-performance)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss)
|
||||
- [Inference](#inference)
|
||||
- [deepspeed pipeline](#deepspeed-pipeline)
|
||||
- [megatron](#megatron)
|
||||
- [Evaluation](#evaluation)
|
||||
- [Example](#example)
|
||||
|
||||
|
||||
# Bloom-7B
|
||||
|
||||
## Training
|
||||
|
||||
|
||||
Here's a hardware summary of pre-training Bloom-7B:
|
||||
|
||||
| **Hardware** | **Value** |
|
||||
| ------------ | --------- |
|
||||
| NPU | 1x8 Ascend 910 |
|
||||
|
||||
## Script
|
||||
|
||||
1. Clone the repository to your local server
|
||||
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
```
|
||||
|
||||
2. Build environment
|
||||
|
||||
```shell
|
||||
# python3.8
|
||||
conda create -n bloom7b python=3.8
|
||||
conda activate bloom7b
|
||||
|
||||
# install torch and torch_npu and apex
|
||||
pip install torch-2.0.1-cp38-cp38-manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.0.1rc1.postxxxxxxxx-cp38-cp38-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
|
||||
|
||||
# install megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# install other packages
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. Prepare pretrained weights
|
||||
|
||||
Download the Bloom-7B tokenizer from [here](https://huggingface.co/bigscience/bloom-7b1/tree/main).
|
||||
|
||||
```shell
|
||||
mkdir tokenizer
|
||||
cd tokenizer
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
|
||||
cd ..
|
||||
```
|
||||
|
||||
We provide scripts that convert the pretrained weights into weights that AscendSpeed can load and use for training and inference.
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/bloom/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir "your huggingface checkpoint path" \
|
||||
--output-model-dir "your ascendspeed checkpoint path" \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--deepspeed
|
||||
```
|
||||
|
||||
4. Prepare dataset
|
||||
|
||||
Download the Bloom-7B datasets from [here](https://huggingface.co/datasets/teven/enwiki_100k). The downloaded dataset is in the parquet format by default.
|
||||
You need to convert the dataset to the loose json format and preprocess the dataset.
|
||||
|
||||
```shell
|
||||
# download datasets
|
||||
mkdir enwiki_100k_datasets
|
||||
cd enwiki_100k_datasets
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00000-of-00006-67bcc7d401923db0.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00001-of-00006-6b8562cbb05789a4.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00002-of-00006-62d2b426a93b0912.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00003-of-00006-36c3d6da04c724b6.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00004-of-00006-48bdf99256dcfa5d.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00005-of-00006-bcb3b3af8d7a4140.parquet
|
||||
cd ..
|
||||
|
||||
# preprocess datasets
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./enwiki_100k_datasets/ \
|
||||
--tokenizer-name-or-path ./tokenizer \
|
||||
--output-prefix ./enwiki_100k_datasets/enwiki-100k \
|
||||
--worker 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
|
||||
|
||||
5. Config Bloom-7B pre-training script: examples/bloom/pretrain_bloom_7b1.sh
|
||||
|
||||
```shell
|
||||
# modify the datasets path and tokenizer path
|
||||
TOKENIZER_NAME_OR_PATH=/home/bloom_data/vocab_file/
|
||||
DATA_PATH=/home/bloom_data/enwiki_100k/enwiki-100k_text_document
|
||||
```
|
||||
|
||||
6. Launch Bloom-7B pre-training script: examples/bloom/pretrain_bloom_7b1.sh
|
||||
|
||||
Run the examples/bloom/pretrain_bloom_7b1.sh script.
|
||||
|
||||
```shell
|
||||
bash examples/bloom/pretrain_bloom_7b1.sh
|
||||
```
|
||||
|
||||
## Fine-tune
|
||||
|
||||
### Full parameter fine-tuning
|
||||
|
||||
The execution process is the same as for pre-training. Configure the training weight path as follows:
|
||||
|
||||
```shell
|
||||
# modify the model weight path
|
||||
CHECKPOINT_PATH='./ckpt'
|
||||
```
|
||||
|
||||
### LORA fine-tuning
|
||||
|
||||
Launch the Bloom-7B LoRA fine-tuning script: examples/bloom/tune_bloom_7b1.sh
|
||||
|
||||
```shell
|
||||
# modify the model weight path
|
||||
CHECKPOINT_PATH='./ckpt'
|
||||
|
||||
# modify the datasets path and tokenizer path
|
||||
TOKENIZER_NAME_OR_PATH=/home/bloom_data/vocab_file/
|
||||
DATA_PATH=/home/bloom_data/alpaca/alpaca
|
||||
```
|
||||
|
||||
```shell
|
||||
bash examples/bloom/tune_bloom_7b1.sh
|
||||
```
|
||||
|
||||
## Performance
|
||||
|
||||
### Machine Performance
|
||||
|
||||
The performance of Bloom-7B in **Ascend NPU** and **Reference**:
|
||||
|
||||
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|
||||
| ------ |----------|------------------|-------------------------------|------------------------------|---------------------------|-------------------------------------|
|
||||
| NPUs | Bloom-7B | 1000 | 9.779 | 2503 | 19.63 | 109.85 |
|
||||
| Reference | Bloom-7B | 1000 | 9.894 | 2525 | 19.40 | 111.19 |
|
||||
|
||||
|
||||
|
||||
### Accuracy of the loss
|
||||
|
||||
NPU vs GPU loss.
|
||||
|
||||
The NPU runs smoothly and resource usage is stable; no errors are reported during the run, the loss shows a decreasing trend, and the convergence speed is as expected.
|
||||
|
||||

|
||||
|
||||
NPU vs GPU loss relative error.
|
||||
|
||||

|
||||
|
||||
## Inference
|
||||
|
||||
We support AscendSpeed Inference for text generation with BLOOM 7B (deepspeed or megatron).
|
||||
|
||||
### deepspeed pipeline
|
||||
```text
|
||||
# Note that, a deepspeed bug needs to be fixed during evaluation:
|
||||
# Comment out line 671 in the file `<deepspeed-installed-path>/runtime/pipe/engine.py`:
|
||||
# self.total_loss += self.loss.detach()
|
||||
```
|
||||
```shell
|
||||
# modify the model weight path and tokenizer path
|
||||
CHECKPOINT=/home/model/bloom_7B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./examples/bloom/generate_bloom_7b_deepspeed_pipeline.sh
|
||||
```
|
||||
|
||||
### megatron
|
||||
|
||||
Use [convert_weights_from_gptmodelpipe_to_gptmodel.sh](../../tools/ckpt_convert/bloom/convert_weights_from_gptmodelpipe_to_gptmodel.sh) to convert DeepSpeed checkpoints to the Megatron format.
|
||||
|
||||
```bash
|
||||
SCRIPT_PATH=./tools/ckpt_convert/bloom/convert_weights_from_gptmodelpipe_to_gptmodel_v2.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ${INPUT_PATH} \
|
||||
--output-model-dir ${OUTPUT_PATH} \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B
|
||||
```
|
||||
|
||||
We generate text samples using the `generate_bloom` script. Inference differs from pre-training; for example, we need to load a pre-trained checkpoint and set the length of the output samples:
|
||||
|
||||
Config Bloom-7B inference script: examples/bloom/generate_bloom_7B_tp8_pp1.sh
|
||||
|
||||
```shell
|
||||
# modify the model weight path and tokenizer path
|
||||
CHECKPOINT=/home/model/bloom_7B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./examples/bloom/generate_bloom_7B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
## Evaluation
|
||||
Config Bloom-7B evaluation script: tasks/evaluation/evaluate_bloom_7b1.sh
|
||||
|
||||
```shell
|
||||
# modify the model weight path and tokenizer path
|
||||
CHECKPOINT=/home/model/bloom_7B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
DATA_PATH="/dataset/boolq/test"
|
||||
TASK="boolq"
|
||||
```
|
||||
|
||||
```text
|
||||
# Note that, a deepspeed bug needs to be fixed during evaluation:
|
||||
# Comment out line 671 in the file `<deepspeed-installed-path>/runtime/pipe/engine.py`:
|
||||
# self.total_loss += self.loss.detach()
|
||||
```
|
||||
|
||||
```shell
|
||||
bash tasks/evaluation/evaluate_bloom_7b1.sh
|
||||
```
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Task</th>
|
||||
<th>Subset</th>
|
||||
<th>Model</th>
|
||||
<th>NPU</th>
|
||||
<th>OpenSource</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://huggingface.co/datasets/boolq">Boolq</a></td>
|
||||
<td>Test</td>
|
||||
<th>bloom 7b</th>
|
||||
<td>0.614</td>
|
||||
<td>--</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://huggingface.co/datasets/cais/mmlu">mmlu</a></td>
|
||||
<td>Test</td>
|
||||
<th>bloom 7b</th>
|
||||
<td>0.251</td>
|
||||
<td><a href="https://www.hayo.com/article/648ace24409528db3186ef1c">0.254</a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
|
||||
# Bloom-176B
|
||||
|
||||
## Training
|
||||
|
||||
BLOOM model is from: [A 176B-Parameter Open-Access Multilingual Language Model](https://arxiv.org/abs/2211.05100).
|
||||
|
||||
> Scao, Teven Le, et al. "Bloom: A 176b-parameter open-access multilingual language model." arXiv preprint arXiv:2211.05100 (2022).
|
||||
|
||||
Here's a hardware summary of pre-training Bloom-176B:
|
||||
|
||||
| **Hardware** | **Value** |
|
||||
| ------------ | --------- |
|
||||
| NPU | 12x8 Ascend 910 |
|
||||
|
||||
|
||||
## Script
|
||||
|
||||
1. Clone the repository to your local server
|
||||
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
```
|
||||
|
||||
2. Build environment
|
||||
|
||||
```shell
|
||||
# python3.8
|
||||
conda create -n bloom176b python=3.8
|
||||
conda activate bloom176b
|
||||
|
||||
# install torch and torch_npu and apex
|
||||
pip install torch-2.0.1-cp38-cp38-manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.0.1rc1.postxxxxxxxx-cp38-cp38-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend_xxxxxxxx-cp38-cp38-linux_aarch64.whl
|
||||
|
||||
# install megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# install other packages
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. Prepare pretrained weights
|
||||
|
||||
Download the Bloom-176B tokenizer from [here](https://huggingface.co/bigscience/bloom/tree/main).
|
||||
|
||||
```shell
|
||||
mkdir tokenizer
|
||||
cd tokenizer
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
|
||||
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
|
||||
cd ..
|
||||
```
|
||||
We provide scripts that convert the pretrained weights into weights that AscendSpeed can load and use for training and inference. `--partition-layers` specifies how layers are partitioned across pipeline stages; you can change it to a different split, but the elements of `--partition-layers` must sum to 70 and their number must equal `--pipeline-model-parallel-size` (a quick validation sketch follows the conversion command below).
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/bloom/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir "your huggingface checkpoint path" \
|
||||
--output-model-dir "your ascendspeed checkpoint path" \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 12 \
|
||||
--type 176B \
|
||||
--deepspeed \
|
||||
--partition-layers 6,6,6,6,6,6,6,6,6,6,6,4
|
||||
```
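
A quick sanity check for the `--partition-layers` value before running the conversion (a sketch; 70 is the Bloom-176B layer count and 12 the pipeline-parallel size used above):

```python
# The number of entries must match --pipeline-model-parallel-size,
# and the entries must sum to the 70 transformer layers of Bloom-176B.
partition_layers = [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4]
pipeline_model_parallel_size = 12
num_layers = 70

assert len(partition_layers) == pipeline_model_parallel_size
assert sum(partition_layers) == num_layers
print("partition-layers layout is consistent")
```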
|
||||
4. Prepare dataset
|
||||
|
||||
Download the Bloom-176B datasets from [here](https://huggingface.co/datasets/teven/enwiki_100k). The downloaded dataset is in the parquet format by default.
|
||||
You need to convert the dataset to the loose json format and preprocess the dataset.
|
||||
|
||||
```shell
|
||||
# download datasets
|
||||
mkdir enwiki_100k_datasets
|
||||
cd enwiki_100k_datasets
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00000-of-00006-67bcc7d401923db0.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00001-of-00006-6b8562cbb05789a4.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00002-of-00006-62d2b426a93b0912.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00003-of-00006-36c3d6da04c724b6.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00004-of-00006-48bdf99256dcfa5d.parquet
|
||||
wget https://huggingface.co/datasets/teven/enwiki_100k/resolve/main/data/train-00005-of-00006-bcb3b3af8d7a4140.parquet
|
||||
cd ..
|
||||
|
||||
# preprocess datasets
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./enwiki_100k_datasets/ \
|
||||
--tokenizer-name-or-path ./tokenizer \
|
||||
--output-prefix ./enwiki_100k_datasets/enwiki-100k \
|
||||
--worker 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
|
||||
|
||||
5. Config Bloom-176B pre-training script: examples/bloom/pretrain_bloom_176b.sh
|
||||
|
||||
```shell
|
||||
# modify MASTER_ADDR to the IP address of the master node in the cluster.
|
||||
# on the master node MASTER_ADDR can stay localhost; on the other nodes set it to the master node's IP, for example, 90.90.2.166
|
||||
MASTER_ADDR=localhost
|
||||
|
||||
# modify the rank number of a node. The rank number of the master node is 0, and the rank number of other nodes increases in ascending order.
|
||||
NODE_RANK=0
|
||||
|
||||
# modify the datasets path and tokenizer path
|
||||
TOKENIZER_NAME_OR_PATH=/home/bloom_data/vocab_file/
|
||||
DATA_PATH=/home/bloom_data/enwiki_100k/enwiki-100k_text_document
|
||||
```
|
||||
|
||||
6. Launch Bloom-176B pre-training script: examples/bloom/pretrain_bloom_176b.sh
|
||||
|
||||
Run the examples/bloom/pretrain_bloom_176b.sh on all nodes in the cluster.
|
||||
|
||||
```shell
|
||||
bash examples/bloom/pretrain_bloom_176b.sh
|
||||
```
|
||||
|
||||
```text
|
||||
When enabling FA, add '--use-flash-attn' and '--square-alibi-mask' to the script, and do not
|
||||
use '--is-instruction-dataset'.
|
||||
```
|
||||
|
||||
## Performance
|
||||
|
||||
### Machine Performance
|
||||
|
||||
The performance of Bloom-176B in **Ascend NPU** and **Reference**:
|
||||
|
||||
| Devices | Model | total iterations | throughput rate (tokens/s/p) |
|
||||
| ------- | ----- |-----------------| ---------------------------- |
|
||||
| NPUs | Bloom-176B | 1000 | 108 |
|
||||
| Reference | Bloom-176B | NA | 107 |
|
||||
|
||||
### Accuracy of the loss
|
||||
|
||||
NPU vs GPU loss. The loss curves of GPUs and NPUs basically coincide.
|
||||
|
||||

|
||||
|
||||
We reduced the number of model layers to six; the following figure shows the loss comparison between the NPU
|
||||
and GPU on a single-node system. The average relative error is 0.1%, well below 2%, and the proportion of points with a relative error below 2% reaches 99.9%. The average absolute error is 0.04. The precision meets the requirements.
|
||||
|
||||

|
||||
|
||||
## Inference
|
||||
|
||||
We support AscendSpeed Inference for text generation with BLOOM 176B (deepspeed or megatron).
|
||||
|
||||
### deepspeed pipeline
|
||||
```text
|
||||
# Note that, a deepspeed bug needs to be fixed during evaluation:
|
||||
# Comment out line 671 in the file `<deepspeed-installed-path>/runtime/pipe/engine.py`:
|
||||
# self.total_loss += self.loss.detach()
|
||||
```
|
||||
```shell
|
||||
# modify the model weight path and tokenizer path
|
||||
CHECKPOINT=/home/model/bloom_176B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./examples/bloom/generate_bloom_176b_deepspeed_pipeline.sh
|
||||
```
|
||||
|
||||
### megatron
|
||||
|
||||
Use [convert_weights_from_gptmodelpipe_to_gptmodel.sh](../../tools/ckpt_convert/bloom/convert_weights_from_gptmodelpipe_to_gptmodel.sh) to convert the DeepSpeed checkpoint to the Megatron format.
|
||||
|
||||
We use two-node inference. First, manually distribute the pre-trained checkpoint across the two machines: node 0 requires layers 1-37 and node 1 requires layers 38-74. Then set the directories and related parameters in the conversion script and run the conversion.
|
||||
```bash
|
||||
SCRIPT_PATH=./tools/ckpt_convert/bloom/convert_weights_from_gptmodelpipe_to_gptmodel_v2.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ${INPUT_PATH} \
|
||||
--output-model-dir ${OUTPUT_PATH} \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--type 176B
|
||||
```
|
||||
### Script
|
||||
We generate text samples using the `generate_bloom` script. Inference differs from pre-training in that, for example, we need to load a pre-trained checkpoint and set the length of the output samples:
|
||||
|
||||
Config Bloom-176B inference script: examples/bloom/generate_bloom_176b_2nodes.sh
|
||||
|
||||
```shell
|
||||
# modify MASTER_ADDR to the IP address of the master node in the cluster.
|
||||
# on the master node keep localhost; on the other nodes set it to the master node's IP address, for example, 90.90.2.166
|
||||
MASTER_ADDR=localhost
|
||||
|
||||
# modify the rank number of a node. The rank number of the master node is 0, and the rank number of other nodes increases in ascending order.
|
||||
NODE_RANK=0
|
||||
|
||||
# modify the model weight path and tokenizer path
|
||||
CHECKPOINT=/home/model/bloom_176B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
```
|
||||
|
||||
```shell
|
||||
bash ./examples/bloom/generate_bloom_176b_2nodes.sh
|
||||
```
|
||||
|
||||
## Evaluation
|
||||
Config Bloom-176B evaluation script: tasks/evaluation/evaluate_bloom_176b.sh
|
||||
|
||||
```shell
|
||||
# modify the model weight path and tokenizer path
|
||||
CHECKPOINT=/home/model/bloom_176B
|
||||
VOCAB_FILE=/home/bloom_data/vocab_file/
|
||||
DATA_PATH="/dataset/boolq/test"
|
||||
TASK="boolq"
|
||||
```
|
||||
|
||||
```text
|
||||
# Note that, a deepspeed bug needs to be fixed during evaluation:
|
||||
# Comment out line 671 in the file `<deepspeed-installed-path>/runtime/pipe/engine.py`:
|
||||
# self.total_loss += self.loss.detach()
|
||||
```
|
||||
|
||||
```shell
|
||||
bash tasks/evaluation/evaluate_bloom_176b.sh
|
||||
```
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Task</th>
|
||||
<th>Subset</th>
|
||||
<th>Model</th>
|
||||
<th>NPU</th>
|
||||
<th>OpenSource</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="https://huggingface.co/datasets/boolq">Boolq</a></td>
|
||||
<td>Test</td>
|
||||
<th>bloom 176b</th>
|
||||
<td>0.657</td>
|
||||
<td>--</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
## Example
|
||||
1. bloom 7b
|
||||

|
||||
2. bloom 176b
|
||||

|
||||
|
||||
All the provided scripts are tested on 910 64GB NPUs for BLOOM 7B and BLOOM 176B (fp16). These scripts might not work for other models or a different number of NPUs.
|
||||
|
||||
> Note: Sometimes NPU memory is not freed when an inference deployment crashes. You can free it by killing all remaining python processes in the terminal, as sketched below.
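A sketch of the cleanup, assuming the Ascend driver tools (`npu-smi`) are installed; it kills every remaining python process of the current user, so make sure nothing else important is running first:

```shell
npu-smi info                     # check which NPUs still hold memory
pkill -9 -u "$(whoami)" python   # kill leftover python processes of the current user
```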
|
||||
|
||||
## Citation
|
||||
|
||||
You may also consider citing the original work:
|
||||
|
||||
```
|
||||
@article{scao2022bloom,
|
||||
title={Bloom: A 176b-parameter open-access multilingual language model},
|
||||
author={Scao, Teven Le and Fan, Angela and Akiki, Christopher and Pavlick, Ellie and Ili{\'c}, Suzana and Hesslow, Daniel and Castagn{\'e}, Roman and Luccioni, Alexandra Sasha and Yvon, Fran{\c{c}}ois and Gall{\'e}, Matthias and others},
|
||||
journal={arXiv preprint arXiv:2211.05100},
|
||||
year={2022}
|
||||
}
|
||||
```
|
||||
\
|
||||
\
|
||||
<font size=1>If downloading a file with 'wget' fails, you can download it manually while ensuring website security.</font>
|
|
@ -1,48 +0,0 @@
|
|||
#!/bin/bash
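# Two-node Megatron inference launch for a BLOOM-176B-sized model (8 NPUs per node, TP=8, PP=2); set NODE_RANK per node.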
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export HCCL_OP_BASE_FFTS_MODE_ENABLE=TRUE
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
MASTER_ADDR=**.**.**.**
|
||||
MASTER_PORT=12890
|
||||
NNODES=2
|
||||
NPUS_PER_NODE=8
|
||||
NODE_RANK=1
|
||||
|
||||
VOCAB_FILE="your VOCAB FILE path"
|
||||
CHECKPOINT="your checkpoint path"
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
# Real script
|
||||
python -m torch.distributed.run $DISTRIBUTED_ARGS ./tasks/inference/inference_gpt.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--load ${CHECKPOINT} \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path ${VOCAB_FILE} \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--embed-layernorm \
|
||||
--position-embedding-type alibi \
|
||||
--num-layers 70 \
|
||||
--hidden-size 14336 \
|
||||
--num-attention-heads 112 \
|
||||
--max-position-embeddings 2048 \
|
||||
--seq-length 2048 \
|
||||
--micro-batch-size 1 \
|
||||
--init-method-std 0.0048 \
|
||||
--layernorm-epsilon 1e-6 \
|
||||
--fp16 \
|
||||
--no-load-optim \
|
||||
--no-load-rng \
|
||||
--no-add-gate \
|
||||
--add-bias-linear \
|
||||
--query-key-layer-scaling \
|
||||
--no-attention-softmax-in-fp32 \
|
||||
--no-untie-embeddings-and-output-weights
|
|
@ -1,61 +0,0 @@
|
|||
#!/bin/bash
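# Single-node DeepSpeed pipeline inference launch for a BLOOM-176B-sized model (8 NPUs, TP=8).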
|
||||
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
NNODES=1
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
ZERO_STAGE=0
|
||||
MICRO_BATCH_SIZE=1
|
||||
config_json="./ds_config.json"
|
||||
|
||||
cat <<EOT > $config_json
|
||||
{
|
||||
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
|
||||
"gradient_clipping": 1.0,
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE
|
||||
},
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 500,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1,
|
||||
"initial_scale_power": 12
|
||||
},
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
EOT
|
||||
|
||||
deepspeed --num_nodes $NNODES --num_gpus $NPUS_PER_NODE \
|
||||
./tasks/inference/inference_bloom_pipeline.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 70 \
|
||||
--hidden-size 14336 \
|
||||
--num-attention-heads 112 \
|
||||
--max-position-embeddings 2048 \
|
||||
--position-embedding-type alibi \
|
||||
--embed-layernorm \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--load "${CHECKPOINT}" \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--no-add-gate \
|
||||
--add-bias-linear \
|
||||
--query-key-layer-scaling \
|
||||
--no-attention-softmax-in-fp32 \
|
||||
--no-untie-embeddings-and-output-weights \
|
||||
--deepspeed \
|
||||
--deepspeed_config ${config_json}
|
|
@ -1,47 +0,0 @@
|
|||
#!/bin/bash
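# Single-node Megatron inference launch for a BLOOM-7B-sized model (8 NPUs, TP=8).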
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export HCCL_OP_BASE_FFTS_MODE_ENABLE=TRUE
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6000
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
|
||||
VOCAB_FILE="your VOCAB FILE path"
|
||||
CHECKPOINT="your checkpoint path"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_gpt.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 30 \
|
||||
--hidden-size 4096 \
|
||||
--num-attention-heads 32 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--embed-layernorm \
|
||||
--position-embedding-type alibi \
|
||||
--no-add-gate \
|
||||
--add-bias-linear \
|
||||
--query-key-layer-scaling \
|
||||
--no-attention-softmax-in-fp32 \
|
||||
--no-untie-embeddings-and-output-weights
|
|
@ -1,62 +0,0 @@
|
|||
#!/bin/bash
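# Single-node DeepSpeed pipeline inference launch for a BLOOM-7B-sized model (8 NPUs, TP=8).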
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
export INF_NAN_MODE_ENABLE=0
|
||||
|
||||
NNODES=1
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
ZERO_STAGE=0
|
||||
MICRO_BATCH_SIZE=1
|
||||
config_json="./ds_config.json"
|
||||
|
||||
cat <<EOT > $config_json
|
||||
{
|
||||
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
|
||||
"gradient_clipping": 1.0,
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE
|
||||
},
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 500,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1,
|
||||
"initial_scale_power": 12
|
||||
},
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
EOT
|
||||
|
||||
deepspeed --num_nodes $NNODES --num_gpus $NPUS_PER_NODE \
|
||||
./tasks/inference/inference_bloom_pipeline.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 30 \
|
||||
--hidden-size 4096 \
|
||||
--num-attention-heads 32 \
|
||||
--max-position-embeddings 2048 \
|
||||
--position-embedding-type alibi \
|
||||
--embed-layernorm \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--load "${CHECKPOINT}" \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--deepspeed \
|
||||
--deepspeed_config ${config_json} \
|
||||
--no-add-gate \
|
||||
--add-bias-linear \
|
||||
--query-key-layer-scaling \
|
||||
--no-attention-softmax-in-fp32 \
|
||||
--no-untie-embeddings-and-output-weights
|
|
@ -1,121 +0,0 @@
|
|||
#!/bin/bash
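# DeepSpeed pre-training launch for a BLOOM-176B-sized model: 12 nodes x 8 NPUs, TP=8, PP=12, bf16.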
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/home/anaconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1600
|
||||
# Enable memory reuse in INF_NAN mode can reduce memory usage and achieve lossless performance
|
||||
export MULTI_STREAM_MEMORY_REUSE=1
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# If this environment variable is set, all nodes will compile the dataset,
|
||||
# which is suitable for cluster training.
|
||||
export AZUREML_EXPERIMENT_ID=0
|
||||
|
||||
# output data path
|
||||
CHECKPOINT_PATH='./ckpt'
|
||||
TENSORBOARD_PATH='./tensorboard/'
|
||||
LOGS_PATH='./logs/'
|
||||
mkdir -p $LOGS_PATH
|
||||
|
||||
# train parameter
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=12890
|
||||
GPUS_PER_NODE=8
|
||||
NNODES=12
|
||||
NODE_RANK=0
|
||||
PP_SIZE=12
|
||||
TP_SIZE=8
|
||||
|
||||
MICRO_BATCH_SIZE=2
|
||||
GLOBAL_BATCH_SIZE=2048
|
||||
|
||||
NLAYERS=70
|
||||
NHIDDEN=14336
|
||||
NHEADS=112
|
||||
SEQ_LEN=2048
|
||||
|
||||
SAVE_INTERVAL=5000
|
||||
|
||||
TRAIN_SAMPLES=220_000_000 # 450B tokens
|
||||
LR_DECAY_SAMPLES=200_000_000 # Decay for the first 410B tokens then continue at fixed --min-lr
|
||||
LR_WARMUP_SAMPLES=183_105 # 375M tokens
|
||||
|
||||
# dataset path
|
||||
TOKENIZER_NAME_OR_PATH=/home/bloom_data/vocab_file/
|
||||
DATA_PATH=/home/bloom_data/oscar_data_1g/my-gpt2_text_document
|
||||
|
||||
ZERO_STAGE=0 # important: bf16 must use z0! it implements its own zero stage 1 equivalent
|
||||
config_json="./ds_config.json"
|
||||
|
||||
cat <<EOT > $config_json
|
||||
{
|
||||
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
|
||||
"train_batch_size": $GLOBAL_BATCH_SIZE,
|
||||
"gradient_clipping": 1.0,
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE
|
||||
},
|
||||
"bf16": {
|
||||
"enabled": true
|
||||
},
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
EOT
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
TRANSFORMERS_OFFLINE=1 \
|
||||
python -m torch.distributed.run $DISTRIBUTED_ARGS \
|
||||
pretrain_bloom.py \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--embed-layernorm \
|
||||
--tokenizer-name-or-path $TOKENIZER_NAME_OR_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--pad-vocab-size-to 250880 \
|
||||
--tensor-model-parallel-size $TP_SIZE \
|
||||
--pipeline-model-parallel-size $PP_SIZE \
|
||||
--num-layers $NLAYERS \
|
||||
--hidden-size $NHIDDEN \
|
||||
--num-attention-heads $NHEADS \
|
||||
--seq-length $SEQ_LEN \
|
||||
--max-position-embeddings $SEQ_LEN \
|
||||
--micro-batch-size $MICRO_BATCH_SIZE \
|
||||
--rampup-batch-size 192 16 9_765_625 \
|
||||
--global-batch-size $GLOBAL_BATCH_SIZE \
|
||||
--train-samples $TRAIN_SAMPLES \
|
||||
--normalization LayerNorm \
|
||||
--init-method-std 0.0048 \
|
||||
--bf16 \
|
||||
--seed 42 \
|
||||
--position-embedding-type alibi \
|
||||
--optimizer adam \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.95 \
|
||||
--adam-eps 1e-8 \
|
||||
--lr 6e-5 \
|
||||
--min-lr 6e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--lr-decay-samples $LR_DECAY_SAMPLES \
|
||||
--lr-warmup-samples $LR_WARMUP_SAMPLES \
|
||||
--clip-grad 1.0 \
|
||||
--weight-decay 1e-1 \
|
||||
--log-interval 1 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--save-interval $SAVE_INTERVAL \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 1 \
|
||||
--load $CHECKPOINT_PATH \
|
||||
--data-impl mmap \
|
||||
--distributed-backend nccl \
|
||||
--deepspeed \
|
||||
--deepspeed_config ${config_json} \
|
||||
--zero-stage ${ZERO_STAGE} \
|
||||
--deepspeed-activation-checkpointing \
|
||||
--sequence-parallel \
|
||||
--checkpoint-activations \
|
||||
--use-manual-layer-allocation \
|
||||
--manual-layers 5,6,6,6,6,6,6,6,6,6,6,5 \
|
||||
--no-add-gate \
|
||||
--add-bias-linear \
|
||||
--query-key-layer-scaling \
|
||||
--no-attention-softmax-in-fp32 \
|
||||
--no-untie-embeddings-and-output-weights
|
|
@ -1,115 +0,0 @@
|
|||
#!/bin/bash
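# Single-node DeepSpeed pre-training launch for a BLOOM-7B-sized model (8 NPUs, TP=8, fp16).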
|
||||
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export INF_NAN_MODE_ENABLE=0
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# output data path
|
||||
CHECKPOINT_PATH='./ckpt'
|
||||
LOGS_PATH='./logs/'
|
||||
mkdir -p $LOGS_PATH
|
||||
|
||||
# train parameter
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=5999
|
||||
GPUS_PER_NODE=8
|
||||
NNODES=1
|
||||
|
||||
MICRO_BATCH_SIZE=4
|
||||
GLOBAL_BATCH_SIZE=512
|
||||
|
||||
NLAYERS=30
|
||||
NHIDDEN=4096
|
||||
NHEADS=32
|
||||
SEQ_LEN=2048
|
||||
|
||||
SAVE_INTERVAL=250
|
||||
|
||||
TRAIN_SAMPLES=220_000_000 # 450B tokens
|
||||
LR_DECAY_SAMPLES=200_000_000 # Decay for the first 410B tokens then continue at fixed --min-lr
|
||||
LR_WARMUP_SAMPLES=183_105 # 375M tokens
|
||||
|
||||
# dataset path
|
||||
TOKENIZER_NAME_OR_PATH=./dataset/bloom_vocab/vocab_file/
|
||||
DATA_PATH=./dataset/oscar_data_1g/my-gpt2_text_document
|
||||
|
||||
ZERO_STAGE=0 # important: bf16 must use z0! it implements its own zero stage 1 equivalent
|
||||
config_json="./ds_config.json"
|
||||
|
||||
cat <<EOT > $config_json
|
||||
{
|
||||
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
|
||||
"train_batch_size": $GLOBAL_BATCH_SIZE,
|
||||
"gradient_clipping": 1.0,
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE
|
||||
},
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 500,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1,
|
||||
"initial_scale_power": 12
|
||||
},
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
EOT
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT --rdzv_backend c10d --max_restarts 0 --tee 3"
|
||||
|
||||
TRANSFORMERS_OFFLINE=1 \
|
||||
python -m torch.distributed.run $DISTRIBUTED_ARGS \
|
||||
pretrain_bloom.py \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--embed-layernorm \
|
||||
--tokenizer-name-or-path $TOKENIZER_NAME_OR_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--pad-vocab-size-to 250880 \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers $NLAYERS \
|
||||
--hidden-size $NHIDDEN \
|
||||
--num-attention-heads $NHEADS \
|
||||
--normalization LayerNorm \
|
||||
--seq-length $SEQ_LEN \
|
||||
--max-position-embeddings $SEQ_LEN \
|
||||
--micro-batch-size $MICRO_BATCH_SIZE \
|
||||
--rampup-batch-size 192 16 9_765_625 \
|
||||
--global-batch-size $GLOBAL_BATCH_SIZE \
|
||||
--train-samples $TRAIN_SAMPLES \
|
||||
--init-method-std 0.0048 \
|
||||
--fp16 \
|
||||
--seed 42 \
|
||||
--position-embedding-type alibi \
|
||||
--optimizer adam \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.95 \
|
||||
--adam-eps 1e-8 \
|
||||
--lr 1.2e-4 \
|
||||
--min-lr 6e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--lr-decay-samples $LR_DECAY_SAMPLES \
|
||||
--lr-warmup-samples $LR_WARMUP_SAMPLES \
|
||||
--clip-grad 1.0 \
|
||||
--weight-decay 1e-1 \
|
||||
--log-interval 1 \
|
||||
--save-interval $SAVE_INTERVAL \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 1 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--load $CHECKPOINT_PATH \
|
||||
--data-impl mmap \
|
||||
--deepspeed \
|
||||
--deepspeed_config ${config_json} \
|
||||
--zero-stage ${ZERO_STAGE} \
|
||||
--distributed-backend nccl \
|
||||
--no-add-gate \
|
||||
--add-bias-linear \
|
||||
--query-key-layer-scaling \
|
||||
--no-attention-softmax-in-fp32 \
|
||||
--sequence-parallel \
|
||||
--no-untie-embeddings-and-output-weights
|
||||
|
|
@ -1,86 +0,0 @@
|
|||
#!/bin/bash
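# Single-node pre-training launch for a BLOOM-7B-sized model without DeepSpeed (8 NPUs, TP=8, fp16).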
|
||||
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# output data path
|
||||
CHECKPOINT_PATH='./ckpt'
|
||||
LOGS_PATH='./logs/'
|
||||
mkdir -p $LOGS_PATH
|
||||
|
||||
# train parameter
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=5999
|
||||
GPUS_PER_NODE=8
|
||||
NNODES=1
|
||||
|
||||
MICRO_BATCH_SIZE=4
|
||||
GLOBAL_BATCH_SIZE=512
|
||||
|
||||
NLAYERS=30
|
||||
NHIDDEN=4096
|
||||
NHEADS=32
|
||||
SEQ_LEN=2048
|
||||
|
||||
SAVE_INTERVAL=250
|
||||
|
||||
TRAIN_SAMPLES=220_000_000 # 450B tokens
|
||||
LR_DECAY_SAMPLES=200_000_000 # Decay for the first 410B tokens then continue at fixed --min-lr
|
||||
LR_WARMUP_SAMPLES=183_105 # 375M tokens
|
||||
|
||||
# dataset path
|
||||
TOKENIZER_NAME_OR_PATH=/home/bloom_data/vocab_file/
|
||||
DATA_PATH=/home/bloom_data/oscar_data_1g/my-gpt2_text_document
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT --rdzv_backend c10d --max_restarts 0 --tee 3"
|
||||
|
||||
TRANSFORMERS_OFFLINE=1 \
|
||||
python -m torch.distributed.run $DISTRIBUTED_ARGS \
|
||||
pretrain_bloom.py \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--embed-layernorm \
|
||||
--tokenizer-name-or-path $TOKENIZER_NAME_OR_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--pad-vocab-size-to 250880 \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers $NLAYERS \
|
||||
--hidden-size $NHIDDEN \
|
||||
--num-attention-heads $NHEADS \
|
||||
--seq-length $SEQ_LEN \
|
||||
--max-position-embeddings $SEQ_LEN \
|
||||
--micro-batch-size $MICRO_BATCH_SIZE \
|
||||
--rampup-batch-size 192 16 9_765_625 \
|
||||
--global-batch-size $GLOBAL_BATCH_SIZE \
|
||||
--train-samples $TRAIN_SAMPLES \
|
||||
--init-method-std 0.0048 \
|
||||
--fp16 \
|
||||
--seed 42 \
|
||||
--position-embedding-type alibi \
|
||||
--optimizer adam \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.95 \
|
||||
--adam-eps 1e-8 \
|
||||
--lr 1.2e-4 \
|
||||
--min-lr 6e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--lr-decay-samples $LR_DECAY_SAMPLES \
|
||||
--lr-warmup-samples $LR_WARMUP_SAMPLES \
|
||||
--clip-grad 1.0 \
|
||||
--weight-decay 1e-1 \
|
||||
--log-interval 1 \
|
||||
--save-interval $SAVE_INTERVAL \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 1 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--load $CHECKPOINT_PATH \
|
||||
--data-impl mmap \
|
||||
--distributed-backend nccl \
|
||||
--sequence-parallel \
|
||||
--no-add-gate \
|
||||
--add-bias-linear \
|
||||
--query-key-layer-scaling \
|
||||
--no-attention-softmax-in-fp32 \
|
||||
--no-untie-embeddings-and-output-weights
|
||||
|
|
@ -1,116 +0,0 @@
|
|||
#!/bin/bash
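# Single-node DeepSpeed launch for tuning a BLOOM-7B-sized model on an instruction dataset with LoRA target modules (8 NPUs, TP=8, fp16).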
|
||||
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export INF_NAN_MODE_ENABLE=0
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# output data path
|
||||
CHECKPOINT_PATH='./ckpt'
|
||||
LOGS_PATH='./logs/'
|
||||
mkdir -p $LOGS_PATH
|
||||
|
||||
# train parameter
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=5999
|
||||
GPUS_PER_NODE=8
|
||||
NNODES=1
|
||||
|
||||
MICRO_BATCH_SIZE=4
|
||||
GLOBAL_BATCH_SIZE=512
|
||||
|
||||
NLAYERS=30
|
||||
NHIDDEN=4096
|
||||
NHEADS=32
|
||||
SEQ_LEN=2048
|
||||
|
||||
SAVE_INTERVAL=250
|
||||
|
||||
TRAIN_SAMPLES=220_000_000 # 450B tokens
|
||||
LR_DECAY_SAMPLES=200_000_000 # Decay for the first 410B tokens then continue at fixed --min-lr
|
||||
LR_WARMUP_SAMPLES=183_105 # 375M tokens
|
||||
|
||||
# dataset path
|
||||
TOKENIZER_NAME_OR_PATH=./dataset/bloom_vocab/vocab_file/
|
||||
DATA_PATH=./dataset/oscar_data_1g/my-gpt2_text_document
|
||||
|
||||
ZERO_STAGE=0 # important: bf16 must use z0! it implements its own zero stage 1 equivalent
|
||||
config_json="./ds_config.json"
|
||||
|
||||
cat <<EOT > $config_json
|
||||
{
|
||||
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
|
||||
"train_batch_size": $GLOBAL_BATCH_SIZE,
|
||||
"gradient_clipping": 1.0,
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE
|
||||
},
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 500,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1,
|
||||
"initial_scale_power": 12
|
||||
},
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
EOT
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT --rdzv_backend c10d --max_restarts 0 --tee 3"
|
||||
|
||||
TRANSFORMERS_OFFLINE=1 \
|
||||
python -m torch.distributed.run $DISTRIBUTED_ARGS \
|
||||
pretrain_bloom.py \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--is-instruction-dataset \
|
||||
--embed-layernorm \
|
||||
--tokenizer-name-or-path $TOKENIZER_NAME_OR_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--pad-vocab-size-to 250880 \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers $NLAYERS \
|
||||
--hidden-size $NHIDDEN \
|
||||
--num-attention-heads $NHEADS \
|
||||
--normalization LayerNorm \
|
||||
--seq-length $SEQ_LEN \
|
||||
--max-position-embeddings $SEQ_LEN \
|
||||
--micro-batch-size $MICRO_BATCH_SIZE \
|
||||
--rampup-batch-size 192 16 9_765_625 \
|
||||
--global-batch-size $GLOBAL_BATCH_SIZE \
|
||||
--train-samples $TRAIN_SAMPLES \
|
||||
--init-method-std 0.0048 \
|
||||
--fp16 \
|
||||
--seed 42 \
|
||||
--position-embedding-type alibi \
|
||||
--optimizer adam \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.95 \
|
||||
--adam-eps 1e-8 \
|
||||
--lr 1.2e-4 \
|
||||
--min-lr 6e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--lr-decay-samples $LR_DECAY_SAMPLES \
|
||||
--lr-warmup-samples $LR_WARMUP_SAMPLES \
|
||||
--clip-grad 1.0 \
|
||||
--weight-decay 1e-1 \
|
||||
--log-interval 1 \
|
||||
--save-interval $SAVE_INTERVAL \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 1 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--load $CHECKPOINT_PATH \
|
||||
--data-impl mmap \
|
||||
--deepspeed \
|
||||
--deepspeed_config ${config_json} \
|
||||
--zero-stage ${ZERO_STAGE} \
|
||||
--distributed-backend nccl \
|
||||
--no-add-gate \
|
||||
--add-bias-linear \
|
||||
--query-key-layer-scaling \
|
||||
--no-attention-softmax-in-fp32 \
|
||||
--sequence-parallel \
|
||||
--lora-target-modules query_key_value dense \
|
||||
--no-untie-embeddings-and-output-weights
|
|
@ -1,61 +0,0 @@
|
|||
# This is an example: train gpt using PTD,
|
||||
# the number of parameters is not aligned
|
||||
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
GPUS_PER_NODE=8
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
|
||||
|
||||
DATA_PATH=./dataset/enwiki-gpt/gpt_text_sentence
|
||||
CHECKPOINT_PATH=./ckpt
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_gpt.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 2 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--num-layers 8 \
|
||||
--hidden-size 4096 \
|
||||
--num-attention-heads 32 \
|
||||
--micro-batch-size 2 \
|
||||
--global-batch-size 16 \
|
||||
--seq-length 2048 \
|
||||
--max-position-embeddings 2048 \
|
||||
--train-iters 500000 \
|
||||
--lr-decay-iters 320000 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--load $CHECKPOINT_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--vocab-file ./dataset/gpt2-vocab.json \
|
||||
--merge-file ./dataset/gpt2-merges.txt \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 0.00015 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-5 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .01 \
|
||||
--recompute-method block \
|
||||
--recompute-granularity full \
|
||||
--no-add-gate \
|
||||
--attention-dropout 0.1 \
|
||||
--hidden-dropout 0.1 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--add-bias-linear \
|
||||
--query-key-layer-scaling \
|
||||
--no-attention-softmax-in-fp32 \
|
||||
--no-untie-embeddings-and-output-weights \
|
||||
--fp16 | tee logs/train.log
|
|
@ -1,325 +0,0 @@
|
|||
# Intern-LM
|
||||
<p align="left">
|
||||
<b>简体中文</b> |
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/intern/README_en.md">English</a> </b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
# 目录
|
||||
|
||||
- [Internlm-7B](#internlm-7b)
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
- [推理](#推理)
|
||||
- [评估](#评估)
|
||||
- [Internlm-65B](#internlm-65b)
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
|
||||
# InternLM-7B
|
||||
|
||||
## 训练
|
||||
|
||||
InternLM-7B 训练的硬件配置如下:
|
||||
|
||||
| 硬件 | 配置 |
|
||||
|:---:|:---------------:|
|
||||
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
|
||||
### 脚本
|
||||
|
||||
1. 拷贝仓库到本地服务器
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. 搭建环境
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
# 安装 torch 和 torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.1.0.post20231124_cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
|
||||
# 安装 apex
|
||||
pip install apex-0.1_ascend_20231124-cp38-cp38-linux_aarch64.whl
|
||||
# 安装 megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
# 安装其余依赖包
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. 下载 Internlm-7B [词表文件](https://huggingface.co/internlm/internlm-7b/tree/main)
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
mkdir -p dataset/internlm
|
||||
cd ./dataset/internlm
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
|
||||
cd ../..
|
||||
```
|
||||
|
||||
4. 下载 Internlm-7B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
cd dataset/
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
```
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./dataset/internlm \
|
||||
--output-prefix ./dataset/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--handler-name AlpacaPretrainHandler \
|
||||
--tokenizer-not-use-fast \
|
||||
--append-eod
|
||||
```
|
||||
|
||||
5. 权重格式转换
|
||||
|
||||
下载 Internlm-7B [权重](https://huggingface.co/internlm/internlm-7b/tree/main)
|
||||
|
||||
```text
|
||||
# 请注意,如果要加载huggingface的预训练权重,需要修改一个deepspeed关于加载权重的bug:
|
||||
# 在 `<deepspeed-installed-path>/runtime/engine.py` 文件里的 `_load_zero_checkpoint` 函数,
|
||||
# 将 `if zero_sd_list is None` 改为 `if zero_sd_list is None or len(zero_sd_list) == 0`
|
||||
|
||||
# 原始 deepspeed/runtime/engine.py, 大概 #Lines2746-2748
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None:
|
||||
return False
|
||||
|
||||
# 修改后
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None or len(zero_sd_list) == 0:
|
||||
return False
|
||||
```
|
||||
|
||||
```shell
|
||||
mkdir model_from_hf
|
||||
cd ./model_from_hf
|
||||
# 必须安装 git-lfs
|
||||
git clone https://huggingface.co/internlm/internlm-7b
|
||||
cd ..
|
||||
```
|
||||
|
||||
将模型权重从 huggingface 格式转换为 AscendSpeed 可以处理的格式
|
||||
```shell
|
||||
mkdir model_weights
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./model_from_hf/internlm-7b/ \
|
||||
--output-model-dir ./model_weights \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--bias \
|
||||
--deepspeed
|
||||
```
|
||||
|
||||
6. 配置 Internlm-7B 预训练脚本
|
||||
|
||||
```shell
|
||||
# 修改 ascend-toolkit 路径
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
# 修改数据集,词表,权重等路径
|
||||
TOKENIZER_PATH=./dataset/internlm #tokenizer path
|
||||
DATA=./dataset/alpaca_text_document #processed dataset
|
||||
CHECKPOINT=./model_weights/
|
||||
```
|
||||
|
||||
7. 启动 Internlm-7B 预训练脚本
|
||||
|
||||
```shell
|
||||
bash examples/intern/pretrain_internlm_7b_zero.sh
|
||||
```
|
||||
|
||||
|
||||
### 性能
|
||||
|
||||
#### 吞吐
|
||||
|
||||
Internlm-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
|
||||
|
||||
| 设备 | 模型 | 总迭代数 | 样本吞吐 (samples/p/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
|
||||
|----|-------------|------|--------------------|----------------------|-----------------|------------------|
|
||||
| NPUs | Internlm-7B | 2048 | 13.000 | 2943 | 19684.6 | 145.69 |
|
||||
| 参考 | Internlm-7B | - | - | 4078 | - | - |
|
||||
|
||||
|
||||
|
||||
#### 精度
|
||||
|
||||
NPU vs 参考 (无预训练权重) loss 对比和相对误差
|
||||

|
||||
|
||||
NPU vs 参考 (有预训练权重) loss 对比和相对误差
|
||||

|
||||
|
||||
#### 推理
|
||||
<a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/intern/generate_internlm_7b_deepspeed.sh">推理脚本</a>:
|
||||
examples/intern/generate_internlm_7b_deepspeed.sh
|
||||
```
|
||||
bash examples/intern/generate_internlm_7b_deepspeed.sh
|
||||
```
|
||||
推理举例:
|
||||

|
||||
|
||||
#### 评估
|
||||
<a href="https://gitee.com/ascend/AscendSpeed/blob/master/tasks/evaluation/eval_internlm.sh">评估脚本</a>:
|
||||
tasks/evaluation/eval_internlm.sh
|
||||
```
|
||||
bash tasks/evaluation/eval_internlm.sh
|
||||
```
|
||||
InternLM-7B在**Ascend NPU**中的评测表现:
|
||||
|
||||
| 任务 | 模型 | 昇腾值 | 社区值 |
|
||||
|-----------------------------------------------------|-----------|------|------|
|
||||
| [MMLU](https://huggingface.co/datasets/cais/mmlu)   | InternLM-7B  | 48.8 | [51.0](https://huggingface.co/internlm/internlm-7b) |
|
||||
|
||||
# InternLM-65B
|
||||
|
||||
## 训练
|
||||
|
||||
InternLM-65B 训练的硬件配置如下:
|
||||
|
||||
| 硬件 | 配置 |
|
||||
|:---:|:---------------:|
|
||||
| NPU | 32 x Ascend NPUs |
|
||||
|
||||
|
||||
### 脚本
|
||||
|
||||
1. 拷贝仓库到本地服务器
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. 搭建环境
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
# 安装 torch 和 torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.1.0.post20231124_cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
|
||||
# 安装 apex
|
||||
pip install apex-0.1_ascend_20231124-cp38-cp38-linux_aarch64.whl
|
||||
# 安装 megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
# 安装其余依赖包
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. 下载 [词表文件](https://huggingface.co/internlm/internlm-7b/tree/main)
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
mkdir -p dataset/internlm
|
||||
cd ./dataset/internlm
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
|
||||
cd ../..
|
||||
```
|
||||
|
||||
4. 下载 Internlm-65B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
cd dataset/
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
```
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./dataset/internlm \
|
||||
--output-prefix ./dataset/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--handler-name AlpacaPretrainHandler \
|
||||
--tokenizer-not-use-fast \
|
||||
--append-eod
|
||||
```
|
||||
|
||||
5. 配置 Internlm-65B 预训练脚本
|
||||
|
||||
```shell
|
||||
# 修改 ascend-toolkit 路径
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
# 修改数据集,词表,权重等路径
|
||||
TOKENIZER_PATH=./dataset/internlm #tokenizer path
|
||||
DATA=./dataset/alpaca_text_document #processed dataset
|
||||
CHECKPOINT=./model_weights/
|
||||
```
|
||||
|
||||
6. 启动 Internlm-65B 预训练脚本
|
||||
|
||||
```shell
|
||||
bash examples/intern/pretrain_internlm_65b_zero.sh
|
||||
```
|
||||
|
||||
|
||||
### 性能
|
||||
|
||||
#### 吞吐
|
||||
|
||||
Internlm-65B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
|
||||
|
||||
| 设备 | 模型 | 总迭代数 | 样本吞吐 (samples/p/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
|
||||
|----|-------------|------|--------------------|----------------------|-----------------|------------------|
|
||||
| NPUs | Internlm-65B | 50000 | 5.33 | 342 | 24 | 137.8 |
|
||||
| Reference | Internlm-65B | - | - | 414 | - | - |
|
||||
|
||||
|
||||
|
||||
#### 精度
|
||||
|
||||
NPU vs 参考 (无预训练权重) loss 对比和相对误差
|
||||

|
|
@ -1,340 +0,0 @@
|
|||
# Intern-LM
|
||||
<p align="left">
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/intern/README.md">简体中文</a></b> |
|
||||
<b>English</b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
# Contents
|
||||
|
||||
- [Contents](#contents)
|
||||
- [Internlm-7B](#internlm-7b)
|
||||
- [Training](#training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
- [Machine performance](#machine-performance)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss)
|
||||
- [Inference](#Inference)
|
||||
- [Evaluation](#Evaluation)
|
||||
- [Contents](#contents)
|
||||
- [Internlm-65B](#internlm-65b)
|
||||
- [Training](#training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
- [Machine performance](#machine-performance)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss)
|
||||
|
||||
# InternLM-7B
|
||||
|
||||
## Training
|
||||
|
||||
Here's a hardware summary of pre-training InternLM-7B:
|
||||
|
||||
| Hardware | Value |
|
||||
| :------: | :---------------------------------------------: |
|
||||
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
### Script
|
||||
|
||||
1. Clone the repository to your local server:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. Build environment
|
||||
|
||||
```bash
|
||||
# python3.8
conda create -n test python=3.8
|
||||
conda activate test
|
||||
# install torch and torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.1.0.post20231124_cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
|
||||
# install apex
|
||||
pip install apex-0.1_ascend_20231124-cp38-cp38-linux_aarch64.whl
|
||||
# install megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
# install other packages
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
*Note that if you want to train with weights from huggingface, you need to fix a DeepSpeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`*
|
||||
|
||||
```python
|
||||
# original deepspeed/runtime/engine.py, about #Lines2746-2748
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None:
|
||||
return False
|
||||
|
||||
# modified
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None or len(zero_sd_list) == 0:
|
||||
return False
|
||||
```
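A sketch of applying this patch in place (back up `engine.py` first, and confirm both the resolved path and that the pattern only occurs inside `_load_zero_checkpoint` in your installation):

```shell
# Locate the installed DeepSpeed engine and widen the zero-checkpoint check.
DS_ENGINE=$(python -c "import deepspeed, os; print(os.path.join(os.path.dirname(deepspeed.__file__), 'runtime', 'engine.py'))")
cp "${DS_ENGINE}" "${DS_ENGINE}.bak"
sed -i 's/if zero_sd_list is None:/if zero_sd_list is None or len(zero_sd_list) == 0:/' "${DS_ENGINE}"
```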
|
||||
3. Download the Internlm-7B tokenizer model and file from [here](https://huggingface.co/internlm/internlm-7b/tree/main)
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
mkdir -p dataset/internlm
|
||||
cd ./dataset/internlm
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
|
||||
cd ../..
|
||||
```
|
||||
|
||||
4. Prepare dataset. Download the Internlm-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
cd dataset/
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
```
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./dataset/internlm \
|
||||
--output-prefix ./dataset/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--handler-name AlpacaPretrainHandler \
|
||||
--tokenizer-not-use-fast \
|
||||
--append-eod
|
||||
```
|
||||
|
||||
5. Weights convert
|
||||
|
||||
Download the Internlm-7B checkpoint from [here](https://huggingface.co/internlm/internlm-7b/tree/main)
|
||||
```shell
|
||||
mkdir model_from_hf
|
||||
cd ./model_from_hf
|
||||
# you must install git-lfs
|
||||
git clone https://huggingface.co/internlm/internlm-7b
|
||||
cd ..
|
||||
```
|
||||
|
||||
To adapt to the InternLM-7B model, the following script is used to convert the pre-trained weights into a format AscendSpeed can load.
|
||||
```shell
|
||||
mkdir model_weights
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./model_from_hf/internlm-7b/ \
|
||||
--output-model-dir ./model_weights \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--bias \
|
||||
--deepspeed
|
||||
```
|
||||
|
||||
6. Config Internlm-7B pre-training script.
|
||||
|
||||
```shell
|
||||
# modify the script according to your own ascend-toolkit path
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
# modify script orign dataset path according to your own dataset path
|
||||
TOKENIZER_PATH=./dataset/internlm #tokenizer path
|
||||
DATA=./dataset/alpaca_text_document #processed dataset
|
||||
CHECKPOINT=./model_weights/
|
||||
```
|
||||
|
||||
7. Launch Internlm-7B pre-training script.
|
||||
|
||||
```shell
|
||||
bash examples/intern/pretrain_internlm_7b_zero.sh
|
||||
```
|
||||
|
||||
|
||||
### Performance
|
||||
|
||||
#### Machine performance
|
||||
|
||||
The performance of Internlm-7B in **Ascend NPU** and **Reference**:
|
||||
|
||||
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|
||||
|----|-------------|------|--------------------|----------------------|-----------------|------------------|
|
||||
| NPUs | Internlm-7B | 2048 | 13.000 | 2943 | 19684.6 | 145.69 |
|
||||
| benchmarks | Internlm-7B | - | - | 4078 | - | - |
|
||||
|
||||
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
NPU vs GPU loss.
|
||||
|
||||
The NPU runs smoothly and resource usage is stable; no errors are reported during the run, the loss decreases steadily, and the convergence speed is as expected. The relative error of the average loss is 0.007 (below the 2% threshold), the maximum relative error is 0.09, and the maximum absolute error is 0.4859. The precision meets the requirements.
|
||||
|
||||
NPU vs GPU without weights loss and relative error.
|
||||

|
||||
|
||||
NPU vs GPU with weights loss and relative error.
|
||||

|
||||
|
||||
#### Inference
|
||||
<a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/intern/generate_internlm_7b_deepspeed.sh">Inference script</a>:
|
||||
examples/intern/generate_internlm_7b_deepspeed.sh
|
||||
```
|
||||
bash examples/intern/generate_internlm_7b_deepspeed.sh
|
||||
```
|
||||
|
||||
Inference case:
|
||||

|
||||
|
||||
#### Evaluation
|
||||
<a href="https://gitee.com/ascend/AscendSpeed/blob/master/tasks/evaluation/eval_internlm.sh">Evaluation script</a>:
|
||||
tasks/evaluation/eval_internlm.sh
|
||||
```
|
||||
bash tasks/evaluation/eval_internlm.sh
|
||||
```
|
||||
The evaluation performance of InternLM-7B in **Ascend NPU**:
|
||||
|
||||
| Task | Model | NPU | Benchmark |
|
||||
|-----------------------------------------------------|-----------|------|------|
|
||||
| [MMLU](https://huggingface.co/datasets/cais/mmlu)   | InternLM-7B  | 48.8 | [51.0](https://huggingface.co/internlm/internlm-7b) |
|
||||
|
||||
# InternLM-65B
|
||||
|
||||
## Training
|
||||
|
||||
Here's a hardware summary of pre-training InternLM-65B:
|
||||
|
||||
| Hardware | Value |
|
||||
| :------: | :---------------------------------------------: |
|
||||
| NPU | 32 x Ascend NPUs |
|
||||
|
||||
### Script
|
||||
|
||||
1. Clone the repository to your local server:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. Build environment
|
||||
|
||||
```bash
|
||||
# python3.8
conda create -n test python=3.8
|
||||
conda activate test
|
||||
# install torch and torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.1.0.post20231124_cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
|
||||
# install apex
|
||||
pip install apex-0.1_ascend_20231124-cp38-cp38-linux_aarch64.whl
|
||||
# install megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
# install other packages
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
*Note that if you want to train with weights from huggingface, you need to fix a DeepSpeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`*
|
||||
|
||||
```python
|
||||
# original deepspeed/runtime/engine.py, about #Lines2746-2748
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None:
|
||||
return False
|
||||
|
||||
# modified
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None or len(zero_sd_list) == 0:
|
||||
return False
|
||||
```
|
||||
3. Download tokenizer model and file from [here](https://huggingface.co/internlm/internlm-7b/tree/main)
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
mkdir -p dataset/internlm
|
||||
cd ./dataset/internlm
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
|
||||
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
|
||||
cd ../..
|
||||
```
|
||||
|
||||
4. Prepare dataset. Download the Internlm-65B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
cd dataset/
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
```
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./dataset/internlm \
|
||||
--output-prefix ./dataset/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--handler-name AlpacaPretrainHandler \
|
||||
--tokenizer-not-use-fast \
|
||||
--append-eod
|
||||
```
|
||||
|
||||
5. Config Internlm-65B pre-training script.
|
||||
|
||||
```shell
|
||||
# modify the script according to your own ascend-toolkit path
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
# modify script orign dataset path according to your own dataset path
|
||||
TOKENIZER_PATH=./dataset/internlm #tokenizer path
|
||||
DATA=./dataset/alpaca_text_document #processed dataset
|
||||
CHECKPOINT=./model_weights/
|
||||
```
|
||||
|
||||
6. Launch Internlm-65B pre-training script.
|
||||
|
||||
```shell
|
||||
bash examples/intern/pretrain_internlm_65b_zero.sh
|
||||
```
|
||||
|
||||
|
||||
### Performance
|
||||
|
||||
#### Machine performance
|
||||
|
||||
The performance of Internlm-65B in **Ascend NPU** and **Reference**:
|
||||
|
||||
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|
||||
|----|-------------|------|--------------------|----------------------|-----------------|------------------|
|
||||
| NPUs | Internlm-65B | 50000 | 5.33 | 342 | 24 | 137.8 |
|
||||
| Reference | Internlm-65B | - | - | 414 | - | - |
|
||||
|
||||
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
NPU vs GPU without weights loss and relative error.
|
||||

|
||||
\
|
||||
\
|
||||
<font size=1>If downloading a file with 'wget' fails, you can download it manually while ensuring website security.</font>
|
|
@ -1,57 +0,0 @@
|
|||
#!/bin/bash
|
||||
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
NNODES=1
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
CHECKPOINT="Your checkpoint path"
|
||||
VOCAB_FILE="Your vocab path"
|
||||
|
||||
ZERO_STAGE=0
|
||||
MICRO_BATCH_SIZE=1
|
||||
config_json="./ds_config.json"
|
||||
|
||||
cat <<EOT > $config_json
|
||||
{
|
||||
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
|
||||
"gradient_clipping": 1.0,
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE
|
||||
},
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 1000,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1,
|
||||
"initial_scale_power": 8
|
||||
},
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
EOT
|
||||
|
||||
deepspeed --num_nodes $NNODES --num_gpus $NPUS_PER_NODE \
|
||||
./tasks/inference/inference_llama.py \
|
||||
--row-col-parallel-linear-bias \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--num-layers 32 \
|
||||
--hidden-size 4096 \
|
||||
--ffn-hidden-size 11008 \
|
||||
--num-attention-heads 32 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--load "${CHECKPOINT}" \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size ${MICRO_BATCH_SIZE} \
|
||||
--seq-length 2048 \
|
||||
--max-new-tokens 64 \
|
||||
--seed 42 \
|
||||
--deepspeed \
|
||||
--deepspeed_config ${config_json} \
|
||||
--no-pipeline-parallel \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm
|
|
@ -1,68 +0,0 @@
|
|||
# This is an example: train InternLM-65B using PTD.
|
||||
|
||||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=4
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
TOKENIZER_PATH=./internlm-65b-hf
|
||||
DATA_PATH=./dataset/alpaca_text_document
|
||||
CHECKPOINT_PATH=./ckpt
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_intern.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 4 \
|
||||
--num-layers 80 \
|
||||
--hidden-size 8192 \
|
||||
--ffn-hidden-size 22016 \
|
||||
--num-attention-heads 64 \
|
||||
--micro-batch-size 2 \
|
||||
--global-batch-size 128 \
|
||||
--seq-length 2048 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--max-position-embeddings 2048 \
|
||||
--train-iters 50000 \
|
||||
--lr-decay-iters 320000 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--load $CHECKPOINT_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 1e-4 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-5 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .01 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--auto-recompute-device-size 55296 \
|
||||
--initial-loss-scale 524288.0 \
|
||||
--optimizer adam \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.95 \
|
||||
--adam-eps 1e-8 \
|
||||
--clip-grad 1.0 \
|
||||
--use-flash-attn \
|
||||
--use-fused-rotary-pos-emb \
|
||||
--sequence-parallel \
|
||||
--mlp-layer-fusion \
|
||||
--bf16 | tee logs/train.log
|
|
@ -1,101 +0,0 @@
|
|||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export INF_NAN_MODE_ENABLE=1
|
||||
GPUS_PER_NODE=8
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6000
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
|
||||
|
||||
DATA=../../datasets/dataset/alpaca_gpu/alapaca_packed_input_ids_document
|
||||
CHECKPOINT=../../model/model_weights/
|
||||
|
||||
DS_CONFIG=ds_config.json
|
||||
ZERO_STAGE=2
|
||||
GLOBAL_BATCH=256
|
||||
MICRO_BATCH=2
|
||||
|
||||
cat <<EOT > $DS_CONFIG
|
||||
{
|
||||
"fp16": {
|
||||
"enabled": false,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 1000,
|
||||
"initial_scale_power": 8,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1
|
||||
},
|
||||
|
||||
"bf16": {
|
||||
"enabled": true
|
||||
},
|
||||
|
||||
"optimizer": {
|
||||
"type": "Adam"
|
||||
},
|
||||
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE,
|
||||
"allgather_partitions": true,
|
||||
"allgather_bucket_size": 5e8,
|
||||
"overlap_comm": true,
|
||||
"reduce_scatter": true,
|
||||
"reduce_bucket_size": 5e8,
|
||||
"contiguous_gradients": true
|
||||
},
|
||||
|
||||
"gradient_accumulation_steps": 16,
|
||||
"train_batch_size": $GLOBAL_BATCH,
|
||||
"train_micro_batch_size_per_gpu":$MICRO_BATCH,
|
||||
"zero_allow_untested_optimizer": true
|
||||
}
|
||||
EOT
|
||||
|
||||
ds_args=""
|
||||
ds_args=" --deepspeed ${ds_args}"
|
||||
ds_args=" --no-pipeline-parallel ${ds_args}"
|
||||
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
|
||||
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
|
||||
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
|
||||
|
||||
|
||||
deepspeed pretrain_intern.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 32 \
|
||||
--hidden-size 4096 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--ffn-hidden-size 11008 \
|
||||
--num-attention-heads 32 \
|
||||
--micro-batch-size $MICRO_BATCH \
|
||||
--global-batch-size $GLOBAL_BATCH \
|
||||
--seq-length 2048 \
|
||||
--max-position-embeddings 2048 \
|
||||
--train-iters 500000 \
|
||||
--lr-decay-iters 320000 \
|
||||
--data-path $DATA \
|
||||
--load $CHECKPOINT \
|
||||
--tokenizer-name-or-path ../../datasets/dataset/internlm \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 0.00015 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-5 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .01 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--use-flash-attn \
|
||||
--auto-recompute-device-size 46080 \
|
||||
--use-fused-rmsnorm \
|
||||
--use-fused-rotary-pos-emb \
|
||||
$ds_args \
|
||||
--bf16 | tee logs/train.log
|
|
@ -1,617 +0,0 @@
|
|||
# LLaMA-7B/13B
|
||||
<p align="left">
|
||||
<b>简体中文</b> |
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/llama/README_en.md">English</a></b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
|
||||
|
||||
|
||||
- [LLaMA-7B/13B](#LLaMA-7B/13B)
|
||||
- [训练](#训练)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
- [推理](#推理)
|
||||
- [脚本](#脚本)
|
||||
- [使用基线数据集进行评估](#使用基线数据集进行评估)
|
||||
- [LLaMA-33B/65B](#LLaMA-33B/65B)
|
||||
- [训练](#训练)
|
||||
- [数据集](#数据集)
|
||||
- [脚本](#脚本)
|
||||
- [性能](#性能)
|
||||
- [吞吐](#吞吐)
|
||||
- [精度](#精度)
|
||||
- [推理](#推理)
|
||||
- [脚本](#脚本)
|
||||
- [使用基线数据集进行评估](#使用基线数据集进行评估)
|
||||
- [引用](#引用)
|
||||
|
||||
## 训练
|
||||
|
||||
LLaMA-7B/13B 训练的硬件配置如下:
|
||||
|
||||
| 硬件 | 设备 | 配置 |
|
||||
|:----:|:------:|:----------------:|
|
||||
| NPU | A+K | 8 x Ascend NPUs |
|
||||
| NPU | A+X | 16 x Ascend NPUs |
|
||||
|
||||
|
||||
### 脚本
|
||||
|
||||
1. 拷贝仓库到个人服务器
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. 搭建环境
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
# 安装 torch 和 torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
|
||||
pip install torch_npu-2.1.0.post5_XXXXXX-cp38-cp38m-linux_aarch64.whl
|
||||
# 安装 megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
# 安装其余依赖包
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. 下载 LLaMA-7B/13B [词表文件](https://huggingface.co/ruibin-wang/llama-7b-hf/tree/main)
|
||||
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
mkdir -p dataset/llama
|
||||
cd ./dataset/llama
|
||||
wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/config.json
wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/generation_config.json
wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/special_tokens_map.json
wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/tokenizer.model
wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/tokenizer_config.json
|
||||
cd ..
|
||||
```
|
||||
|
||||
|
||||
4. 下载 LLaMA-7B/13B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
cd dataset/
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
```
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./dataset/llama \
|
||||
--output-prefix ./dataset/llama \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--handler-name AlpacaPretrainHandler \
|
||||
--tokenizer-not-use-fast \
|
||||
--append-eod
|
||||
```
|
||||
|
||||
5. 权重转换
|
||||
|
||||
```text
|
||||
# 请注意,如果要加载huggingface的预训练权重,需要修改一个deepspeed关于加载权重的bug:
|
||||
# 在 `<deepspeed-installed-path>/runtime/engine.py` 文件里的 `_load_zero_checkpoint` 函数,
|
||||
# 将 `if zero_sd_list is None` 改为 `if zero_sd_list is None or len(zero_sd_list) == 0`
|
||||
|
||||
# 原始 deepspeed/runtime/engine.py, 大概 #Lines2746-2748
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None:
|
||||
return False
|
||||
|
||||
# 修改后
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None or len(zero_sd_list) == 0:
|
||||
return False
|
||||
```
|
||||
|
||||
|
||||
下载 LLaMA-7B [权重](https://huggingface.co/ruibin-wang/llama-7b-hf/tree/main) 或 LLaMA-13B [权重](https://huggingface.co/ruibin-wang/llama-13b-hf/tree/main)
|
||||
```shell
|
||||
mkdir model_from_hf
|
||||
cd ./model_from_hf
|
||||
# 需要安装 git-lfs
|
||||
git clone https://huggingface.co/ruibin-wang/llama-7b-hf
|
||||
cd ..
|
||||
```
|
||||
or
|
||||
|
||||
```shell
|
||||
mkdir model_from_hf
|
||||
cd ./model_from_hf
|
||||
# 需要安装 git-lfs
|
||||
git clone https://huggingface.co/ruibin-wang/llama-13b-hf
|
||||
cd ..
|
||||
```
|
||||
|
||||
将模型权重文件从 huggingface 格式转化为 AscendSpeed 格式
|
||||
|
||||
LLaMA-7B
|
||||
```shell
|
||||
mkdir model_weights
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./model_from_hf/llama-7b/ \
|
||||
--output-model-dir ./model_weights/llama-7b \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--deepspeed
|
||||
```
|
||||
|
||||
LLaMA-13B
|
||||
```shell
|
||||
# 单机八卡
|
||||
mkdir model_weights
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./model_from_hf/llama-13b/ \
|
||||
--output-model-dir ./model_weights/llama-13b \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 8 \
|
||||
--type 13B
|
||||
|
||||
# 单机16卡
|
||||
mkdir model_weights
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./model_from_hf/llama-13b/ \
|
||||
--output-model-dir ./model_weights/llama-13b \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--type 13B
|
||||
```
|
||||
|
||||
6. 配置 LLaMA-7B/13B 预训练脚本
|
||||
|
||||
```shell
|
||||
# 设置 ascend-toolkit 路径
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
# 修改数据集路径,权重路径,词表路径等
|
||||
TOKENIZER_PATH=./dataset/llama #tokenizer 路径
|
||||
DATA=./dataset/llama_text_document #数据集 路径
|
||||
CHECKPOINT=./model_weights/
|
||||
|
||||
# 如果不需要加载权重,就移除 `--load` 参数
|
||||
# 如果是指令数据集,请添加 `--is-instruction-dataset` 参数,否则请移除该参数
|
||||
```
|
||||
|
||||
7. 启动 LLaMA-7B/13B 预训练脚本
|
||||
|
||||
LLaMA-7B
|
||||
```shell
|
||||
bash examples/llama/pretrain_llama_7B_zero_8p.sh
|
||||
```
|
||||
|
||||
LLaMA-13B
|
||||
```shell
|
||||
# 单机8卡
|
||||
bash examples/llama/pretrain_llama_13B_ptd_8p.sh
|
||||
# 单机16卡
|
||||
bash examples/llama/pretrain_llama_13B_ptd_16p.sh
|
||||
```
|
||||
|
||||
### 性能
|
||||
|
||||
#### 吞吐
|
||||
|
||||
LLaMA-7B/13B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
|
||||
|
||||
| 设备 | 硬件 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
|
||||
|------|-----------|-----------|------|--------------------|----------------------|-----------------|------------------|
|
||||
| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.83 | 3763 | 4.35 | 159.9 |
|
||||
| 参考 | - | LLaMA-7B | 2048 | 1.85 | 3804 | 4.31 | 161.5 |
|
||||
| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.925 | 1894 | 17.27 | 205.57 |
|
||||
| NPUs | 910 1*16p | LLaMA-13B | 2048 | 0.88 | 1800 | 36.32 | 195.58 |
|
||||
| 参考 | - | LLaMA-13B | 2048 | 0.98 | 2012 | 16.33 | 217.37 |
|
||||
|
||||
|
||||
|
||||
|
||||
#### 精度
|
||||
|
||||
LLama-7b NPU vs 参考 loss.
|
||||
|
||||

|
||||
|
||||
LLama-13b NPU vs 参考 loss.
|
||||
|
||||

|
||||
|
||||
## 推理
|
||||
|
||||
我们支持使用 LLaMA-7B 和 LLaMA-13B 进行文本生成的推理。
|
||||
推理与预训练不同,比如我们需要加载预训练权重和输出样本的长度:
|
||||
|
||||
配置LLaMA-7B推理脚本`examples/llama/generate_llama_7B_deepspeed.sh`和LLaMA-13B推理脚本`examples/llama/generate_llama_13B_tp1_pp8.sh`。
|
||||
|
||||
```shell
|
||||
# 修改模型权重路径和分词器路径
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
LLaMA-7B:
|
||||
```shell
|
||||
bash ./examples/llama/generate_llama_7B_deepspeed.sh
|
||||
```
|
||||
|
||||
LLaMA-13B:
|
||||
```shell
|
||||
bash ./examples/llama/generate_llama_13B_tp1_pp8.sh
|
||||
```
|
||||
|
||||
部分推理样本如下:
|
||||
|
||||
LLaMA-7B:
|
||||
|
||||

|
||||
|
||||
LLaMA-13B:
|
||||
|
||||

|
||||
|
||||
|
||||
## 使用基线数据集进行评估
|
||||
|
||||
我们使用 BBH benchmark 来评估我们的模型。Benchmark下载[此处](https://huggingface.co/datasets/lukaemon/bbh)。
|
||||
|
||||
|
||||
配置LLaMA-7B评估脚本 `tasks/evaluation/evaluate_llama_7b_ptd.sh` 和 LLaMA-13B评估脚本 `tasks/evaluation/evaluate_llama_13b_ptd.sh`:
|
||||
|
||||
修改权重路径, 词表路径和数据集任务路径:
|
||||
```shell
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
DATA_PATH="./bbh/data/test/"
|
||||
TASK="bbh"
|
||||
```
|
||||
|
||||
修改最大生成词参数:
|
||||
```shell
|
||||
--max-new-tokens 32
|
||||
```
|
||||
|
||||
```text
|
||||
# 请注意,评估时需要修改一个deepspeed的bug:
|
||||
# 将 `<deepspeed-installed-path>/runtime/pipe/engine.py` 文件里的第671行注释掉:
|
||||
# self.total_loss += self.loss.detach()
|
||||
```
|
||||
|
||||
开始评估:
|
||||
```shell
|
||||
bash tasks/evaluation/evaluate_llama_7b_ptd.sh
|
||||
bash tasks/evaluation/evaluate_llama_13b_ptd.sh
|
||||
```
|
||||
|
||||
LLaMA-7B/13B在**Ascend NPU**中的评测表现:
|
||||
|
||||
| 任务 | 模型 | 昇腾值 | 社区值 |
|
||||
|-----------------------------------------------------|-----------|------|------|
|
||||
| [BBH](https://huggingface.co/datasets/lukaemon/bbh) | LLaMA-7B | 33.4 | [33.5](https://opencompass.org.cn/dataset-detail/BBH) |
|
||||
| [BBH](https://huggingface.co/datasets/lukaemon/bbh) | LLaMA-13B | 39.2 | [37.9](https://opencompass.org.cn/dataset-detail/BBH) |
|
||||
|
||||
# LLaMA-33B/65B
|
||||
|
||||
LLaMA 模型源于: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/pdf/2302.13971v1.pdf)
|
||||
|
||||
>Touvron, Hugo, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).
|
||||
|
||||
## 训练
|
||||
|
||||
LLaMA-33B/65B 训练的硬件配置:
|
||||
|
||||
| 硬件 | 配置 |
|
||||
|:---:|:----------------:|
|
||||
| NPU | 32 x Ascend NPUs |
|
||||
|
||||
### 数据集
|
||||
模型使用 alpaca 数据集训练
|
||||
|
||||
### 脚本
|
||||
1. 拷贝仓库到个人服务器
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
2. 搭建环境
|
||||
```shell
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# 安装 torch 和 torch_npu
|
||||
# ARM
|
||||
wget https://download.pytorch.org/whl/torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
|
||||
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.1.0.post4_XXXXXX-cp38-cp38m-manylinux2014_aarch64.whl
|
||||
|
||||
# X86
|
||||
#pip install torch==2.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
|
||||
#pip install torch_npu-2.1.0.post4_XXXXXX-cp38-cp38m-manylinux2014_aarch64.whl
|
||||
|
||||
# 安装 megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# 安装 deepspeed 和 deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# 安装其他包
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. 下载权重
|
||||
|
||||
llama-33B 权重
|
||||
```shell
|
||||
mkdir tokenizer
|
||||
cd ./tokenizer
|
||||
|
||||
# 需要安装 git-lfs
|
||||
git lfs install
|
||||
git clone https://huggingface.co/pinkmanlove/llama-33b-hf
|
||||
cd ..
|
||||
```
|
||||
|
||||
llama-65B 权重
|
||||
```shell
|
||||
mkdir tokenizer
|
||||
cd ./tokenizer
|
||||
|
||||
# 需要安装 git-lfs
|
||||
git lfs install
|
||||
git clone https://huggingface.co/pinkmanlove/llama-65b-hf
|
||||
cd ..
|
||||
```
|
||||
|
||||
4. 预训练权重从 huggingface 格式转换为 AscendSpeed 格式
|
||||
|
||||
llama-33B
|
||||
```shell
|
||||
mkdir model_weights
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./tokenizer \
|
||||
--output-model-dir ./model_weights \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 4 \
|
||||
--merge-mlp \
|
||||
--type 30B
|
||||
```
|
||||
|
||||
llama-65B
|
||||
```shell
|
||||
mkdir model_weights
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./tokenizer \
|
||||
--output-model-dir ./model_weights \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 4 \
|
||||
--type 65B
|
||||
```
|
||||
|
||||
5. 下载数据集
|
||||
```shell
|
||||
# 下载 alpaca 数据集
|
||||
wget http://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json

# 下载 tokenizer 配置 和 (可选择的) 权重:
# http://huggingface.co/pinkmanlove/llama-33b-hf
# http://huggingface.co/pinkmanlove/llama-65b-hf
# 将 tokenizer_config.json 中的 "LLaMATokenizer" 修改为 "LlamaTokenizer" (这是hf的一个bug)
mkdir dataset
python tools/preprocess_data.py --input alpaca_data.json \
                                --output-prefix dataset/alpaca \
                                --tokenizer-type PretrainedFromHF \
                                --tokenizer-name-or-path llama-33b-hf \
                                --tokenizer-not-use-fast \
                                --handler-name GeneralInstructionHandler
                                # 若使用 65B 模型, 请改为 --tokenizer-name-or-path llama-65b-hf
|
||||
```
|
||||
|
||||
6. 配置 llama-33B/65B 预训练脚本:
|
||||
|
||||
AscendSpeed/examples/llama/pretrain_llama_33B_ptd_32p.sh
|
||||
|
||||
AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh
|
||||
|
||||
```bash
|
||||
# 修改 ascend-toolkit 路径
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
|
||||
# 配置词表和数据路径等
|
||||
TOKENIZER_PATH=./dataset/llama_tokenizer # line 16
|
||||
DATA_PATH=./dataset/llama_text_document # line 17
|
||||
```
|
||||
|
||||
7. 启动预训练脚本:
|
||||
|
||||
启动 llama-33B 预训练脚本 : AscendSpeed/examples/llama/pretrain_llama_33B_ptd_32p.sh
|
||||
```bash
|
||||
bash examples/llama/pretrain_llama_33B_ptd_32p.sh
|
||||
```
|
||||
|
||||
启动 llama-65B 预训练脚本 : AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh
|
||||
```bash
|
||||
bash examples/llama/pretrain_llama_65B_ptd_32p.sh
|
||||
```
|
||||
|
||||
为多节点配置 llama-33B/65B 预训练脚本 (在集群的每个节点上启动脚本):
|
||||
|
||||
```shell
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=4
|
||||
NODE_RANK=0
|
||||
```
|
||||
|
||||
训练log如下:
|
||||
|
||||
```Shell
|
||||
iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | gloabl batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
|
||||
iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
|
||||
time (ms)
|
||||
```
|
||||
|
||||
### 性能
|
||||
|
||||
#### 吞吐
|
||||
|
||||
LLaMA-33B/65B在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
|
||||
|
||||
| 设备 | 模型 | tokens吞吐 (tokens/s/p) |
|
||||
|:----:|:---------:|:---------------------:|
|
||||
| 参考 | llama-33B | 776 |
|
||||
| NPUs | llama-33B | 621 |
|
||||
| 参考 | llama-65B | 426 |
|
||||
| NPUs | llama-65B | 348 |
|
||||
|
||||
|
||||
#### 精度
|
||||
|
||||
NPU vs 参考 loss 和相对误差:
|
||||
|
||||
LLaMa-33B
|
||||
|
||||

|
||||
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
LLaMa-65B
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
|
||||
## 推理
|
||||
|
||||
我们支持使用 LLaMA-33B 和 LLaMA-65B 进行文本生成的推理。
|
||||
推理与预训练不同,比如我们需要加载预训练权重和输出样本的长度:
|
||||
|
||||
配置LLaMA-33B推理脚本`examples/llama/generate_llama_33B_ptd.sh`。
|
||||
|
||||
配置LLaMA-65B推理脚本`examples/llama/generate_llama_65B_tp8_pp1.sh`。
|
||||
|
||||
```shell
|
||||
# 修改模型权重路径和分词器路径
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
LLaMA-33B:
|
||||
```shell
|
||||
bash ./examples/llama/generate_llama_33B_ptd.sh
|
||||
```
|
||||
LLaMA-65B:
|
||||
```shell
|
||||
bash ./examples/llama/generate_llama_65B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
部分推理样本如下:
|
||||
|
||||
LLaMA-33B:
|
||||
|
||||

|
||||
|
||||
LLaMA-65B:
|
||||
|
||||

|
||||
|
||||
## 使用基线数据集进行评估
|
||||
|
||||
我们使用 Boolq benchmark 来评估我们的模型。Benchmark下载[此处](https://huggingface.co/datasets/boolq)。
|
||||
|
||||
配置LLaMA-33B评估脚本:
|
||||
|
||||
```shell
|
||||
CHECKPOINT=./llama-33b-tp4-pp2/
|
||||
VOCAB_FILE=./llama-33b-hf/
|
||||
# 配置任务和数据路径
|
||||
DATA_PATH="./boolq/data/test/"
|
||||
TASK="boolq"
|
||||
# 配置生成参数
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/evaluation/evaluation_llama.py \
|
||||
--task-data-path $DATA_PATH \
|
||||
--task $TASK\
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 2 \
|
||||
--max-position-embeddings 1024 \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--num-layers 60 \
|
||||
--hidden-size 6656 \
|
||||
--ffn-hidden-size 17920 \
|
||||
--load ${CHECKPOINT} \
|
||||
--num-attention-heads 52 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path ${VOCAB_FILE} \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--mlp-layer-fusion \
|
||||
--seed 42
|
||||
```
|
||||
|
||||
```shell
|
||||
# 开始评估
|
||||
# llama-65B评估
|
||||
bash tasks/evaluation/evaluate_llama_65B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
LLaMA-33B和LLaMA-65B在**Ascend NPU**中的评测表现:
|
||||
|
||||
| 任务 | 模型 | 昇腾值 | 社区值 |
|
||||
|------------------------------------------------|-----------|------|---------------------------------------------------------------------|
|
||||
| [Boolq](https://huggingface.co/datasets/boolq) | LLaMA-33B | 83.2 | [83.1](https://paperswithcode.com/sota/question-answering-on-boolq) |
|
||||
| [Boolq](https://huggingface.co/datasets/boolq) | LLaMA-65B | 85.7 | [86.6](https://paperswithcode.com/sota/question-answering-on-boolq) |
|
||||
|
||||
## 引用
|
||||
|
||||
|
||||
```shell
|
||||
@article{Touvron2023llama,
|
||||
title={LLaMA: Open and Efficient Foundation Language Models},
|
||||
author={Hugo Touvron*, Thibaut Lavril*, Gautier Izacard*, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Roziere, Naman Goyal,
|
||||
Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave*, Guillaume Lample*},
|
||||
journal={arXiv preprint arXiv:2302.13971},
|
||||
year={2023}}
|
||||
```
|
|
@ -1,614 +0,0 @@
|
|||
# LLaMA-7B/13B
|
||||
|
||||
<p align="left">
|
||||
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/llama/README.md">简体中文</a></b> |
|
||||
<b>English</b>
|
||||
</p>
|
||||
</p>
|
||||
|
||||
|
||||
- [LLaMA-7B/13B](#llama-7b13b)
|
||||
- [Training](#training)
|
||||
- [Script](#script)
|
||||
- [Performance](#performance)
|
||||
- [Machine performance](#machine-performance)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss)
|
||||
- [Inference](#Inference)
|
||||
- [Script](#script)
|
||||
- [Evaluation with Numerous Benchmarks](#Evaluation-with-Numerous-Benchmarks)
|
||||
- [LLaMA-33B/65B](#llama-65b)
|
||||
- [Training](#pre-training)
|
||||
- [Datasets](#datasets)
|
||||
- [Script](#script-1)
|
||||
- [Performance](#performance-1)
|
||||
- [Machine performance](#machine-performance-1)
|
||||
- [Accuracy of the loss](#accuracy-of-the-loss-1)
|
||||
- [Inference](#Inference)
|
||||
- [Script](#script)
|
||||
- [Evaluation with Numerous Benchmarks](#Evaluation-with-Numerous-Benchmarks)
|
||||
- [Citation](#citation)
|
||||
|
||||
## Training
|
||||
|
||||
Here's a hardware summary of pre-training LLaMA-7B/13B:
|
||||
|
||||
| Hardware | Device | Value |
|
||||
|:--------:|:------:|:----------------:|
|
||||
| NPU | A+K | 8 x Ascend NPUs |
|
||||
| NPU | A+X | 16 x Ascend NPUs |
|
||||
|
||||
|
||||
### Script
|
||||
|
||||
1. Clone the repository to your local server:
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
|
||||
2. Build environment
|
||||
|
||||
```bash
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
# install torch and torch_npu
|
||||
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
|
||||
pip install torch_npu-2.1.0.post5_XXXXXX-cp38-cp38m-linux_aarch64.whl
|
||||
# install megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
# install other packages
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
*Note that if you want to train with weights from huggingface, first fix a DeepSpeed checkpoint-loading bug: in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`, change `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0`*
|
||||
|
||||
```python
|
||||
# original deepspeed/runtime/engine.py, about #Lines2746-2748
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None:
|
||||
return False
|
||||
|
||||
# modified
|
||||
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
|
||||
if zero_sd_list is None or len(zero_sd_list) == 0:
|
||||
return False
|
||||
```
|
||||
3. Download the LLaMA-7B/13B tokenizer model and file from [here](https://huggingface.co/ruibin-wang/llama-7b-hf/tree/main)
|
||||
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
mkdir -p dataset/llama
|
||||
cd ./dataset/llama
|
||||
wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/config.json
wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/generation_config.json
wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/special_tokens_map.json
wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/tokenizer.model
wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/tokenizer_config.json
|
||||
cd ..
|
||||
```
|
||||
|
||||
|
||||
4. Prepare dataset. Download the LLaMA-7B/13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
cd dataset/
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
```
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./dataset/llama \
|
||||
--output-prefix ./dataset/llama \
|
||||
--workers 4 \
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--handler-name AlpacaPretrainHandler \
|
||||
--tokenizer-not-use-fast \
|
||||
--append-eod
|
||||
```
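If preprocessing succeeds, the tokenized corpus is written as an indexed mmap pair derived from the `--output-prefix` above; the names below follow from that prefix and match the `DATA` path used in step 6 (shown here only as a quick sanity check):

```shell
# The preprocessed dataset should appear as a .bin/.idx pair next to the prefix.
ls -lh ./dataset/llama_text_document.bin ./dataset/llama_text_document.idx
```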
|
||||
|
||||
5. Weights convert
|
||||
|
||||
Download the LLaMA-7B checkpoint from [here](https://huggingface.co/ruibin-wang/llama-7b-hf/tree/main)
|
||||
```shell
|
||||
mkdir model_from_hf
|
||||
cd ./model_from_hf
|
||||
# you must install git-lfs
|
||||
git clone https://huggingface.co/ruibin-wang/llama-7b-hf
|
||||
cd ..
|
||||
```
|
||||
|
||||
Download the LLaMA-13B checkpoint from [here](https://huggingface.co/ruibin-wang/llama-13b-hf/tree/main)
|
||||
```shell
|
||||
mkdir model_from_hf
|
||||
cd ./model_from_hf
|
||||
# you must install git-lfs
|
||||
git clone https://huggingface.co/ruibin-wang/llama-13b-hf
|
||||
cd ..
|
||||
```
|
||||
|
||||
In order to adapt to the LLaMA-7B/13B model, the following script is used to convert the model pre-training weights.
|
||||
|
||||
LLaMA-7B
|
||||
```shell
|
||||
mkdir model_weights
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./model_from_hf/llama-7b/ \
|
||||
--output-model-dir ./model_weights/llama-7b \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 7B \
|
||||
--deepspeed
|
||||
```
|
||||
|
||||
LLaMA-13B
|
||||
```shell
|
||||
# Single machine with 8p
|
||||
mkdir model_weights
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./model_from_hf/llama-13b/ \
|
||||
--output-model-dir ./model_weights/llama-13b \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 8 \
|
||||
--type 13B
|
||||
|
||||
# Single machine with 16p
|
||||
mkdir model_weights
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./model_from_hf/llama-13b/ \
|
||||
--output-model-dir ./model_weights/llama-13b \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--type 13B
|
||||
```
|
||||
|
||||
6. Config LLaMA-7B/13B pre-training script.
|
||||
|
||||
```shell
|
||||
# modify the script according to your own ascend-toolkit path
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
# modify the original dataset path in the script according to your own dataset path
|
||||
TOKENIZER_PATH=./dataset/llama #tokenizer path
|
||||
DATA=./dataset/llama_text_document #processed dataset
|
||||
CHECKPOINT=./model_weights/
|
||||
```
|
||||
*Note that if you do not load weights for pre-training, remove the `--load` parameter from the training script*
|
||||
*Note that if you use the instruction dataset, add the `--is-instruction-dataset` parameter to the training script; otherwise remove this parameter*
|
||||
|
||||
7. Launch LLaMA-7B/13B pre-training script.
|
||||
|
||||
LLaMA-7B
|
||||
```shell
|
||||
bash examples/llama/pretrain_llama_7B_zero_8p.sh
|
||||
```
|
||||
|
||||
LLaMA-13B
|
||||
```shell
|
||||
# 8p
|
||||
bash examples/llama/pretrain_llama_13B_ptd_8p.sh
|
||||
# 16p
|
||||
bash examples/llama/pretrain_llama_13B_ptd_16p.sh
|
||||
```
|
||||
|
||||
### Performance
|
||||
|
||||
#### Machine performance
|
||||
|
||||
The performance of LLaMA-7B/13B in **Ascend NPU** and **Reference**:
|
||||
|
||||
| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|
||||
|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|-------------------------------------|
|
||||
| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.83 | 3763 | 4.35 | 159.9 |
|
||||
| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 4.31 | 161.5 |
|
||||
| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.925 | 1894 | 17.27 | 205.57 |
|
||||
| NPUs | 910 1*16p | LLaMA-13B | 2048 | 0.88 | 1800 | 36.32 | 195.58 |
|
||||
| Reference | - | LLaMA-13B | 2048 | 0.98 | 2012 | 16.33 | 217.37 |
|
||||
|
||||
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
LLama-7b with huggingface weights NPU vs GPU loss.
|
||||

|
||||
|
||||
LLama-13b with huggingface weights NPU vs GPU loss.
|
||||

|
||||
|
||||
|
||||
## Inference
|
||||
|
||||
We support AscendSpeed Inference for text generation with LLaMA-7B and LLaMA-13B.
|
||||
Inference differs from pre-training; for example, we need to load a pre-trained checkpoint and set the length of the output samples:
|
||||
|
||||
Config LLaMA-7B inference script `examples/llama/generate_llama_7B_deepspeed.sh` and LLaMA-13B inference script `examples/llama/generate_llama_13B_tp1_pp8.sh`.
|
||||
|
||||
```shell
|
||||
# modify the model weight path and tokenizer path
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
LLaMA-7B:
|
||||
```shell
|
||||
bash ./examples/llama/generate_llama_7B_deepspeed.sh
|
||||
```
|
||||
|
||||
LLaMA-13B:
|
||||
```shell
|
||||
bash ./examples/llama/generate_llama_13B_tp1_pp8.sh
|
||||
```
|
||||
|
||||
Some inference samples are as follows:
|
||||
|
||||
LLaMA-7B:
|
||||
|
||||

|
||||
|
||||
LLaMA-13B:
|
||||
|
||||

|
||||
|
||||
|
||||
## Evaluation with Numerous Benchmarks
|
||||
|
||||
We use the BBH benchmark to evaluate our model. Download the benchmark [here](https://huggingface.co/datasets/lukaemon/bbh).
|
||||
|
||||
Config LLaMA-7B evaluation script `tasks/evaluation/evaluate_llama_7b_ptd.sh` and LLaMA-13B evaluation script `tasks/evaluation/evaluate_llama_13b_ptd.sh`:
|
||||
|
||||
Modify checkpoint path, vocab path, dataset path and task:
|
||||
|
||||
```shell
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
DATA_PATH="./bbh/data/test/"
|
||||
TASK="bbh"
|
||||
```
|
||||
Change the max new tokens:
|
||||
```shell
|
||||
--max-new-tokens 32
|
||||
```
|
||||
|
||||
```text
|
||||
# Note that, a deepspeed bug needs to be fixed during evaluation:
|
||||
# Comment out line 671 in the file `<deepspeed-installed-path>/runtime/pipe/engine.py`:
|
||||
# self.total_loss += self.loss.detach()
|
||||
```
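If you prefer to script the edit above, the following sketch locates the installed package and comments out the offending line; it assumes `deepspeed` is importable from the current environment and that line 671 is still the `self.total_loss` statement quoted above:

```shell
# Find the deepspeed install directory and comment out line 671 of pipe/engine.py.
DS_DIR=$(python -c "import os, deepspeed; print(os.path.dirname(deepspeed.__file__))")
sed -i '671s/^/# /' "$DS_DIR/runtime/pipe/engine.py"
```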
|
||||
|
||||
Start evaluation:
|
||||
```shell
|
||||
bash tasks/evaluation/evaluate_llama_7b_ptd.sh
|
||||
bash tasks/evaluation/evaluate_llama_13b_ptd.sh
|
||||
```
|
||||
|
||||
The evaluation performance of LLaMA-7B/13B in **Ascend NPU**:
|
||||
|
||||
| Task | Model | NPU | Benchmark |
|
||||
|---------|-----------|------|-----------|
|
||||
| [BBH](https://huggingface.co/datasets/lukaemon/bbh) | LLaMA-7B | 33.4 | [33.5](https://opencompass.org.cn/dataset-detail/BBH) |
|
||||
| [BBH](https://huggingface.co/datasets/lukaemon/bbh) | LLaMA-13B | 39.2 | [37.9](https://opencompass.org.cn/dataset-detail/BBH) |
|
||||
|
||||
# LLaMA-33B/65B
|
||||
|
||||
This directory contains some of the scripts that were used to produce the results in AscendSpeed. These scripts show an example of how to run llama-65B from the terminal.
|
||||
|
||||
LLaMA model is from: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/pdf/2302.13971v1.pdf)
|
||||
|
||||
>Touvron, Hugo, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).
|
||||
|
||||
## Training
|
||||
|
||||
LLaMA's performance is better than GPT-3 with fewer parameters. The 33B/65B LLaMA models are comparable to Google's Chinchilla-70B and PaLM-540B.
|
||||
|
||||
Here's a hardware summary of training llama-33B/65B:
|
||||
|
||||
| Hardware | Value |
|
||||
| :------: |:----------------:|
|
||||
| NPU | 32 x Ascend NPUs |
|
||||
|
||||
|
||||
### Datasets
|
||||
The model was trained using the alpaca dataset.
|
||||
|
||||
### Script
|
||||
1. Clone the repository to your local server
|
||||
```shell
|
||||
git clone https://gitee.com/ascend/AscendSpeed.git
|
||||
cd AscendSpeed
|
||||
mkdir logs
|
||||
mkdir ckpt
|
||||
```
|
||||
2. Set up the AscendSpeed environment.
|
||||
```shell
|
||||
# python3.8
|
||||
conda create -n test python=3.8
|
||||
conda activate test
|
||||
|
||||
# install torch and torch_npu
|
||||
# ARM
|
||||
wget https://download.pytorch.org/whl/torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
|
||||
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
|
||||
pip install torch_npu-2.1.0.post4_XXXXXX-cp38-cp38m-manylinux2014_aarch64.whl
|
||||
|
||||
# X86
|
||||
#pip install torch==2.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
|
||||
#pip install torch_npu-2.1.0.post4_XXXXXX-cp38-cp38m-manylinux2014_aarch64.whl
|
||||
|
||||
# install megatron-core
|
||||
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
|
||||
|
||||
# install deepspeed and deepspeed_npu
|
||||
pip install deepspeed==0.9.2
|
||||
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
|
||||
cd deepspeed_npu
|
||||
pip3 install -e ./
|
||||
cd ..
|
||||
|
||||
# install other packages
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
3. Download checkpoint
|
||||
|
||||
llama-33B checkpoint
|
||||
```shell
|
||||
mkdir tokenizer
|
||||
cd ./tokenizer
|
||||
|
||||
# make sure you have git-lfs installed (https://git-lfs.com)
|
||||
git lfs install
|
||||
git clone https://huggingface.co/pinkmanlove/llama-33b-hf
|
||||
cd ..
|
||||
```
|
||||
|
||||
llama-65B checkpoint
|
||||
```shell
|
||||
mkdir tokenizer
|
||||
cd ./tokenizer
|
||||
|
||||
# make sure you have git-lfs installed (https://git-lfs.com)
|
||||
git lfs install
|
||||
git clone https://huggingface.co/pinkmanlove/llama-65b-hf
|
||||
cd ..
|
||||
```
|
||||
4. In order to adapt to the llama-33B/65B model, use the following script to convert the model pre-training weights
|
||||
|
||||
llama-33B
|
||||
```shell
|
||||
mkdir model_weights
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./tokenizer \
|
||||
--output-model-dir ./model_weights \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 4 \
|
||||
--merge-mlp \
|
||||
--type 30B
|
||||
```
|
||||
|
||||
llama-65B
|
||||
```shell
|
||||
mkdir model_weights
|
||||
|
||||
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
|
||||
python $SCRIPT_PATH \
|
||||
--input-model-dir ./tokenizer \
|
||||
--output-model-dir ./model_weights \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 4 \
|
||||
--type 65B
|
||||
```
|
||||
|
||||
5. Download dataset
|
||||
```shell
|
||||
# for llama, download the alpaca dataset, e.g.
wget http://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json

# download tokenizer configs and (optionally) weights from
# http://huggingface.co/pinkmanlove/llama-33b-hf
# http://huggingface.co/pinkmanlove/llama-65b-hf
# revise "LLaMATokenizer" as "LlamaTokenizer" in tokenizer_config.json
mkdir dataset
python tools/preprocess_data.py --input alpaca_data.json \
                                --output-prefix dataset/alpaca \
                                --tokenizer-type PretrainedFromHF \
                                --tokenizer-name-or-path llama-33b-hf \
                                --tokenizer-not-use-fast \
                                --handler-name GeneralInstructionHandler
                                # for llama-65B, use --tokenizer-name-or-path llama-65b-hf instead
|
||||
```
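The `tokenizer_config.json` edit mentioned in the comments can also be applied with a one-liner; this is only a sketch and assumes the tokenizer files were downloaded into a local `./llama-33b-hf/` directory, matching the `--tokenizer-name-or-path` used above:

```shell
# Rewrite the tokenizer class name in place ("LlamaTokenizer" is the class name transformers expects).
sed -i 's/"LLaMATokenizer"/"LlamaTokenizer"/' ./llama-33b-hf/tokenizer_config.json
```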
|
||||
|
||||
6. Config llama-33B/65B pre-training script:
|
||||
AscendSpeed/examples/llama/pretrain_llama_33B_ptd_32p.sh
|
||||
AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh
|
||||
|
||||
```bash
|
||||
# modify the script according to your own conda and ascend-toolkit path
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
|
||||
# modify the original dataset path in the script according to your own dataset path
|
||||
TOKENIZER_PATH=./dataset/llama_tokenizer # line 16
|
||||
DATA_PATH=./dataset/llama_text_document # line 17
|
||||
```
|
||||
|
||||
7. Launch pre-training script:
|
||||
|
||||
Launch llama-33B pre-training script : AscendSpeed/examples/llama/pretrain_llama_33B_ptd_32p.sh
|
||||
```bash
|
||||
bash examples/llama/pretrain_llama_33B_ptd_32p.sh
|
||||
```
|
||||
|
||||
Launch llama-65B pre-training script : AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh
|
||||
```bash
|
||||
bash examples/llama/pretrain_llama_65B_ptd_32p.sh
|
||||
```
|
||||
Config the llama-33B/65B pre-training script for multi-node training (launch the pre-training script on each machine in the cluster):
|
||||
|
||||
```shell
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=4
|
||||
NODE_RANK=0
|
||||
```
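As an illustration, the head of `pretrain_llama_65B_ptd_32p.sh` might look like this on the third of four machines; `192.0.2.10` is a placeholder for the address of the rank-0 machine, and `NODE_RANK` takes a different value (0 to NNODES-1) on each node:

```shell
# Placeholder values; edit the same block in the script on every machine.
MASTER_ADDR=192.0.2.10   # IP of the rank-0 machine, reachable from all nodes
MASTER_PORT=6001
NNODES=4
NODE_RANK=2              # 0 on the first machine, 1 on the second, and so on
```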
|
||||
The training log will look like this:
|
||||
|
||||
```Shell
|
||||
iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | gloabl batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
|
||||
iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
|
||||
time (ms)
|
||||
```
|
||||
|
||||
### Performance
|
||||
|
||||
#### Machine performance
|
||||
|
||||
The performance of llama-33B/65B in **Ascend NPU** and **Reference**:
|
||||
|
||||
| Device | Model | throughput rate (tokens/s/p) |
|
||||
|:---------:|:---------:|:----------------------------:|
|
||||
| Reference | llama-33B | 776 |
|
||||
| NPUs | llama-33B | 621 |
|
||||
| Reference | llama-65B | 426 |
|
||||
| NPUs | llama-65B | 348 |
|
||||
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
NPU vs GPU loss and relative error:
|
||||
|
||||
LLaMa-33B
|
||||
|
||||
The NPU runs smoothly, resource usage is stable, no errors are reported during the run, the loss shows a steady decreasing trend, and the convergence speed is as expected.
|
||||
|
||||

|
||||
|
||||
The relative error between NPU and GPU Loss is less than 0.03 throughout, as expected.
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
LLaMa-65B
|
||||
|
||||
The NPU runs smoothly, resource usage is stable, no errors are reported during the run, the loss shows a steady decreasing trend, and the convergence speed is as expected.
|
||||
|
||||

|
||||
|
||||
The relative error between NPU and GPU Loss is less than 0.02 throughout, as expected.
|
||||
|
||||

|
||||
|
||||
## Inference
|
||||
|
||||
We support AscendSpeed Inference for text generation with LLaMA-33B and LLaMA-65B.
|
||||
Inference differs from pre-training; for example, we need to load a pre-trained checkpoint and set the length of the output samples:
|
||||
|
||||
Config LLaMA-33B inference script `examples/llama/generate_llama_33B_ptd.sh`.
|
||||
|
||||
Config LLaMA-65B inference script `examples/llama/generate_llama_65B_tp8_pp1.sh`.
|
||||
|
||||
```shell
|
||||
# modify the model weight path and tokenizer path
|
||||
CHECKPOINT=<checkpoint-path>
|
||||
VOCAB_FILE=<vocabfile-path>
|
||||
```
|
||||
|
||||
LLaMA-33B:
|
||||
```shell
|
||||
bash ./examples/llama/generate_llama_33B_ptd.sh
|
||||
```
|
||||
LLaMA-65B:
|
||||
```shell
|
||||
bash ./examples/llama/generate_llama_65B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
Some inference samples are as follows:
|
||||
|
||||
LLaMA-33B:
|
||||
|
||||

|
||||
|
||||
LLaMA-65B:
|
||||
|
||||

|
||||
|
||||
|
||||
## Evaluation with Numerous Benchmarks
|
||||
|
||||
We use the BoolQ benchmark to evaluate our model. Download the benchmark [here](https://huggingface.co/datasets/boolq).
|
||||
|
||||
Config LLaMA-33B evaluation script:
|
||||
|
||||
```shell
|
||||
CHECKPOINT=./llama-33b-tp4-pp2/
|
||||
VOCAB_FILE=./llama-33b-hf/
|
||||
# configure the task and data path
|
||||
DATA_PATH="./boolq/data/test/"
|
||||
TASK="boolq"
|
||||
# configure the generation parameters
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/evaluation/evaluation_llama.py \
|
||||
--task-data-path $DATA_PATH \
|
||||
--task $TASK\
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 2 \
|
||||
--max-position-embeddings 1024 \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--num-layers 60 \
|
||||
--hidden-size 6656 \
|
||||
--ffn-hidden-size 17920 \
|
||||
--load ${CHECKPOINT} \
|
||||
--num-attention-heads 52 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path ${VOCAB_FILE} \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--mlp-layer-fusion \
|
||||
--seed 42
|
||||
```
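The snippet above launches with `$DISTRIBUTED_ARGS` without defining it; in the evaluation scripts it is typically assembled as in the single-node examples elsewhere in this repository (a sketch with the usual single-node values):

```shell
# Single-node launcher arguments consumed by torch.distributed.launch above.
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
```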
|
||||
|
||||
```shell
|
||||
# start evaluation
|
||||
# evaluate llama-65B
|
||||
bash tasks/evaluation/evaluate_llama_65B_tp8_pp1.sh
|
||||
```
|
||||
|
||||
The evaluation performance of LLaMA-33B/65B in **Ascend NPU**:
|
||||
|
||||
| Task | Model | NPU | Benchmark |
|
||||
|------------------------------------------------|-----------|------|-----------|
|
||||
| [Boolq](https://huggingface.co/datasets/boolq) | LLaMA-33B | 83.2 | [83.1](https://paperswithcode.com/sota/question-answering-on-boolq) |
|
||||
| [Boolq](https://huggingface.co/datasets/boolq) | LLaMA-65B | 85.7 | [86.6](https://paperswithcode.com/sota/question-answering-on-boolq) |
|
||||
|
||||
## Citation
|
||||
|
||||
You may also consider the original work in your reference:
|
||||
|
||||
```shell
|
||||
@article{Touvron2023llama,
|
||||
title={LLaMA: Open and Efficient Foundation Language Models},
|
||||
author={Hugo Touvron*, Thibaut Lavril*, Gautier Izacard*, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Roziere, Naman Goyal,
|
||||
Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave*, Guillaume Lample*},
|
||||
journal={arXiv preprint arXiv:2302.13971},
|
||||
year={2023}}
|
||||
```
|
|
@ -1,38 +0,0 @@
|
|||
#!/bin/bash
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13824 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--num-attention-heads 40 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
|
@ -1,39 +0,0 @@
|
|||
#!/bin/bash
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--num-layers 60 \
|
||||
--hidden-size 6656 \
|
||||
--ffn-hidden-size 17920 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--num-attention-heads 52 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--mlp-layer-fusion \
|
|
@ -1,38 +0,0 @@
|
|||
#!/bin/bash
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 80 \
|
||||
--hidden-size 8192 \
|
||||
--ffn-hidden-size 22016 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--num-attention-heads 64 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
|
@ -1,56 +0,0 @@
|
|||
#!/bin/bash
|
||||
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
NNODES=1
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
ZERO_STAGE=0
|
||||
MICRO_BATCH_SIZE=1
|
||||
config_json="./ds_config.json"
|
||||
|
||||
cat <<EOT > $config_json
|
||||
{
|
||||
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
|
||||
"gradient_clipping": 1.0,
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE
|
||||
},
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 1000,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1,
|
||||
"initial_scale_power": 8
|
||||
},
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
EOT
|
||||
|
||||
deepspeed --num_nodes $NNODES --num_gpus $NPUS_PER_NODE \
|
||||
./tasks/inference/inference_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--num-layers 32 \
|
||||
--hidden-size 4096 \
|
||||
--ffn-hidden-size 11008 \
|
||||
--num-attention-heads 32 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--load "${CHECKPOINT}" \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size ${MICRO_BATCH_SIZE} \
|
||||
--seq-length 2048 \
|
||||
--max-new-tokens 128 \
|
||||
--seed 42 \
|
||||
--deepspeed \
|
||||
--deepspeed_config ${config_json} \
|
||||
--no-pipeline-parallel \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
|
@ -1,56 +0,0 @@
|
|||
#!/bin/bash
|
||||
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
NNODES=1
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
ZERO_STAGE=0
|
||||
MICRO_BATCH_SIZE=1
|
||||
config_json="./ds_config.json"
|
||||
|
||||
cat <<EOT > $config_json
|
||||
{
|
||||
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
|
||||
"gradient_clipping": 1.0,
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE
|
||||
},
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 500,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1,
|
||||
"initial_scale_power": 12
|
||||
},
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
EOT
|
||||
|
||||
deepspeed --num_nodes $NNODES --num_gpus $NPUS_PER_NODE \
|
||||
./tasks/inference/inference_llama_pipeline.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--num-layers 30 \
|
||||
--hidden-size 4096 \
|
||||
--ffn-hidden-size 11008 \
|
||||
--num-attention-heads 32 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--load "${CHECKPOINT}" \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--deepspeed \
|
||||
--deepspeed_config ${config_json} \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
|
@ -1,44 +0,0 @@
|
|||
#!/bin/bash
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=1
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
LORA_CHECKPOINT="your lora checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
./tasks/inference/inference_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 32 \
|
||||
--hidden-size 4096 \
|
||||
--ffn-hidden-size 11008 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--lora-load "${LORA_CHECKPOINT}" \
|
||||
--num-attention-heads 32 \
|
||||
--seq-length 1024 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--lora-r 16 \
|
||||
--lora-alpha 32 \
|
||||
--lora-target-modules query_key_value dense gate_proj dense_h_to_4h dense_4h_to_h \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
|
@ -1,39 +0,0 @@
|
|||
#!/bin/bash
|
||||
export TOKENIZERS_PARALLELISM=false
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=4
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT="your megatron checkpoint path"
|
||||
VOCAB_FILE="your vocab path"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 2 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--num-layers 32 \
|
||||
--hidden-size 4096 \
|
||||
--ffn-hidden-size 11008 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--num-attention-heads 32 \
|
||||
--max-position-embeddings 2048 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 1024 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
|
|
@ -1,65 +0,0 @@
|
|||
# This is an example: train llama using PTD.
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=16
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
DATA_PATH=<data-path>
|
||||
LOAD_CHECKPOINT_PATH=<origin-ckpt-path>
|
||||
SAVE_CHECKPOINT_PATH=<ckpt-path>
|
||||
TOKENIZER_PATH=<tokenizer-path>
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--ffn-hidden-size 13824 \
|
||||
--num-attention-heads 40 \
|
||||
--micro-batch-size 1 \
|
||||
--global-batch-size 512 \
|
||||
--seq-length 2048 \
|
||||
--max-position-embeddings 2048 \
|
||||
--train-iters 5000 \
|
||||
--lr-decay-iters 5000 \
|
||||
--load $LOAD_CHECKPOINT_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--attention-dropout 0.0 \
|
||||
--hidden-dropout 0.0 \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 1.0e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-7 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .01 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--initial-loss-scale 4096.0 \
|
||||
--checkpoint-activations \
|
||||
--use-fused-rotary-pos-emb \
|
||||
--use-flash-attn \
|
||||
--use-distributed-optimizer \
|
||||
--use-fused-rmsnorm \
|
||||
--fp16 | tee logs/train_llama_13B.log
|
|
@ -1,60 +0,0 @@
|
|||
# This is an example: train llama using PTD.
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
DATA_PATH=./dataset/llama_text_document
|
||||
CHECKPOINT=./model_weights/llama-13b
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_llama.py \
|
||||
--is-instruction-dataset \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 8 \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13824 \
|
||||
--num-attention-heads 40 \
|
||||
--micro-batch-size 1 \
|
||||
--global-batch-size 128 \
|
||||
--seq-length 2048 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--max-position-embeddings 2048 \
|
||||
--train-iters 5000 \
|
||||
--load $CHECKPOINT \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path ./dataset/llama/ \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 1.0e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-7 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .01 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--initial-loss-scale 4096.0 \
|
||||
--checkpoint-activations \
|
||||
--recompute-method custom \
|
||||
--recomputation-layer-num 3 2 1 0 0 0 0 0 \
|
||||
--use-fused-rotary-pos-emb \
|
||||
--use-flash-attn \
|
||||
--use-fused-rmsnorm \
|
||||
--fp16 | tee logs/train_13B.log
|
|
@ -1,68 +0,0 @@
|
|||
# This is an example: train llama using PTD.
|
||||
|
||||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=4
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
TOKENIZER_PATH=./dataset/llama_tokenizer
|
||||
DATA_PATH=./dataset/llama_text_document
|
||||
CHECKPOINT_PATH=./ckpt
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 4 \
|
||||
--pipeline-model-parallel-size 4 \
|
||||
--num-layers 60 \
|
||||
--hidden-size 6656 \
|
||||
--ffn-hidden-size 17920 \
|
||||
--num-attention-heads 52 \
|
||||
--micro-batch-size 4 \
|
||||
--global-batch-size 512 \
|
||||
--seq-length 2048 \
|
||||
--max-position-embeddings 2048 \
|
||||
--train-iters 50000 \
|
||||
--lr-decay-iters 320000 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--load $CHECKPOINT_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 0.00015 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-5 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .01 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--initial-loss-scale 524288.0 \
|
||||
--sequence-parallel \
|
||||
--mlp-layer-fusion \
|
||||
--use-distributed-optimizer \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--use-fused-rmsnorm \
|
||||
--use-flash-attn \
|
||||
--release-fp32-grad \
|
||||
--checkpoint-activations \
|
||||
--recompute-method custom \
|
||||
--recomputation-layer-num 3 0 0 0 \
|
||||
--fp16 | tee logs/train.log
|
|
@ -1,65 +0,0 @@
|
|||
# This is an example: train llama using PTD.
|
||||
|
||||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
export MULTI_STREAM_MEMORY_REUSE=1
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=4
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
TOKENIZER_PATH=./dataset/llama_tokenizer
|
||||
DATA_PATH=./dataset/llama_text_document
|
||||
CHECKPOINT_PATH=./ckpt
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 4 \
|
||||
--num-layers 80 \
|
||||
--hidden-size 8192 \
|
||||
--ffn-hidden-size 22016 \
|
||||
--num-attention-heads 64 \
|
||||
--micro-batch-size 2 \
|
||||
--global-batch-size 128 \
|
||||
--seq-length 2048 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--max-position-embeddings 2048 \
|
||||
--train-iters 50000 \
|
||||
--lr-decay-iters 320000 \
|
||||
--save $CHECKPOINT_PATH \
|
||||
--load $CHECKPOINT_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 0.00015 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-5 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .01 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--auto-recompute-device-size -1 \
|
||||
--initial-loss-scale 524288.0 \
|
||||
--sequence-parallel \
|
||||
--swiglu \
|
||||
--no-add-gate \
|
||||
--use-flash-attn \
|
||||
--use-fused-rmsnorm \
|
||||
--bf16 | tee logs/train.log
|
|
@ -1,101 +0,0 @@
|
|||
# This is an example: train llama using TD,
|
||||
# the number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export INF_NAN_MODE_ENABLE=1
|
||||
source /path/to/cann/ascend-toolkit/set_env.sh
|
||||
|
||||
GPUS_PER_NODE=8
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6000
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
|
||||
|
||||
DATA=./dataset/llama_text_document
|
||||
CHECKPOINT=./model_weights/llama-7b
|
||||
|
||||
DS_CONFIG=deepspeed_config_7B.json
|
||||
ZERO_STAGE=2
|
||||
GLOBAL_BATCH=64
|
||||
MICRO_BATCH=2
|
||||
|
||||
cat <<EOT > $DS_CONFIG
|
||||
{
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 1000,
|
||||
"initial_scale_power": 8,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1
|
||||
},
|
||||
|
||||
"optimizer": {
|
||||
"type": "Adam"
|
||||
},
|
||||
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE,
|
||||
"allgather_partitions": true,
|
||||
"allgather_bucket_size": 5e8,
|
||||
"overlap_comm": true,
|
||||
"reduce_scatter": true,
|
||||
"reduce_bucket_size": 5e8,
|
||||
"contiguous_gradients": true
|
||||
},
|
||||
|
||||
"gradient_accumulation_steps": 4,
|
||||
"train_batch_size": $GLOBAL_BATCH,
|
||||
"train_micro_batch_size_per_gpu":$MICRO_BATCH,
|
||||
"zero_allow_untested_optimizer": true
|
||||
}
|
||||
EOT
|
||||
|
||||
ds_args=""
|
||||
ds_args=" --deepspeed ${ds_args}"
|
||||
ds_args=" --no-pipeline-parallel ${ds_args}"
|
||||
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
|
||||
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
|
||||
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
|
||||
|
||||
deepspeed pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 1 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 32 \
|
||||
--hidden-size 4096 \
|
||||
--ffn-hidden-size 11008 \
|
||||
--num-attention-heads 32 \
|
||||
--micro-batch-size $MICRO_BATCH \
|
||||
--global-batch-size $GLOBAL_BATCH \
|
||||
--seq-length 2048 \
|
||||
--max-position-embeddings 2048 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--train-iters 5000 \
|
||||
--load $CHECKPOINT \
|
||||
--data-path $DATA \
|
||||
--split 900,50,50 \
|
||||
--tokenizer-name-or-path ./dataset/llama/ \
|
||||
--tokenizer-not-use-fast \
|
||||
--init-method-std 0.01 \
|
||||
--distributed-backend nccl \
|
||||
--lr 1.0e-6 \
|
||||
--min-lr 1.0e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--lr-warmup-fraction .01 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.95 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--no-load-optim \
|
||||
--no-load-rng \
|
||||
--no-bias-gelu-fusion \
|
||||
--use-flash-attn \
|
||||
--use-fused-rmsnorm \
|
||||
$ds_args \
|
||||
--fp16 | tee logs/train_7B.log
|
|
@ -1,120 +0,0 @@
|
|||
# This is an example: train llama using TD,
|
||||
|
||||
# the number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/home/anaconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
GPUS_PER_NODE=8
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6000
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
|
||||
|
||||
TP=1
|
||||
PP=1
|
||||
|
||||
DATA_PATH=<data-path>
|
||||
LOAD_CHECKPOINT_PATH=<origin-ckpt-path>
|
||||
SAVE_CHECKPOINT_PATH=<ckpt-path>
|
||||
TOKENIZER_PATH=<tokenizer-path>
|
||||
|
||||
DS_CONFIG=deepspeed_config_13B.json
|
||||
ZERO_STAGE=2
|
||||
|
||||
MICRO_BATCH=4
|
||||
GRADIENT_ACCUMULATION_STEP=4
|
||||
GLOBAL_BATCH=$(($MICRO_BATCH * $GRADIENT_ACCUMULATION_STEP * $WORLD_SIZE))
|
||||
EPOCH=2
|
||||
TRAIN_ITERS=$((52000 / $GLOBAL_BATCH * $EPOCH))
|
||||
echo $TRAIN_ITERS
|
||||
SAVE_INTERVAL=$(($TRAIN_ITERS / 4))
|
||||
echo $SAVE_INTERVAL
|
||||
|
||||
export HCCL_OP_BASE_FFTS_MODE_ENABLE=TRUE
|
||||
|
||||
cat <<EOT > $DS_CONFIG
|
||||
{
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 1000,
|
||||
"initial_scale_power": 8,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1
|
||||
},
|
||||
|
||||
"optimizer": {
|
||||
"type": "Adam"
|
||||
},
|
||||
|
||||
"zero_optimization": {
|
||||
"stage": $ZERO_STAGE,
|
||||
"allgather_partitions": true,
|
||||
"allgather_bucket_size": 1e8,
|
||||
"overlap_comm": true,
|
||||
"reduce_scatter": true,
|
||||
"reduce_bucket_size": 1e8,
|
||||
"contiguous_gradients": true
|
||||
},
|
||||
|
||||
"gradient_accumulation_steps": ${GRADIENT_ACCUMULATION_STEP},
|
||||
"train_batch_size": $GLOBAL_BATCH,
|
||||
"train_micro_batch_size_per_gpu":$MICRO_BATCH,
|
||||
"zero_allow_untested_optimizer": true
|
||||
}
|
||||
EOT
|
||||
|
||||
ds_args=""
|
||||
ds_args=" --deepspeed ${ds_args}"
|
||||
ds_args=" --no-pipeline-parallel ${ds_args}"
|
||||
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
|
||||
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
|
||||
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
|
||||
|
||||
deepspeed pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size ${TP} \
|
||||
--pipeline-model-parallel-size ${PP} \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13824 \
|
||||
--num-attention-heads 40 \
|
||||
--micro-batch-size $MICRO_BATCH \
|
||||
--global-batch-size $GLOBAL_BATCH \
|
||||
--seq-length 256 \
|
||||
--max-position-embeddings 2048 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--train-iters ${TRAIN_ITERS} \
|
||||
--lr-decay-iters ${TRAIN_ITERS} \
|
||||
--save $SAVE_CHECKPOINT_PATH \
|
||||
--load $LOAD_CHECKPOINT_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 2e-5 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 0 \
|
||||
--weight-decay 0. \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-iters 200 \
|
||||
--checkpoint-activations \
|
||||
--log-interval 1 \
|
||||
--save-interval ${SAVE_INTERVAL} \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--use-cpu-initialization \
|
||||
--lora-target-modules query_key_value dense gate_proj dense_h_to_4h dense_4h_to_h \
|
||||
--lora-r 16 \
|
||||
--lora-alpha 32 \
|
||||
--is-instruction-dataset \
|
||||
$ds_args \
|
||||
--fp16 | tee logs/train_13B_deepspeed.log
|
|
@ -1,83 +0,0 @@
|
|||
# This is an example: train llama using PTD.
|
||||
|
||||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6002
|
||||
NNODES=1
|
||||
NODE_RANK=0 #1
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
MICRO_BATCH=4
|
||||
GRADIENT_ACCUMULATION_STEP=4
|
||||
GLOBAL_BATCH=$(($MICRO_BATCH * $GRADIENT_ACCUMULATION_STEP * $WORLD_SIZE))
|
||||
EPOCH=5
|
||||
TRAIN_ITERS=$((52000 / $GLOBAL_BATCH * $EPOCH))
|
||||
echo $TRAIN_ITERS
|
||||
SAVE_INTERVAL=$(($TRAIN_ITERS / 4))
|
||||
echo $SAVE_INTERVAL
|
||||
|
||||
TP=4
|
||||
PP=2
|
||||
|
||||
DATA_PATH=<data-path>
|
||||
LOAD_CHECKPOINT_PATH=<origin-ckpt-path>
|
||||
SAVE_CHECKPOINT_PATH=<ckpt-path>
|
||||
TOKENIZER_PATH=<tokenizer-path>
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch ${DISTRIBUTED_ARGS} \
|
||||
pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size ${TP} \
|
||||
--pipeline-model-parallel-size ${PP} \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13824 \
|
||||
--num-attention-heads 40 \
|
||||
--micro-batch-size ${MICRO_BATCH} \
|
||||
--global-batch-size ${GLOBAL_BATCH} \
|
||||
--seq-length 256 \
|
||||
--max-position-embeddings 2048 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--train-iters ${TRAIN_ITERS} \
|
||||
--save ${SAVE_CHECKPOINT_PATH} \
|
||||
--load ${LOAD_CHECKPOINT_PATH} \
|
||||
--checkpoint-activations \
|
||||
--data-path ${DATA_PATH} \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path ${TOKENIZER_PATH} \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 2e-5 \
|
||||
--lr-decay-style cosine \
|
||||
--lr-decay-iters ${TRAIN_ITERS} \
|
||||
--min-lr 0 \
|
||||
--weight-decay 0. \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-iters 200 \
|
||||
--log-interval 1 \
|
||||
--save-interval ${SAVE_INTERVAL} \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--initial-loss-scale 4096.0 \
|
||||
--seed 1234 \
|
||||
--zero-stage 2 \
|
||||
--is-instruction-dataset \
|
||||
--lora-target-modules query_key_value dense gate_proj dense_h_to_4h dense_4h_to_h \
|
||||
--lora-r 16 \
|
||||
--lora-alpha 32 \
|
||||
--fp16 | tee logs/train_13B_megatron.log
|
||||
|
|
@ -1,42 +0,0 @@
|
|||
#!/bin/bash
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
export INF_NAN_MODE_ENABLE=0
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
|
||||
--nnodes $NNODES \
|
||||
--node_rank $NODE_RANK \
|
||||
--master_addr $MASTER_ADDR \
|
||||
--master_port $MASTER_PORT"
|
||||
|
||||
CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1/
|
||||
VOCAB_FILE=./model/LLAMA-2-13B-hf
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
|
||||
--no-contiguous-buffers-in-local-ddp \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--ffn-hidden-size 13824 \
|
||||
--load "${CHECKPOINT}" \
|
||||
--num-attention-heads 40 \
|
||||
--max-position-embeddings 4096 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path "$VOCAB_FILE" \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 4096 \
|
||||
--max-new-tokens 256 \
|
||||
--seed 42 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
|
@ -1,47 +0,0 @@
|
|||
#!/bin/bash
|
||||
|
||||
# The number of parameters is not aligned
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
|
||||
# modify config according to your own actual situation
|
||||
CHECKPOINT="your model path"
|
||||
TOKENIZER_PATH=./llama2-70b-hf/
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 48 \
|
||||
--hidden-size 8192 \
|
||||
--ffn-hidden-size 22016 \
|
||||
--mlp-layer-fusion \
|
||||
--load ${CHECKPOINT} \
|
||||
--num-attention-heads 64 \
|
||||
--position-embedding-type rope \
|
||||
--group-query-attention \
|
||||
--num-query-groups 8 \
|
||||
--max-position-embeddings 4096 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path ${TOKENIZER_PATH} \
|
||||
--pad-vocab-size-to 32000 \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 4096 \
|
||||
--max-new-tokens 256 \
|
||||
--use-flash-attn \
|
||||
--use-fused-rmsnorm \
|
||||
--seed 42 \
|
||||
--normalization RMSNorm \
|
|
@ -1,47 +0,0 @@
|
|||
#!/bin/bash
|
||||
|
||||
# The number of parameters is not aligned
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
|
||||
# modify config according to your own actual situation
|
||||
CHECKPOINT="your model path"
|
||||
TOKENIZER_PATH=./llama2-70b-hf/
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--num-layers 80 \
|
||||
--hidden-size 8192 \
|
||||
--ffn-hidden-size 28672 \
|
||||
--mlp-layer-fusion \
|
||||
--load ${CHECKPOINT} \
|
||||
--num-attention-heads 64 \
|
||||
--position-embedding-type rope \
|
||||
--group-query-attention \
|
||||
--num-query-groups 8 \
|
||||
--max-position-embeddings 4096 \
|
||||
--tokenizer-type PretrainedFromHF \
|
||||
--tokenizer-name-or-path ${TOKENIZER_PATH} \
|
||||
--pad-vocab-size-to 32000 \
|
||||
--tokenizer-not-use-fast \
|
||||
--fp16 \
|
||||
--micro-batch-size 1 \
|
||||
--seq-length 4096 \
|
||||
--max-new-tokens 256 \
|
||||
--use-flash-attn \
|
||||
--use-fused-rmsnorm \
|
||||
--seed 42 \
|
||||
--normalization RMSNorm \
|
|
@ -1,66 +0,0 @@
|
|||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
export COMBINED_ENABLE=1
|
||||
export INF_NAN_MODE_ENABLE=0
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6000
|
||||
NNODES=1
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
DATA_PATH=./dataset/llama_text_document
|
||||
LOAD_CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1
|
||||
SAVE_CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1_save/
|
||||
TOKENIZER_PATH=./model/LLAMA-2-13B-hf
|
||||
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--sequence-parallel \
|
||||
--num-layers 40 \
|
||||
--hidden-size 5120 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--ffn-hidden-size 13824 \
|
||||
--num-attention-heads 40 \
|
||||
--micro-batch-size 2 \
|
||||
--global-batch-size 256 \
|
||||
--seq-length 4096 \
|
||||
--max-position-embeddings 4096 \
|
||||
--train-iters 5000 \
|
||||
--lr-warmup-iters 1000 \
|
||||
--save $SAVE_CHECKPOINT \
|
||||
--load $LOAD_CHECKPOINT \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 1e-6 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1e-8 \
|
||||
--weight-decay 1e-1 \
|
||||
--clip-grad 1.0 \
|
||||
--initial-loss-scale 4096.0 \
|
||||
--seed 1234 \
|
||||
--adam-beta1 0.9 \
|
||||
--adam-beta2 0.95 \
|
||||
--adam-eps 1.0e-5 \
|
||||
--log-interval 1 \
|
||||
--save-interval 1000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 100 \
|
||||
--use-fused-rotary-pos-emb \
|
||||
--use-flash-attn \
|
||||
--use-fused-rmsnorm \
|
||||
--bf16 | tee logs/train_13B_llama2_npu.log
|
|
@ -1,68 +0,0 @@
|
|||
# This is an example: training llama using PTD/
|
||||
|
||||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
export COMBINED_ENABLE=1
|
||||
export MULTI_STREAM_MEMORY_REUSE=1
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=2
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
DATA_PATH=./dataset/llama_text_document
|
||||
TOKENIZER_PATH=./tokenizer/
|
||||
CHECKPOINT_LOAD_PATH=./load_ckpt
|
||||
CHECKPOINT_SAVE_PATH=./save_ckpt
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# Main script
|
||||
nohup python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--sequence-parallel \
|
||||
--use-flash-attn \
|
||||
--mlp-layer-fusion \
|
||||
--use-fused-rmsnorm \
|
||||
--release-fp32-grad \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 2 \
|
||||
--num-layers 48 \
|
||||
--hidden-size 8192 \
|
||||
--ffn-hidden-size 22016 \
|
||||
--num-attention-heads 64 \
|
||||
--normalization RMSNorm \
|
||||
--position-embedding-type rope \
|
||||
--group-query-attention \
|
||||
--num-query-groups 8 \
|
||||
--micro-batch-size 2 \
|
||||
--global-batch-size 512 \
|
||||
--seq-length 4096 \
|
||||
--max-position-embeddings 4096 \
|
||||
--train-iters 1000 \
|
||||
--lr-decay-iters 320000 \
|
||||
--save $CHECKPOINT_SAVE_PATH \
|
||||
--load $CHECKPOINT_LOAD_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path $TOKENIZER_PATH \
|
||||
--tokenizer-not-use-fast \
|
||||
--pad-vocab-size-to 32000 \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 0.00015 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-5 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .01 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--eval-interval 10000 \
|
||||
--eval-iters 10 \
|
||||
--bf16 | tee logs/train.log &
|
|
@ -1,67 +0,0 @@
|
|||
# This is an example: training llama using PTD/
|
||||
|
||||
# The number of parameters is not aligned
|
||||
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
|
||||
export HCCL_CONNECT_TIMEOUT=1200
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
export COMBINED_ENABLE=1
|
||||
export MULTI_STREAM_MEMORY_REUSE=1
|
||||
|
||||
# Change for multinode config
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6001
|
||||
NNODES=8
|
||||
NODE_RANK=0
|
||||
NPUS_PER_NODE=8
|
||||
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
|
||||
|
||||
DATA_PATH=./dataset/llama_text_document
|
||||
CHECKPOINT_LOAD_PATH=./load_ckpt
|
||||
CHECKPOINT_SAVE_PATH=./save_ckpt
|
||||
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
|
||||
|
||||
# Main script
|
||||
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
|
||||
pretrain_llama.py \
|
||||
--DDP-impl local \
|
||||
--use-flash-attn \
|
||||
--sequence-parallel \
|
||||
--mlp-layer-fusion \
|
||||
--use-fused-rmsnorm \
|
||||
--release-fp32-grad \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 8 \
|
||||
--num-layers 80 \
|
||||
--hidden-size 8192 \
|
||||
--ffn-hidden-size 28672 \
|
||||
--num-attention-heads 64 \
|
||||
--position-embedding-type rope \
|
||||
--normalization RMSNorm \
|
||||
--group-query-attention \
|
||||
--num-query-groups 8 \
|
||||
--micro-batch-size 2 \
|
||||
--global-batch-size 1024 \
|
||||
--seq-length 4096 \
|
||||
--max-position-embeddings 4096 \
|
||||
--train-iters 5000 \
|
||||
--lr-decay-iters 320000 \
|
||||
--save $CHECKPOINT_SAVE_PATH \
|
||||
--load $CHECKPOINT_LOAD_PATH \
|
||||
--data-path $DATA_PATH \
|
||||
--tokenizer-name-or-path ./dataset/llama/ \
|
||||
--tokenizer-not-use-fast \
|
||||
--pad-vocab-size-to 32000 \
|
||||
--data-impl mmap \
|
||||
--split 949,50,1 \
|
||||
--distributed-backend nccl \
|
||||
--lr 0.00015 \
|
||||
--lr-decay-style cosine \
|
||||
--min-lr 1.0e-5 \
|
||||
--weight-decay 1e-2 \
|
||||
--clip-grad 1.0 \
|
||||
--lr-warmup-fraction .01 \
|
||||
--log-interval 1 \
|
||||
--save-interval 10000 \
|
||||
--eval-interval 1000 \
|
||||
--eval-iters 10 \
|
||||
--bf16 | tee logs/train.log
|
|
@ -2,8 +2,8 @@
|
|||
|
||||
export CUDA_DEVICE_MAX_CONNECTIONS=1
|
||||
|
||||
source "/home/usr/cann/envs/ascend-toolkit/set_env.sh"
|
||||
|
||||
source "./envs/ascend-toolkit/set_env.sh"
|
||||
#export ASCEND_RT_VISIBLE_DEVICES=2
|
||||
GPUS_PER_NODE=8
|
||||
MASTER_ADDR=localhost
|
||||
MASTER_PORT=6000
|
||||
|
@ -12,9 +12,9 @@ NODE_RANK=0
|
|||
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
|
||||
|
||||
CKPT_SAVE_DIR="./models/lora_weight_save"
|
||||
DATA_PATH="./dataset_llama2/alpaca_text_document"
|
||||
DATA_PATH="./dataset_llama2/test/alpaca_text_document"
|
||||
TOKENIZER_MODEL="./llama-2-7b-hf/tokenizer.model"
|
||||
CKPT_LOAD_DIR="./llama-2-7b-hf/megatron"
|
||||
CKPT_LOAD_DIR="./llama-2-7b-hf/test"
|
||||
LORA_CHECKPOINT="./models/lora_weight"
|
||||
|
||||
TP=8
|
||||
|
|
|
@ -22,19 +22,6 @@ try:
|
|||
except Exception as e:
|
||||
logging.warning("Warning: You did not install torch_npu")
|
||||
|
||||
from .global_vars import get_args
|
||||
from .global_vars import get_current_global_batch_size
|
||||
from .global_vars import get_num_microbatches
|
||||
from .global_vars import update_num_microbatches
|
||||
from .global_vars import get_tokenizer
|
||||
from .global_vars import get_tensorboard_writer
|
||||
from .global_vars import get_adlr_autoresume
|
||||
from .global_vars import get_timers
|
||||
from .global_vars import get_retro_args
|
||||
from .utils import print_rank_0
|
||||
from .utils import print_rank_last
|
||||
from .utils import is_last_rank
|
||||
from .utils import is_rank_0
|
||||
from .tokenizer import apply_tokenizer_patch
|
||||
from .adaptor_arguments import apply_arguments_patch
|
||||
from .adaptor_model import apply_model_patch
|
||||
|
|
File diff suppressed because it is too large
Load Diff
|
@ -1,695 +0,0 @@
|
|||
# coding=utf-8
|
||||
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""Input/output checkpointing."""
|
||||
|
||||
import os
|
||||
import random
|
||||
import sys
|
||||
import enum
|
||||
import traceback
|
||||
import numpy as np
|
||||
from deepspeed import PipelineEngine
|
||||
from deepspeed.accelerator import get_accelerator
|
||||
import torch
|
||||
from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
|
||||
|
||||
from modellink.enums import PositionEmbeddingType
|
||||
from modellink.utils import WRITE_FILE_DEFAULT_FLAGS, WRITE_FILE_DEFAULT_MODES
|
||||
from modellink.utils import cpu_tensor_reduce_context
|
||||
from modellink.utils import convert_args_to_strs
|
||||
|
||||
from modellink import (get_args,
|
||||
is_rank_0,
|
||||
print_rank_0,
|
||||
update_num_microbatches,
|
||||
utils)
|
||||
from modellink.core import parallel_state, tensor_parallel
|
||||
from modellink.model import DistributedDataParallel as LocalDDP, Float16Module
|
||||
from modellink.model.lora_utils import is_enable_lora, get_lora_state_dict, lora_custom_load_fn_for_deepspeed, \
|
||||
get_lora_model_classes, get_lora_state_dict_with_deepspeed, update_model_state_dict_with_megatron, \
|
||||
get_lora_load_fn_with_deepspeed, handle_lora_modules_to_save_key_with_megatron
|
||||
from modellink.error_utils import check_equal, ensure_valid
|
||||
|
||||
_CHECKPOINT_VERSION = None
|
||||
|
||||
|
||||
def set_checkpoint_version(value):
|
||||
global _CHECKPOINT_VERSION
|
||||
if _CHECKPOINT_VERSION is not None:
|
||||
error_info = "checkpoint versions do not match"
|
||||
check_equal(_CHECKPOINT_VERSION, value, error_info)
|
||||
_CHECKPOINT_VERSION = value
|
||||
|
||||
|
||||
def get_checkpoint_version():
|
||||
global _CHECKPOINT_VERSION
|
||||
return _CHECKPOINT_VERSION
|
||||
|
||||
|
||||
def check_checkpoint_args(checkpoint_args):
|
||||
"""
|
||||
Ensure fixed arguments for a model are the same for the input
|
||||
arguments and the one retrieved from checkpoint.
|
||||
"""
|
||||
args = get_args()
|
||||
|
||||
def _compare(arg_name, old_arg_name=None):
|
||||
if old_arg_name is not None:
|
||||
checkpoint_value = getattr(checkpoint_args, old_arg_name)
|
||||
else:
|
||||
checkpoint_value = getattr(checkpoint_args, arg_name)
|
||||
args_value = getattr(args, arg_name)
|
||||
if isinstance(args_value, enum.Enum):
|
||||
args_value = args_value.name
|
||||
if isinstance(checkpoint_value, enum.Enum):
|
||||
checkpoint_value = checkpoint_value.name
|
||||
error_info = '{} value from checkpoint ({}) is not equal to the ' \
|
||||
'input argument value ({}).'.format(
|
||||
arg_name, checkpoint_value, args_value)
|
||||
check_equal(checkpoint_value, args_value, error_info)
|
||||
|
||||
if not args.mos and not args.kd:
|
||||
_compare('num_layers')
|
||||
_compare('hidden_size')
|
||||
_compare('num_attention_heads')
|
||||
_compare('position_embedding_type')
|
||||
# with alibi we can change `max_position_embeddings`
|
||||
if args.position_embedding_type != PositionEmbeddingType.alibi:
|
||||
_compare('max_position_embeddings')
|
||||
|
||||
if args.vocab_file:
|
||||
_compare('make_vocab_size_divisible_by')
|
||||
_compare('padded_vocab_size')
|
||||
_compare('tokenizer_type')
|
||||
if get_checkpoint_version() < 3.0:
|
||||
_compare('tensor_model_parallel_size',
|
||||
old_arg_name='model_parallel_size')
|
||||
if get_checkpoint_version() >= 3.0:
|
||||
_compare('tensor_model_parallel_size')
|
||||
_compare('pipeline_model_parallel_size')
|
||||
|
||||
|
||||
def ensure_directory_exists(filename):
|
||||
"""Build filename's path if it does not already exists."""
|
||||
dirname = os.path.dirname(filename)
|
||||
if not os.path.exists(dirname):
|
||||
os.makedirs(dirname)
|
||||
|
||||
|
||||
def get_checkpoint_name(checkpoints_path, iteration,
|
||||
release=False, model_name='model_optim_rng.pt'):
|
||||
"""A unified checkpoint name."""
|
||||
if release:
|
||||
directory = 'release'
|
||||
else:
|
||||
directory = 'iter_{:07d}'.format(iteration)
|
||||
# Use both the tensor and pipeline MP rank.
|
||||
if parallel_state.get_pipeline_model_parallel_world_size() == 1:
|
||||
return os.path.join(checkpoints_path, directory,
|
||||
'mp_rank_{:02d}'.format(
|
||||
parallel_state.get_tensor_model_parallel_rank()),
|
||||
model_name)
|
||||
return os.path.join(checkpoints_path, directory,
|
||||
'mp_rank_{:02d}_{:03d}'.format(
|
||||
parallel_state.get_tensor_model_parallel_rank(),
|
||||
parallel_state.get_pipeline_model_parallel_rank()),
|
||||
model_name)
|
||||
|
||||
|
||||
def get_checkpoint_tracker_filename(checkpoints_path):
|
||||
"""
|
||||
Tracker file rescords the latest chckpoint during
|
||||
training to restart from.
|
||||
"""
|
||||
return os.path.join(checkpoints_path, 'latest_checkpointed_iteration.txt')
|
||||
|
||||
|
||||
def get_distributed_optimizer_checkpoint_name(model_checkpoint_name):
|
||||
return os.path.join(os.path.dirname(model_checkpoint_name),
|
||||
"distrib_optim.pt")
|
||||
|
||||
|
||||
def save_checkpoint(iteration, model, optimizer, lr_scheduler):
|
||||
"""Save a model checkpoint."""
|
||||
args = get_args()
|
||||
|
||||
# Only rank zero of the data parallel writes to the disk.
|
||||
if not args.deepspeed:
|
||||
unwrap_model_classes = (torchDDP, LocalDDP, Float16Module)
|
||||
if is_enable_lora():
|
||||
unwrap_model_classes += get_lora_model_classes()
|
||||
model = utils.unwrap_model(model, unwrap_model_classes)
|
||||
|
||||
print_rank_0('saving checkpoint at iteration {:7d} to {}'.format(
|
||||
iteration, args.save))
|
||||
|
||||
checkpoint_name = get_checkpoint_name(args.save, iteration)
|
||||
|
||||
# Save distributed optimizer's custom parameter state.
|
||||
if args.use_distributed_optimizer:
|
||||
optim_checkpoint_name = \
|
||||
get_distributed_optimizer_checkpoint_name(checkpoint_name)
|
||||
if parallel_state.get_data_parallel_rank() == 0:
|
||||
ensure_directory_exists(optim_checkpoint_name)
|
||||
optimizer.save_parameter_state(optim_checkpoint_name)
|
||||
|
||||
if not torch.distributed.is_initialized() or parallel_state.get_data_parallel_rank() == 0 \
|
||||
or args.deepspeed:
|
||||
|
||||
# Arguments, iteration, and model.
|
||||
state_dict = {}
|
||||
state_dict['args'] = convert_args_to_strs(args)
|
||||
state_dict['checkpoint_version'] = 3.0
|
||||
state_dict['iteration'] = iteration
|
||||
state_dict['tokens'] = args.consumed_train_tokens
|
||||
|
||||
# DeepSpeed saves the model/optimizer/scheduler
|
||||
if not args.deepspeed:
|
||||
get_model_state_dict(model, state_dict)
|
||||
|
||||
# Optimizer stuff.
|
||||
if not args.no_save_optim:
|
||||
if optimizer is not None:
|
||||
state_dict['optimizer'] = optimizer.state_dict()
|
||||
if lr_scheduler is not None:
|
||||
state_dict['lr_scheduler'] = lr_scheduler.state_dict()
|
||||
|
||||
# RNG states.
|
||||
if not args.no_save_rng:
|
||||
state_dict['random_rng_state'] = random.getstate()
|
||||
state_dict['np_rng_state'] = np.random.get_state()
|
||||
state_dict['torch_rng_state'] = torch.get_rng_state()
|
||||
state_dict['cuda_rng_state'] = get_accelerator().get_rng_state()
|
||||
state_dict['rng_tracker_states'] \
|
||||
= tensor_parallel.get_cuda_rng_tracker().get_states()
|
||||
|
||||
# Save.
|
||||
if not args.deepspeed:
|
||||
ensure_directory_exists(checkpoint_name)
|
||||
with cpu_tensor_reduce_context(args.save_to_cpu):
|
||||
torch.save(state_dict, checkpoint_name)
|
||||
|
||||
if args.deepspeed:
|
||||
original_state_dict = None
|
||||
# modellink model uses state_dict_for_save_checkpointing instead of the standard state_dict
|
||||
# state_dict is used by deepspeed for module saving so it needs to point to the right function
|
||||
if args.no_pipeline_parallel:
|
||||
original_state_dict = model[0].module.state_dict
|
||||
|
||||
def state_dict_for_save_checkpoint_deepspeed(destination=None, prefix='', keep_vars=False):
|
||||
return model[0].module.state_dict_for_save_checkpoint(prefix=prefix, keep_vars=keep_vars)
|
||||
|
||||
model[0].module.state_dict = state_dict_for_save_checkpoint_deepspeed
|
||||
if is_enable_lora():
|
||||
if original_state_dict is None:
|
||||
original_state_dict = model[0].module.state_dict
|
||||
model[0].module.state_dict = get_lora_state_dict_with_deepspeed(model=model[0])
|
||||
|
||||
# Saving is a collective communication
|
||||
checkpoint_name = get_checkpoint_name(args.save, iteration)
|
||||
|
||||
# Trim off the filename and mp_rank_* directory.
|
||||
for _ in range(3):
|
||||
checkpoint_name = os.path.dirname(checkpoint_name)
|
||||
with cpu_tensor_reduce_context(args.save_to_cpu):
|
||||
model[0].save_checkpoint(checkpoint_name, client_state=state_dict)
|
||||
|
||||
if original_state_dict is not None:
|
||||
model[0].module.state_dict = original_state_dict
|
||||
|
||||
save_checkpoint_post_process(iteration)
|
||||
|
||||
|
||||
def get_model_state_dict(model, state_dict):
|
||||
if len(model) == 1:
|
||||
state_dict['model'] = model[0].state_dict_for_save_checkpoint()
|
||||
if is_enable_lora():
|
||||
state_dict['model'] = get_lora_state_dict(state_dict['model'])
|
||||
else:
|
||||
for i in range(len(model)):
|
||||
parallel_state.set_virtual_pipeline_model_parallel_rank(i)
|
||||
state_dict['model%d' % i] = model[i].state_dict_for_save_checkpoint()
|
||||
if is_enable_lora():
|
||||
state_dict['model%d' % i] = get_lora_state_dict(state_dict['model%d' % i])
|
||||
|
||||
|
||||
def read_metadata(tracker_filename):
|
||||
# Read the tracker file and either set the iteration or
|
||||
# mark it as a release checkpoint.
|
||||
iteration = 0
|
||||
release = False
|
||||
with open(tracker_filename, 'r') as f:
|
||||
metastring = f.read().strip()
|
||||
try:
|
||||
iteration = int(metastring)
|
||||
except ValueError:
|
||||
release = metastring == 'release'
|
||||
if not release:
|
||||
print_rank_0('ERROR: Invalid metadata file {}. Exiting'.format(
|
||||
tracker_filename))
|
||||
sys.exit()
|
||||
ensure_valid(iteration > 0 or release, 'error parsing metadata file {}'.format(
|
||||
tracker_filename))
|
||||
|
||||
# Get the max iteration retrieved across the ranks.
|
||||
if torch.distributed.is_initialized():
|
||||
iters_cuda = torch.cuda.LongTensor([iteration])
|
||||
torch.distributed.all_reduce(iters_cuda, op=torch.distributed.ReduceOp.MAX)
|
||||
max_iter = iters_cuda[0].item()
|
||||
|
||||
# We should now have all the same iteration.
|
||||
# If not, print a warning and chose the maximum
|
||||
# iteration across all ranks.
|
||||
if iteration != max_iter:
|
||||
rank = torch.distributed.get_rank()
|
||||
print('WARNING: on rank {} found iteration {} in the '
|
||||
'metadata while max iteration across the ranks '
|
||||
'is {}, replacing it with max iteration.'.format(
|
||||
rank, iteration, max_iter), flush=True)
|
||||
else:
|
||||
# When loading a checkpoint outside of training (for example,
|
||||
# when editing it), we might not have torch distributed
|
||||
# initialized, in this case, just assume we have the latest
|
||||
max_iter = iteration
|
||||
return max_iter, release
|
||||
|
||||
|
||||
def save_checkpoint_post_process(iteration):
|
||||
args = get_args()
|
||||
|
||||
# Wait so everyone is done (necessary)
|
||||
if torch.distributed.is_initialized():
|
||||
torch.distributed.barrier()
|
||||
|
||||
print_rank_0(' successfully saved checkpoint at iteration {:7d} to {}'.format(
|
||||
iteration, args.save))
|
||||
|
||||
# And update the latest iteration
|
||||
if is_rank_0():
|
||||
tracker_filename = get_checkpoint_tracker_filename(args.save)
|
||||
with os.fdopen(os.open(tracker_filename, WRITE_FILE_DEFAULT_FLAGS, WRITE_FILE_DEFAULT_MODES), 'w') as f:
|
||||
f.write(str(iteration))
|
||||
|
||||
# Wait so everyone is done (not necessary)
|
||||
if torch.distributed.is_initialized():
|
||||
torch.distributed.barrier()
|
||||
|
||||
|
||||
def _transpose_first_dim(t, num_splits, num_splits_first, model):
|
||||
input_shape = t.size()
|
||||
# We use a self_attention module but the values extracted aren't
|
||||
# specific to self attention so should work for cross attention as well
|
||||
while hasattr(model, 'module'):
|
||||
model = model.module
|
||||
# attention_module = model.language_model.encoder.layers[0].self_attention
|
||||
attention_module = model.language_model.encoder.layers[0].attention
|
||||
hidden_size_per_attention_head = attention_module.hidden_size_per_attention_head
|
||||
num_attention_heads_per_partition = attention_module.num_attention_heads_per_partition
|
||||
if num_splits_first:
|
||||
"""[num_splits * np * hn, h]
|
||||
-->(view) [num_splits, np, hn, h]
|
||||
-->(tranpose) [np, num_splits, hn, h]
|
||||
-->(view) [np * num_splits * hn, h] """
|
||||
|
||||
intermediate_shape = \
|
||||
(num_splits, num_attention_heads_per_partition,
|
||||
hidden_size_per_attention_head) + input_shape[1:]
|
||||
|
||||
t = t.view(*intermediate_shape)
|
||||
t = t.transpose(0, 1).contiguous()
|
||||
else:
|
||||
"""[np * hn * num_splits, h]
|
||||
-->(view) [np, hn, num_splits, h]
|
||||
-->(tranpose) [np, num_splits, hn, h]
|
||||
-->(view) [np * num_splits * hn, h] """
|
||||
|
||||
intermediate_shape = \
|
||||
(num_attention_heads_per_partition,
|
||||
hidden_size_per_attention_head, num_splits) + \
|
||||
input_shape[1:]
|
||||
|
||||
t = t.view(*intermediate_shape)
|
||||
t = t.transpose(1, 2).contiguous()
|
||||
t = t.view(*input_shape)
|
||||
|
||||
return t
|
||||
|
||||
|
||||
def fix_query_key_value_ordering(model, checkpoint_version):
|
||||
"""Fix up query/key/value matrix ordering if checkpoint
|
||||
version is smaller than 2.0
|
||||
"""
|
||||
if checkpoint_version < 2.0:
|
||||
if isinstance(model, list):
|
||||
check_equal(len(model), 1)
|
||||
model = model[0]
|
||||
for name, param in model.named_parameters():
|
||||
if name.endswith(('.query_key_value.weight', '.query_key_value.bias')):
|
||||
if checkpoint_version == 0:
|
||||
fixed_param = _transpose_first_dim(param.data, 3, True, model)
|
||||
elif checkpoint_version == 1.0:
|
||||
fixed_param = _transpose_first_dim(param.data, 3, False, model)
|
||||
else:
|
||||
print_rank_0(f"Invalid checkpoint version {checkpoint_version}.")
|
||||
sys.exit()
|
||||
param.data.copy_(fixed_param)
|
||||
if name.endswith(('.key_value.weight', '.key_value.bias')):
|
||||
if checkpoint_version == 0:
|
||||
fixed_param = _transpose_first_dim(param.data, 2, True, model)
|
||||
elif checkpoint_version == 1.0:
|
||||
fixed_param = _transpose_first_dim(param.data, 2, False, model)
|
||||
else:
|
||||
print_rank_0(f"Invalid checkpoint version {checkpoint_version}.")
|
||||
sys.exit()
|
||||
param.data.copy_(fixed_param)
|
||||
print_rank_0(" succesfully fixed query-key-values ordering for"
|
||||
" checkpoint version {}".format(checkpoint_version))
|
||||
|
||||
|
||||
def read_tracker(load_dir):
|
||||
args = get_args()
|
||||
iteration = 0
|
||||
release = False
|
||||
# Read the tracker file and set the iteration.
|
||||
tracker_filename = get_checkpoint_tracker_filename(load_dir)
|
||||
|
||||
# If no tracker file, return iteration zero.
|
||||
if not os.path.isfile(tracker_filename):
|
||||
print_rank_0('WARNING: could not find the metadata file {} '.format(
|
||||
tracker_filename))
|
||||
print_rank_0(' will not load any checkpoints and will start from '
|
||||
'random')
|
||||
return False, iteration, release
|
||||
|
||||
# Otherwise, read the tracker file and either set the iteration or
|
||||
# mark it as a release checkpoint.
|
||||
with open(tracker_filename, 'r') as f:
|
||||
metastring = f.read().strip()
|
||||
try:
|
||||
iteration = int(metastring)
|
||||
except ValueError:
|
||||
release = metastring == 'release'
|
||||
if not release:
|
||||
print_rank_0('ERROR: Invalid metadata file {}. Exiting'.format(
|
||||
tracker_filename))
|
||||
sys.exit()
|
||||
|
||||
if not args.mos and not args.kd:
|
||||
error_message = 'error parsing metadata file {}'.format(tracker_filename)
|
||||
ensure_valid(iteration > 0 or release, error_message)
|
||||
return True, iteration, release
|
||||
|
||||
|
||||
def get_state_dict_and_release(load_dir, lora_load_dir=None):
|
||||
args = get_args()
|
||||
|
||||
read_tracker_success, iteration, release = read_tracker(load_dir)
|
||||
if not read_tracker_success:
|
||||
raise ValueError(f"{load_dir} do not have tracker.")
|
||||
if lora_load_dir:
|
||||
read_tracker_success, lora_iteration, lora_release = read_tracker(lora_load_dir)
|
||||
if not read_tracker_success:
|
||||
raise ValueError(f"{lora_load_dir} do not have tracker.")
|
||||
|
||||
# Checkpoint.
|
||||
checkpoint_name = get_checkpoint_name(load_dir, iteration, release)
|
||||
print_rank_0(f' loading checkpoint from {args.load} at iteration {iteration}')
|
||||
model_checkpoint_name = None
|
||||
if lora_load_dir: # 有lora目录时,其他参数都应从lora目录读取,load目录只提供原始模型权重
|
||||
model_checkpoint_name = checkpoint_name
|
||||
checkpoint_name = get_checkpoint_name(lora_load_dir, lora_iteration, lora_release)
|
||||
print_rank_0(
|
||||
f' loading lora checkpoint from {args.lora_load} at iteration {lora_iteration} release:{lora_release}')
|
||||
release = lora_release
|
||||
|
||||
# Load the checkpoint.
|
||||
try:
|
||||
state_dict = load_state_dict_from_checkpoint_with_megatron(checkpoint_name,
|
||||
model_checkpoint_name=model_checkpoint_name)
|
||||
except ModuleNotFoundError:
|
||||
from megatron.fp16_deprecated import loss_scaler
|
||||
# For backward compatibility.
|
||||
print_rank_0(' > deserializing using the old code structure ...')
|
||||
sys.modules['fp16.loss_scaler'] = sys.modules[
|
||||
'megatron.fp16_deprecated.loss_scaler']
|
||||
sys.modules['megatron.fp16.loss_scaler'] = sys.modules[
|
||||
'megatron.fp16_deprecated.loss_scaler']
|
||||
state_dict = load_state_dict_from_checkpoint_with_megatron(checkpoint_name,
|
||||
model_checkpoint_name=model_checkpoint_name)
|
||||
sys.modules.pop('fp16.loss_scaler', None)
|
||||
sys.modules.pop('megatron.fp16.loss_scaler', None)
|
||||
except BaseException as e:
|
||||
print_rank_0('could not load the checkpoint')
|
||||
traceback.print_exc()
|
||||
sys.exit()
|
||||
|
||||
return state_dict, release, checkpoint_name
|
||||
|
||||
|
||||
def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True, load_only_weights=False):
|
||||
"""Load a model checkpoint and return the iteration.
|
||||
strict (bool): whether to strictly enforce that the keys in
|
||||
:attr:`state_dict` of the checkpoint match the names of
|
||||
parameters and buffers in model.
|
||||
"""
|
||||
args = get_args()
|
||||
load_dir = getattr(args, load_arg)
|
||||
lora_load_dir = getattr(args, 'lora_load')
|
||||
|
||||
if args.deepspeed:
|
||||
if not os.path.exists(load_dir):
|
||||
print_rank_0(f"WARNING: could not find the metadata file {load_dir}")
|
||||
print_rank_0(f" will not load any checkpoints and will start from random")
|
||||
return 0
|
||||
custom_load_fn, load_dir = get_custom_load_fn(model=model[0], load_dir=load_dir, lora_load_dir=lora_load_dir)
|
||||
if args.no_pipeline_parallel:
|
||||
load_zero_optim = sum(['zero' in file for file in os.listdir(load_dir)]) > 0
|
||||
else:
|
||||
load_zero_optim = sum(['global' in file for file in os.listdir(load_dir)]) > 0
|
||||
release = not load_zero_optim
|
||||
loaded_dir, state_dict = model[0].load_checkpoint(
|
||||
load_dir,
|
||||
# It is only loaded not strictly when lora is turned on and the original model is loaded.
|
||||
load_module_strict=not (release and is_enable_lora()),
|
||||
load_module_only=not load_zero_optim,
|
||||
load_optimizer_states=load_zero_optim,
|
||||
load_lr_scheduler_states=load_zero_optim,
|
||||
custom_load_fn=custom_load_fn
|
||||
)
|
||||
if loaded_dir is None:
|
||||
print_rank_0(f"WARNING: could not find the metadata file {load_dir}")
|
||||
print_rank_0(f" will not load any checkpoints and will start from random")
|
||||
return 0
|
||||
checkpoint_name = loaded_dir # 开启lora时主要参数会从lora_load里读取,所以最后打印时用checkpoint_name传递
|
||||
else:
|
||||
unwrap_model_classes = (torchDDP, LocalDDP, Float16Module)
|
||||
if is_enable_lora():
|
||||
unwrap_model_classes += get_lora_model_classes()
|
||||
model = utils.unwrap_model(model, unwrap_model_classes)
|
||||
|
||||
try:
|
||||
state_dict, release, checkpoint_name = get_state_dict_and_release(load_dir=load_dir,
|
||||
lora_load_dir=lora_load_dir)
|
||||
except ValueError as e:
|
||||
print_rank_0(f"{e}")
|
||||
return 0
|
||||
|
||||
# set checkpoint version
|
||||
set_checkpoint_version(state_dict.get('checkpoint_version', 0))
|
||||
|
||||
# Set iteration.
|
||||
if args.finetune or release or args.reset_iteration or load_only_weights:
|
||||
iteration = 0
|
||||
# Make DeepSpeed engine aware of this reset of iteration
|
||||
model[0].global_steps = 0
|
||||
else:
|
||||
iteration = load_iteration_from_state_dict(state_dict, checkpoint_name)
|
||||
|
||||
# Check arguments.
|
||||
reset_train_valid_samples = args.reset_iteration
|
||||
if not load_only_weights and not reset_train_valid_samples:
|
||||
check_equal(args.consumed_train_samples, 0)
|
||||
check_equal(args.consumed_valid_samples, 0)
|
||||
if 'args' in state_dict:
|
||||
checkpoint_args = state_dict['args']
|
||||
check_checkpoint_args(checkpoint_args)
|
||||
args.consumed_train_samples = getattr(checkpoint_args,
|
||||
'consumed_train_samples', 0)
|
||||
update_num_microbatches(consumed_samples=args.consumed_train_samples)
|
||||
args.consumed_valid_samples = getattr(checkpoint_args,
|
||||
'consumed_valid_samples', 0)
|
||||
else:
|
||||
print_rank_0('could not find arguments in the checkpoint ...')
|
||||
|
||||
# Model.
|
||||
if not args.deepspeed:
|
||||
if is_enable_lora() and iteration == 0:
|
||||
strict = False
|
||||
if len(model) == 1:
|
||||
result = model[0].load_state_dict(state_dict['model'], strict=strict)
|
||||
if strict and result:
|
||||
print_rank_0(f"load checkpoint result:{result}")
|
||||
else:
|
||||
for i in range(len(model)):
|
||||
parallel_state.set_virtual_pipeline_model_parallel_rank(i)
|
||||
model[i].load_state_dict(state_dict['model%d' % i], strict=strict)
|
||||
|
||||
# Fix up query/key/value matrix ordering if needed
|
||||
checkpoint_version = get_checkpoint_version()
|
||||
print_rank_0(f' checkpoint version {checkpoint_version}')
|
||||
fix_query_key_value_ordering(model, checkpoint_version)
|
||||
|
||||
# Optimizer.
|
||||
if not args.deepspeed:
|
||||
if not release and not args.finetune and not args.no_load_optim:
|
||||
load_optimizer_from_state_dict(optimizer, lr_scheduler, state_dict, checkpoint_name)
|
||||
|
||||
# rng states.
|
||||
if not release and not args.finetune and not args.no_load_rng:
|
||||
try:
|
||||
# Load distributed optimizer's custom parameter state.
|
||||
if args.use_distributed_optimizer:
|
||||
tracker_filename = get_checkpoint_tracker_filename(load_dir)
|
||||
iteration, release = read_metadata(tracker_filename)
|
||||
model_checkpoint_name = \
|
||||
get_checkpoint_name(load_dir, iteration, release)
|
||||
optim_checkpoint_name = \
|
||||
get_distributed_optimizer_checkpoint_name(
|
||||
model_checkpoint_name)
|
||||
optimizer.load_parameter_state(optim_checkpoint_name, iteration)
|
||||
|
||||
random.setstate(state_dict['random_rng_state'])
|
||||
np.random.set_state(state_dict['np_rng_state'])
|
||||
torch.set_rng_state(state_dict['torch_rng_state'])
|
||||
get_accelerator().set_rng_state(state_dict['cuda_rng_state'])
|
||||
# Check for empty states array
|
||||
if not state_dict['rng_tracker_states']:
|
||||
raise KeyError
|
||||
tensor_parallel.get_cuda_rng_tracker().set_states(
|
||||
state_dict['rng_tracker_states'])
|
||||
except KeyError:
|
||||
print_rank_0('Unable to load rng state from checkpoint {}. '
|
||||
'Specify --no-load-rng or --finetune to prevent '
|
||||
'attempting to load the rng state, '
|
||||
'exiting ...'.format(checkpoint_name))
|
||||
sys.exit()
|
||||
|
||||
# Some utilities want to load a checkpoint without distributed being initialized
|
||||
if torch.distributed.is_initialized():
|
||||
torch.distributed.barrier()
|
||||
|
||||
print_rank_0(f' successfully loaded checkpoint from {checkpoint_name} at iteration {iteration}')
|
||||
|
||||
return iteration
|
||||
|
||||
|
||||
def get_custom_load_fn(model, load_dir, lora_load_dir=None):
|
||||
custom_load_fn = None
|
||||
|
||||
if isinstance(model, PipelineEngine):
|
||||
return custom_load_fn, load_dir
|
||||
|
||||
if is_enable_lora():
|
||||
if lora_load_dir:
|
||||
custom_load_fn = get_lora_load_fn_with_deepspeed(model=model, base_model_load_dir=load_dir)
|
||||
load_dir = lora_load_dir
|
||||
else:
|
||||
custom_load_fn = lora_custom_load_fn_for_deepspeed
|
||||
return custom_load_fn, load_dir
|
||||
|
||||
|
||||
def load_optimizer_from_state_dict(optimizer, lr_scheduler, state_dict, checkpoint_name):
|
||||
args = get_args()
|
||||
|
||||
try:
|
||||
if optimizer is not None:
|
||||
optimizer.load_state_dict(state_dict['optimizer'])
|
||||
if lr_scheduler is not None and not args.no_load_lr_state:
|
||||
lr_scheduler.load_state_dict(state_dict['lr_scheduler'])
|
||||
except KeyError:
|
||||
print_rank_0('Unable to load optimizer from checkpoint {}. '
|
||||
'Specify --no-load-optim or --finetune to prevent '
|
||||
'attempting to load the optimizer state, '
|
||||
'exiting ...'.format(checkpoint_name))
|
||||
sys.exit()
|
||||
|
||||
|
||||
def load_iteration_from_state_dict(state_dict, checkpoint_name):
|
||||
args = get_args()
|
||||
|
||||
try:
|
||||
iteration = state_dict['iteration']
|
||||
if 'tokens' in state_dict:
|
||||
args.consumed_train_tokens = state_dict['tokens']
|
||||
except KeyError:
|
||||
try: # Backward compatible with older checkpoints
|
||||
iteration = state_dict['total_iters']
|
||||
except KeyError:
|
||||
print_rank_0('A metadata file exists but unable to load '
|
||||
'iteration from checkpoint {}, exiting'.format(
|
||||
checkpoint_name))
|
||||
sys.exit()
|
||||
return iteration
|
||||
|
||||
|
||||
def load_state_dict_from_checkpoint_with_megatron(checkpoint_name, model_checkpoint_name=None):
|
||||
state_dict = torch.load(checkpoint_name, map_location='cpu')
|
||||
if model_checkpoint_name:
|
||||
model_state_dict = torch.load(model_checkpoint_name, map_location='cpu')
|
||||
state_dict = update_model_state_dict_with_megatron(model_state_dict=model_state_dict,
|
||||
state_dict=state_dict)
|
||||
state_dict = handle_lora_modules_to_save_key_with_megatron(state_dict)
|
||||
return state_dict
|
||||
|
||||
|
||||
def load_biencoder_checkpoint(model, only_query_model=False,
|
||||
only_context_model=False, custom_load_path=None):
|
||||
"""
|
||||
selectively load retrieval models for indexing/retrieving
|
||||
from saved checkpoints
|
||||
"""
|
||||
|
||||
args = get_args()
|
||||
|
||||
model = utils.unwrap_model(model)
|
||||
|
||||
load_path = custom_load_path if custom_load_path is not None else args.load
|
||||
|
||||
tracker_filename = get_checkpoint_tracker_filename(load_path)
|
||||
with open(tracker_filename, 'r') as f:
|
||||
iteration = int(f.read().strip())
|
||||
|
||||
checkpoint_name = get_checkpoint_name(load_path, iteration, False)
|
||||
if parallel_state.get_data_parallel_rank() == 0:
|
||||
print('global rank {} is loading checkpoint {}'.format(
|
||||
torch.distributed.get_rank(), checkpoint_name))
|
||||
|
||||
state_dict = torch.load(checkpoint_name, map_location='cpu')
|
||||
ret_state_dict = state_dict['model']
|
||||
|
||||
if only_query_model:
|
||||
ret_state_dict.pop('context_model')
|
||||
if only_context_model:
|
||||
ret_state_dict.pop('query_model')
|
||||
|
||||
check_equal(len(model), 1)
|
||||
model[0].load_state_dict(ret_state_dict)
|
||||
torch.distributed.barrier()
|
||||
|
||||
if parallel_state.get_data_parallel_rank() == 0:
|
||||
print(' successfully loaded {}'.format(checkpoint_name))
|
||||
|
||||
return model
|
|
@ -1,10 +0,0 @@
|
|||
import modellink.core.parallel_state
|
||||
import modellink.core.utils
|
||||
|
||||
from .inference_params import InferenceParams
|
||||
from .model_parallel_config import ModelParallelConfig
|
||||
|
||||
# Alias parallel_state as mpu, its legacy name
|
||||
mpu = parallel_state
|
||||
|
||||
__all__ = ["parallel_state", "utils", "InferenceParams", "ModelParallelConfig"]
|
|
@ -1,33 +0,0 @@
|
|||
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
|
||||
|
||||
|
||||
import enum
|
||||
|
||||
|
||||
class ModelType(enum.Enum):
|
||||
encoder_or_decoder = 1
|
||||
encoder_and_decoder = 2
|
||||
retro_encoder = 3
|
||||
retro_decoder = 4
|
||||
|
||||
|
||||
class LayerType(enum.Enum):
|
||||
encoder = 1
|
||||
decoder = 2
|
||||
retro_encoder = 3
|
||||
retro_decoder = 4
|
||||
retro_decoder_with_retriever = 5
|
||||
|
||||
|
||||
class AttnType(enum.Enum):
|
||||
self_attn = 1
|
||||
cross_attn = 2
|
||||
|
||||
|
||||
class AttnMaskType(enum.Enum):
|
||||
padding = 1
|
||||
causal = 2 # Overrides `attention_mask` to be a lower triangular matrix
|
||||
prefix = 3
|
||||
# Forces one to pass an `attention_mask` that's 1 if we need to mask.
|
||||
# Tensor that can be broadcast to [micro_batch_size, n_head, seq_length, seq_length]
|
||||
custom = 4
|
|
@ -1,29 +0,0 @@
|
|||
from modellink.error_utils import check_equal
|
||||
|
||||
|
||||
class InferenceParams:
|
||||
"""Inference parameters that are passed to the main model in order
|
||||
to efficiently calculate and store the context during inference."""
|
||||
|
||||
def __init__(self, max_batch_size, max_sequence_length):
|
||||
self.max_sequence_length = max_sequence_length
|
||||
self.max_batch_size = max_batch_size
|
||||
self.sequence_len_offset = 0
|
||||
self.batch_size_offset = 0
|
||||
self.key_value_memory_dict = {}
|
||||
|
||||
def swap_key_value_dict(self, batch_idx):
|
||||
"""swap between batches"""
|
||||
if len(self.key_value_memory_dict) == 0:
|
||||
raise ValueError("Should not swap when dict in empty.")
|
||||
|
||||
for layer_number in self.key_value_memory_dict.keys():
|
||||
inference_key_memory, inference_value_memory = self.key_value_memory_dict[layer_number]
|
||||
check_equal(len(batch_idx), inference_key_memory.shape[1])
|
||||
# make sure batch size is the same
|
||||
new_inference_key_memory = inference_key_memory[:, batch_idx]
|
||||
new_inference_value_memory = inference_value_memory[:, batch_idx]
|
||||
self.key_value_memory_dict[layer_number] = (
|
||||
new_inference_key_memory,
|
||||
new_inference_value_memory,
|
||||
)
|
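# --- Hedged usage sketch (not part of the original file) ---
# Assumes a toy per-layer key/value cache of shape [max_seq_len, batch, head_dim];
# swap_key_value_dict(batch_idx) reorders the cached keys/values along the batch axis.
import torch

params = InferenceParams(max_batch_size=4, max_sequence_length=16)
params.key_value_memory_dict[0] = (
    torch.zeros(16, 4, 8),  # key cache for layer 0
    torch.zeros(16, 4, 8),  # value cache for layer 0
)
params.swap_key_value_dict(batch_idx=[3, 2, 1, 0])  # reverse the batch order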
|
@ -1,231 +0,0 @@
|
|||
import time
|
||||
|
||||
import torch
|
||||
import torch.nn
|
||||
import torch_npu
|
||||
|
||||
from modellink import print_rank_0
|
||||
from modellink.arguments import parse_args
|
||||
from modellink.core import parallel_state
|
||||
from .autorecompute_apply import hook_checkpoint_forward as checkpoint_forward
|
||||
from .autorecompute_apply import register_recursive_apply as apply_autorecompute
|
||||
from .autorecompute_solver import solve_graph
|
||||
|
||||
|
||||
# design of auto-recompute
|
||||
|
||||
# Workflow
|
||||
# step 1: profile the computation process to obtain the computation time of each ``module``, with all modules recomputed
|
||||
# step 2: get the graph and solve the recompute plan
|
||||
# step 3: apply the recompute plan
|
||||
|
||||
# Detailed workflow
|
||||
# step 1: intercept the model's forward and the optimizer step, and register PyTorch forward hooks to collect profiling information
|
||||
|
||||
# information needed:
|
||||
# 1) forward time of all modules,
|
||||
# 2) memory consumption of checkpointed memory and all memory;
|
||||
# 3) static memory size (after the first step)
|
||||
# 4) number of layers (can be derived from total mem / per-layer mem)
|
||||
|
||||
# step 2: average the profiled values across the graph
|
||||
# step3: solve
|
||||
# step4: apply
|
||||
# 1) hook each module, change the hook forward of the recomputed module to use mpu.checkpoint
|
||||
# 2) remove the forward hook to remove profiling function
|
||||
|
||||
|
||||
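# --- Hedged sketch (not part of the original file): how the profiling hooks above work ---
# A toy profiler that records per-module forward time with standard PyTorch hooks,
# mirroring steps 1-2 of the workflow; the real AutoRecompute additionally records
# NPU memory via torch.npu. All names below are illustrative only.
import time
import torch
import torch.nn as nn

def attach_timing_hooks(module, stats):
    def pre_hook(mod, args):
        stats.setdefault(id(mod), {})['start'] = time.time()

    def post_hook(mod, args, output):
        entry = stats[id(mod)]
        entry['forward_ms'] = (time.time() - entry['start']) * 1000

    handles = [module.register_forward_pre_hook(pre_hook),
               module.register_forward_hook(post_hook)]
    return handles  # call handle.remove() once profiling ends, as step_hook does below

stats = {}
layer = nn.Linear(8, 8)
handles = attach_timing_hooks(layer, stats)
layer(torch.randn(2, 8))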
class AutoRecompute:
|
||||
auto_recomputing = None
|
||||
|
||||
def __init__(self):
|
||||
# layer profiling info
|
||||
self.context = {
|
||||
'module': []
|
||||
}
|
||||
# save the original (wrapped) modules
|
||||
self.checkpointed_modules = []
|
||||
# save module hooks; removed after the policy is applied
|
||||
self.modules_hooks = []
|
||||
# current profiling step
|
||||
self.profiling_step = 0
|
||||
# step at which profiling stops (default: 10)
|
||||
self.stop_profiling_step = 10
|
||||
# minimum step at which profiling may stop
|
||||
self.min_profiling_step = 5
|
||||
# step at which the recompute graph is solved; must come after the profiling stop step
|
||||
self.solve_graph_at_step = 11
|
||||
# unit for device memory size (MB)
|
||||
self.unit_mb = 1024 * 1024
|
||||
|
||||
@staticmethod
|
||||
def get_memory_status():
|
||||
used_memory = torch.npu.memory_allocated()
|
||||
reserved_memory = torch.npu.memory_reserved()
|
||||
return used_memory, reserved_memory
|
||||
|
||||
def _cal_tensor_size(self, tensor):
|
||||
try:
|
||||
return tensor.numel() * tensor.element_size() / self.unit_mb
|
||||
except ZeroDivisionError:
|
||||
return 0
|
||||
|
||||
def pre_hook_func(self, state, sync: bool, *args, **kargs):
|
||||
if sync:
|
||||
torch.npu.synchronize()
|
||||
used_memory, _ = self.get_memory_status()
|
||||
torch.npu.reset_max_memory_allocated()
|
||||
state['memory'] = used_memory
|
||||
state['time'] = time.time()
|
||||
size = 0
|
||||
for arg in args:
|
||||
if isinstance(arg, torch.Tensor):
|
||||
size += self._cal_tensor_size(arg)
|
||||
elif isinstance(arg, tuple) or isinstance(arg, list):
|
||||
for t in arg:
|
||||
if isinstance(t, torch.Tensor):
|
||||
size += self._cal_tensor_size(t)
|
||||
state['input'] = size
|
||||
|
||||
def post_hook_func(self, state, sync: bool, *args, **kargs):
|
||||
if sync:
|
||||
torch.npu.synchronize()
|
||||
used_memory, _ = self.get_memory_status()
|
||||
max_mem = torch.npu.max_memory_allocated()
|
||||
state['peak_memory'] = max_mem - state['memory']
|
||||
state['memory'] = (used_memory - state['memory']) // self.unit_mb
|
||||
if 'pre_total_time' in state:
|
||||
state['forward_cnt'] += 1
|
||||
state['time'] = (time.time() - state['time']) * 1000
|
||||
state['pre_total_time'] += state['time']
|
||||
try:
|
||||
state['time'] = state['pre_total_time'] / state['forward_cnt']
|
||||
except ZeroDivisionError:
|
||||
state['time'] = 0
|
||||
else:
|
||||
state['forward_cnt'] = 0
|
||||
state['time'] = (time.time() - state['time']) * 1000
|
||||
state['pre_total_time'] = 0
|
||||
|
||||
def forward_pre_hook(self, name, parent_ctx, ctx):
|
||||
if self.profiling_step < self.stop_profiling_step:
|
||||
ctx['name'] = name
|
||||
if 'layers' in parent_ctx:
|
||||
parent_ctx['layers'].append(ctx)
|
||||
|
||||
def hook(module, *args, **kargs):
|
||||
if self.profiling_step < self.stop_profiling_step:
|
||||
if 'module' in self.context:
|
||||
self.context['module'].append(ctx)
|
||||
self.pre_hook_func(ctx, True, *args, **kargs)
|
||||
|
||||
return hook
|
||||
|
||||
def forward_post_hook(self, ctx):
|
||||
def hook(module, *args, **kargs):
|
||||
if self.profiling_step < self.stop_profiling_step:
|
||||
self.post_hook_func(ctx, True, *args)
|
||||
if 'module' in self.context:
|
||||
self.context['module'].pop()
|
||||
|
||||
return hook
|
||||
|
||||
def register_recursive_hook(self, prefix_name, model, ctx):
|
||||
for name, module in model.named_children():
|
||||
if str.isdigit(name) and name != "0":
|
||||
# transformer layer
|
||||
module.no_checkpoint_forward = module.forward
|
||||
module.forward = checkpoint_forward(module.forward)
|
||||
self.checkpointed_modules.append(module)
|
||||
|
||||
if 'layers' not in ctx:
|
||||
ctx['layers'] = []
|
||||
current_ctx = {}
|
||||
|
||||
next_name = prefix_name + "." + name if prefix_name != "" else name
|
||||
pre_hook = module.register_forward_pre_hook(self.forward_pre_hook(name, ctx, current_ctx))
|
||||
post_hook = module.register_forward_hook(self.forward_post_hook(current_ctx))
|
||||
self.modules_hooks.append(pre_hook)
|
||||
self.modules_hooks.append(post_hook)
|
||||
self.register_recursive_hook(next_name, module, current_ctx)
|
||||
|
||||
def step_hook(self, model):
|
||||
self.profiling_step += 1
|
||||
if self.profiling_step == self.solve_graph_at_step:
|
||||
print_rank_0("AUTO-RECOMPUTE: solving recompute policy")
|
||||
print_rank_0("==================== AUTO-RECOMPUTE Report ====================")
|
||||
all_args = parse_args()
|
||||
solve_graph(self.context, parallel_state.get_pipeline_model_parallel_world_size(),
|
||||
all_args.auto_recompute_device_size)
|
||||
print_rank_0("==================== AUTO-RECOMPUTE Report End ====================")
|
||||
for m in self.checkpointed_modules:
|
||||
m.forward = m.no_checkpoint_forward
|
||||
self.checkpointed_modules.clear()
|
||||
print_rank_0("AUTO-RECOMPUTE: applying policy to the model")
|
||||
apply_autorecompute("module", model, self.context)
|
||||
print_rank_0("AUTO-RECOMPUTE: applying policy to the model fin")
|
||||
for hook_handle in self.modules_hooks:
|
||||
hook_handle.remove()
|
||||
self.modules_hooks.clear()
|
||||
|
||||
def hook_step_func(self, step_func, models):
|
||||
def custom_step_func(*args, **kargs):
|
||||
result = step_func(*args, **kargs)
|
||||
if self.profiling_step < self.stop_profiling_step:
|
||||
used_memory, reserved_memory = self.get_memory_status()
|
||||
self.context['used_mem'] = used_memory // self.unit_mb
|
||||
if isinstance(models, list):
|
||||
for model in models:
|
||||
self.step_hook(model)
|
||||
else:
|
||||
self.step_hook(models)
|
||||
return result
|
||||
|
||||
return custom_step_func
|
||||
|
||||
def set_profiling_step(self, step):
|
||||
self.stop_profiling_step = step
|
||||
self.solve_graph_at_step = step + 1
|
||||
|
||||
def is_enabled_auto_recompute(self):
|
||||
all_args = parse_args()
|
||||
if all_args.auto_recompute_device_size <= 0:
|
||||
return False
|
||||
if all_args.checkpoint_activations:
|
||||
print_rank_0("[ERROR] failed to start auto selective recompute train, please remove param: "
|
||||
"\"checkpoint-activations\".")
|
||||
return False
|
||||
if all_args.auto_recompute_profiling_step < 5:
|
||||
print_rank_0("[ERROR] failed to start auto selective recompute train, please set param >=5 or remove it: "
|
||||
"\"auto-recompute-profiling-step\".")
|
||||
return False
|
||||
|
||||
self.set_profiling_step(all_args.auto_recompute_profiling_step)
|
||||
print_rank_0(
|
||||
"success to stat auto recompute train: auto-recompute-device-size={}, auto-recompute-profiling-step={}".format(
|
||||
all_args.auto_recompute_device_size, all_args.auto_recompute_profiling_step))
|
||||
return True
|
||||
|
||||
|
||||
def get_auto_recomputing():
|
||||
if AutoRecompute.auto_recomputing is None:
|
||||
AutoRecompute.auto_recomputing = AutoRecompute()
|
||||
return AutoRecompute.auto_recomputing
|
||||
|
||||
|
||||
def autorecompute_profile(setup_model_and_optimizer_func):
|
||||
def get_model_hook_func(*args, **kargs):
|
||||
models, optimizer, lr_scheduler = setup_model_and_optimizer_func(*args, **kargs)
|
||||
recomputing = get_auto_recomputing()
|
||||
if not recomputing.is_enabled_auto_recompute():
|
||||
return models, optimizer, lr_scheduler
|
||||
optimizer.step = recomputing.hook_step_func(optimizer.step, models)
|
||||
if isinstance(models, list):
|
||||
for model in models:
|
||||
recomputing.register_recursive_hook("module", model, recomputing.context)
|
||||
else:
|
||||
recomputing.register_recursive_hook("module", models, recomputing.context)
|
||||
print_rank_0("AUTO-RECOMPUTE: successfully hooking module")
|
||||
return models, optimizer, lr_scheduler
|
||||
|
||||
return get_model_hook_func
|
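# --- Hedged sketch (not part of the original file): the wrapping pattern used above ---
# A standalone illustration of the same decorator shape: a wrapper intercepts the setup
# function's return value and patches optimizer.step, as autorecompute_profile does via
# hook_step_func (all names below are illustrative, not modellink API).
import torch

def profile_setup(setup_func):
    def wrapper(*args, **kwargs):
        model, optimizer = setup_func(*args, **kwargs)
        original_step = optimizer.step

        def counted_step(*s_args, **s_kwargs):
            counted_step.calls += 1          # stand-in for AutoRecompute.step_hook
            return original_step(*s_args, **s_kwargs)

        counted_step.calls = 0
        optimizer.step = counted_step
        return model, optimizer
    return wrapper

@profile_setup
def setup():
    model = torch.nn.Linear(4, 4)
    return model, torch.optim.SGD(model.parameters(), lr=0.1)

model, optimizer = setup()
model(torch.randn(1, 4)).sum().backward()
optimizer.step()
assert optimizer.step.calls == 1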
|
@ -1,22 +0,0 @@
|
|||
from modellink.core import tensor_parallel
|
||||
|
||||
|
||||
def hook_checkpoint_forward(forward_func):
|
||||
def custom_forward(*args, **kargs):
|
||||
def inside_forward(*args):
|
||||
return forward_func(*args, **kargs)
|
||||
|
||||
return tensor_parallel.checkpoint(inside_forward, None, *args)
|
||||
|
||||
return custom_forward
|
||||
|
||||
|
||||
def register_recursive_apply(layer_name, model, ctx):
|
||||
idx = 0
|
||||
if 'recompute' in ctx and ctx['recompute']:
|
||||
model.forward = hook_checkpoint_forward(model.forward)
|
||||
else:
|
||||
for name, module in model.named_children():
|
||||
next_name = layer_name + "." + name if layer_name != "" else name
|
||||
register_recursive_apply(next_name, module, ctx['layers'][idx])
|
||||
idx += 1
|
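# --- Hedged sketch (not part of the original file) ---
# The same wrapping idea shown with torch.utils.checkpoint instead of modellink's
# tensor_parallel.checkpoint (assumes a recent PyTorch that accepts use_reentrant):
# the wrapped forward recomputes activations during backward instead of storing them.
import torch
from torch.utils.checkpoint import checkpoint

def checkpointed(forward_func):
    def custom_forward(*args, **kwargs):
        def inner(*inner_args):
            return forward_func(*inner_args, **kwargs)
        return checkpoint(inner, *args, use_reentrant=False)
    return custom_forward

layer = torch.nn.Linear(8, 8)
layer.forward = checkpointed(layer.forward)
out = layer(torch.randn(2, 8, requires_grad=True)).sum()
out.backward()  # activations inside the wrapped forward are recomputed here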
|
@ -1,403 +0,0 @@
|
|||
import sys
|
||||
|
||||
import networkx as nx
|
||||
import torch
|
||||
import torch_npu
|
||||
|
||||
from modellink import print_rank_0
|
||||
from modellink.core import parallel_state
|
||||
|
||||
|
||||
class GraphSolver:
|
||||
def __init__(self):
|
||||
self.total_recompute_cost = 0
|
||||
self.total_forward_cost = 0
|
||||
self.layers_module = None
|
||||
self.transformer_module = None
|
||||
self.recompute_policy = {}
|
||||
self.layers_combination = []
|
||||
self.layer_full_recompute_combination = None
|
||||
self.layer_without_recompute_combination = None
|
||||
self.layer_recompute_one_combination = None
|
||||
|
||||
@staticmethod
|
||||
def get_recompute_op(graph):
|
||||
recompute_nodes = []
|
||||
for node in graph.nodes:
|
||||
if graph.nodes[node]['recompute']:
|
||||
recompute_nodes.append(graph.nodes[node]['name'])
|
||||
return recompute_nodes
|
||||
|
||||
@staticmethod
|
||||
def dg_init(no_recompute_layer):
|
||||
dg = nx.DiGraph()
|
||||
dg.add_nodes_from([
|
||||
(i, {"name": no_recompute_layer[i]['name'],
|
||||
"mem": no_recompute_layer[i]['memory'],
|
||||
"input": no_recompute_layer[i]['input'],
|
||||
"compute": no_recompute_layer[i]['time'],
|
||||
"recompute": False,
|
||||
"status": "no_status"})
|
||||
for i in range(len(no_recompute_layer))
|
||||
])
|
||||
dg.add_edges_from([
|
||||
(i, i + 1) for i in range(len(no_recompute_layer) - 1)
|
||||
])
|
||||
return dg
|
||||
|
||||
@staticmethod
|
||||
def broadcast_recompute_policy_in_mp(recompute_policy_list):
|
||||
recompute_policy_tensor = torch.tensor(recompute_policy_list,
|
||||
device=torch.device(f'npu:{torch.distributed.get_rank() % 8}'))
|
||||
torch.distributed.broadcast(recompute_policy_tensor, src=parallel_state.get_tensor_model_parallel_src_rank(),
|
||||
group=parallel_state.get_tensor_model_parallel_group())
|
||||
return recompute_policy_tensor.cpu().numpy().tolist()
|
||||
|
||||
def set_recompute_info_to_module(self, module, recompute_nodes, recompute):
|
||||
if not recompute:
|
||||
module["recompute"] = False
|
||||
for layer in module["layers"]:
|
||||
layer["recompute"] = False
|
||||
return
|
||||
if len(recompute_nodes) == 0:
|
||||
module["recompute"] = True
|
||||
return
|
||||
sub_modules = module["layers"]
|
||||
recompute_nodes_length = len(recompute_nodes)
|
||||
for i in range(recompute_nodes_length):
|
||||
if recompute_nodes[i] == self.layer_recompute_one_combination.broadcast_value:
|
||||
sub_modules[i]["recompute"] = True
|
||||
continue
|
||||
sub_modules[i]["recompute"] = False
|
||||
|
||||
def apply_policy_to_model(self, recompute_policy_list):
|
||||
full_layers = self.layers_module["layers"]
|
||||
if len(recompute_policy_list) == 0:
|
||||
return
|
||||
idx = 0
|
||||
for policy in recompute_policy_list:
|
||||
n = policy[0]
|
||||
recompute = False
|
||||
recompute_nodes = []
|
||||
if policy[1] != self.layer_without_recompute_combination.broadcast_value:
|
||||
recompute = True
|
||||
if policy[1] == self.layer_recompute_one_combination.broadcast_value:
|
||||
recompute_nodes = policy[2:]
|
||||
for i in range(idx, idx + n):
|
||||
self.set_recompute_info_to_module(full_layers[i], recompute_nodes, recompute)
|
||||
idx += n
|
||||
|
||||
# minimize the number of memory, results in all recompute
|
||||
def calculate_cost_mem(self, g: nx.DiGraph, idx):
|
||||
subtotal_cost = 0
|
||||
subtotal_compute_cost = 0
|
||||
cost = (g.nodes[idx]['mem'] if not g.nodes[idx]['recompute'] else g.nodes[idx]['input'])
|
||||
compute_cost = (g.nodes[idx]['compute'] if g.nodes[idx]['recompute'] else 0)
|
||||
|
||||
successors = g.successors(idx)
|
||||
successor_cnt = 0
|
||||
for successor in successors:
|
||||
a, b = self.calculate_cost_mem(g, successor)
|
||||
subtotal_cost += a
|
||||
subtotal_compute_cost += b
|
||||
successor_cnt += 1
|
||||
|
||||
return subtotal_cost + cost, subtotal_compute_cost + compute_cost
|
||||
|
||||
# compute the peak memory size for a given recompute graph
|
||||
def calculate_cost_peek(self, g: nx.DiGraph, idx, recompute_mem, chp_mem):
|
||||
recompute = g.nodes[idx]['recompute']
|
||||
op_mem = g.nodes[idx]['mem']
|
||||
op_input = g.nodes[idx]['input']
|
||||
|
||||
if recompute:
|
||||
recompute_mem += op_mem
|
||||
chp_mem = chp_mem + op_input
|
||||
else:
|
||||
recompute_mem = 0
|
||||
chp_mem += op_mem + op_input
|
||||
|
||||
successors = g.successors(idx)
|
||||
successor_cnt = 0
|
||||
|
||||
cur_max_mem = chp_mem + recompute_mem
|
||||
global_max_mem = cur_max_mem
|
||||
for successor in successors:
|
||||
# if another subpath has not been calculated, we will need to keep the stash
|
||||
c = self.calculate_cost_peek(g, successor, recompute_mem, chp_mem)
|
||||
# if another subpath has been calculated, shall we keep the output?
|
||||
if c > global_max_mem:
|
||||
global_max_mem = c
|
||||
successor_cnt += 1
|
||||
return global_max_mem
|
||||
|
||||
def cal_non_transformer_memory(self, model):
|
||||
# total memory used
|
||||
model_memory = model['layers'][0]['memory']
|
||||
transformer_layer_memory = self.transformer_module['memory']
|
||||
non_size = model_memory - transformer_layer_memory
|
||||
print_rank_0(f"non size {model_memory} {non_size}")
|
||||
return non_size
|
||||
|
||||
def layers_combination_init(self, g, idx, config):
|
||||
if idx == 0:
|
||||
self.layer_full_recompute_combination = LayerCombination({
|
||||
"name": "full_recompute",
|
||||
"num": config["nlayer"],
|
||||
"memory": config["chp_input"],
|
||||
"cost": config["chp_time"],
|
||||
"broadcast_value": 0,
|
||||
"policy_name": "n_full"
|
||||
})
|
||||
self.layers_combination.append(self.layer_full_recompute_combination)
|
||||
self.layer_without_recompute_combination = LayerCombination({
|
||||
"name": "without_recompute",
|
||||
"num": config["nlayer"],
|
||||
"memory": config["full_activation"],
|
||||
"cost": 0,
|
||||
"broadcast_value": 2,
|
||||
"policy_name": "n_without"
|
||||
})
|
||||
self.layers_combination.append(self.layer_without_recompute_combination)
|
||||
if idx >= len(config['layers']):
|
||||
recompute_nodes = self.get_recompute_op(g)
|
||||
if len(recompute_nodes) == len(config['layers']) or len(recompute_nodes) == 0:
|
||||
return
|
||||
stash_mem_per_layer, recompute_cost = self.calculate_cost_mem(g, 0)
|
||||
self.layer_recompute_one_combination = LayerCombination({
|
||||
"name": ",".join(recompute_nodes),
|
||||
"num": config["nlayer"],
|
||||
"memory": stash_mem_per_layer,
|
||||
"cost": recompute_cost,
|
||||
"broadcast_value": 1,
|
||||
"policy_name": "n_selective"
|
||||
})
|
||||
self.layers_combination.append(self.layer_recompute_one_combination)
|
||||
return
|
||||
g.nodes[idx]['recompute'] = False
|
||||
self.layers_combination_init(g, idx + 1, config)
|
||||
g.nodes[idx]['recompute'] = True
|
||||
self.layers_combination_init(g, idx + 1, config)
|
||||
|
||||
def get_max_goods_value(self, idx, ans, config):
|
||||
i, j, k = idx[0], idx[1], idx[2]
|
||||
pre_step_ans = ans[i - 1][j - k]
|
||||
if k == 0:
|
||||
return pre_step_ans
|
||||
|
||||
goods_value = ans[i][j]
|
||||
memory = pre_step_ans.memory + k * self.layers_combination[i].memory
|
||||
cost = pre_step_ans.cost + k * self.layers_combination[i].cost
|
||||
if pre_step_ans.cost == float('inf'):
|
||||
cost = k * self.layers_combination[i].cost
|
||||
try:
|
||||
device_memory = max(config["device_memory"] - config["static_memory_layer"], 0) / config["pp"]
|
||||
except ZeroDivisionError:
|
||||
device_memory = max(config["device_memory"] - config["static_memory_layer"], 0)
|
||||
print_rank_0("[ERROR] pipeline model parallel world size is 0. ")
|
||||
|
||||
if device_memory >= memory and cost <= goods_value.cost:
|
||||
goods_value.memory = memory
|
||||
goods_value.cost = cost
|
||||
goods_value.layer_names.clear()
|
||||
if len(pre_step_ans.layer_names) > 0:
|
||||
goods_value.layer_names.extend(pre_step_ans.layer_names)
|
||||
goods_value.layer_names.extend(self.layers_combination[i].name for _ in range(k))
|
||||
|
||||
return goods_value
|
||||
|
||||
def print_recompute_policy(self, memory, cost, config):
|
||||
fmt_str = "With selective recompute:\n"
|
||||
for k, v in self.recompute_policy.items():
|
||||
if k == self.layer_full_recompute_combination.name:
|
||||
policy_name = self.layer_full_recompute_combination.policy_name
|
||||
elif k == self.layer_without_recompute_combination.name:
|
||||
policy_name = self.layer_without_recompute_combination.policy_name
|
||||
else:
|
||||
policy_name = self.layer_recompute_one_combination.policy_name
|
||||
fmt_str += "recomputeNodes=[{}], ".format(k)
|
||||
fmt_str += "{} {}; ".format(v, policy_name)
|
||||
all_recompute_cost = len(self.layers_module["layers"]) * self.layer_full_recompute_combination.cost
|
||||
try:
|
||||
performance = (all_recompute_cost - cost) / (all_recompute_cost * 4)
|
||||
except ZeroDivisionError:
|
||||
performance = 0
|
||||
print_rank_0("[ERROR] all recompute cost is 0. ")
|
||||
fmt_str += "\ntotal mem cost: {:.1f} GiB + {:.1f} GiB, speed up compared with all recompute {:.2%}".format(
|
||||
config["static_memory_layer"] / 1024, memory * config["pp"] / 1024, performance)
|
||||
print_rank_0(fmt_str)
|
||||
|
||||
def get_all_layer_policy(self, combination_num, layer_num, ans, config):
|
||||
layer_nodes = [self.layer_full_recompute_combination.name for _ in range(layer_num)]
|
||||
memory = layer_num * self.layer_full_recompute_combination.memory
|
||||
cost = layer_num * self.layer_full_recompute_combination.cost
|
||||
for i in range(layer_num, 0, -1):
|
||||
size = layer_num - len(ans[combination_num][i].layer_names)
|
||||
if size != layer_num:
|
||||
l_nodes = []
|
||||
l_nodes.extend(ans[combination_num][i].layer_names)
|
||||
# if policies are not found for all layers, the remaining layers use the full recompute policy.
|
||||
l_nodes.extend(self.layer_full_recompute_combination.name for _ in range(size))
|
||||
l_memory = ans[combination_num][i].memory + size * self.layer_full_recompute_combination.memory
|
||||
l_cost = ans[combination_num][i].cost + size * self.layer_full_recompute_combination.cost
|
||||
if l_cost < cost:
|
||||
cost = l_cost
|
||||
memory = l_memory
|
||||
layer_nodes.clear()
|
||||
layer_nodes.extend(l_nodes)
|
||||
|
||||
for nodes in layer_nodes:
|
||||
if nodes not in self.recompute_policy.keys():
|
||||
self.recompute_policy.update({nodes: 1})
|
||||
continue
|
||||
self.recompute_policy.update({nodes: self.recompute_policy[nodes] + 1})
|
||||
|
||||
self.print_recompute_policy(memory, cost, config)
|
||||
|
||||
def knapsack_best(self, config):
|
||||
combination_num = len(self.layers_combination)
|
||||
layer_num = len(self.layers_module["layers"])
|
||||
# make the combination index begin at 1.
|
||||
self.layers_combination.insert(0, None)
|
||||
# init ans
|
||||
ans = [[GoodsValue() for _ in range(layer_num + 1)] for _ in range(combination_num + 1)]
|
||||
# find max goods value
|
||||
for i in range(1, combination_num + 1):
|
||||
for j in range(layer_num + 1):
|
||||
k = 0
|
||||
while k <= self.layers_combination[i].num and k <= j:
|
||||
ans[i][j] = self.get_max_goods_value([i, j, k], ans, config)
|
||||
k += 1
|
||||
self.get_all_layer_policy(combination_num, layer_num, ans, config)
|
||||
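# --- Hedged sketch (not part of the original file): the bounded-knapsack recurrence above ---
# A simplified standalone version with illustrative numbers: choose how many layers get
# each (memory, cost) combination so that memory stays under the budget while the total
# recompute cost is minimized; the real solver additionally falls back to full recompute
# for layers left unassigned.
def pick_combinations(combos, layer_num, mem_budget):
    # combos: list of (memory_per_layer, cost_per_layer, max_count)
    INF = float('inf')
    best = [(0.0, INF)] * (layer_num + 1)   # best[j] = (memory, cost) with j layers assigned
    best[0] = (0.0, 0.0)
    for mem, cost, max_count in combos:
        new_best = list(best)
        for j in range(layer_num + 1):
            for k in range(1, min(max_count, j) + 1):
                prev_mem, prev_cost = best[j - k]
                total_mem, total_cost = prev_mem + k * mem, prev_cost + k * cost
                if prev_cost < INF and total_mem <= mem_budget and total_cost < new_best[j][1]:
                    new_best[j] = (total_mem, total_cost)
        best = new_best
    return best[layer_num]

# e.g. full recompute (2 MB, 10 ms), no recompute (20 MB, 0 ms), selective (8 MB, 3 ms)
print(pick_combinations([(2, 10, 4), (20, 0, 4), (8, 3, 4)], layer_num=4, mem_budget=40))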
|
||||
def analyse_policy_to_list(self):
|
||||
recompute_policy_list = []
|
||||
full_module_layers = self.layers_module["layers"][0]["layers"]
|
||||
module_layers_num = len(full_module_layers)
|
||||
for nodes_name, v in self.recompute_policy.items():
|
||||
nodes_count = [v]
|
||||
if nodes_name == self.layer_without_recompute_combination.name:
|
||||
broadcast_value = self.layer_without_recompute_combination.broadcast_value
|
||||
nodes_count.extend(broadcast_value for _ in range(module_layers_num + 1))
|
||||
elif nodes_name == self.layer_full_recompute_combination.name:
|
||||
broadcast_value = self.layer_full_recompute_combination.broadcast_value
|
||||
nodes_count.extend(broadcast_value for _ in range(module_layers_num + 1))
|
||||
else:
|
||||
nodes_count.append(self.layer_recompute_one_combination.broadcast_value)
|
||||
recompute_nodes = nodes_name.split(",")
|
||||
for layer in full_module_layers:
|
||||
if layer["name"] in recompute_nodes:
|
||||
nodes_count.append(self.layer_recompute_one_combination.broadcast_value)
|
||||
continue
|
||||
nodes_count.append(self.layer_without_recompute_combination.broadcast_value)
|
||||
recompute_policy_list.append(nodes_count)
|
||||
return recompute_policy_list
|
||||
|
||||
def print_list_to_policy(self, recompute_policy_list):
|
||||
layer_names = self.layers_module["layers"][0]["layers"]
|
||||
module_layers_num = len(layer_names)
|
||||
if len(recompute_policy_list) == 0:
|
||||
return
|
||||
fmt_str = ">> final selective strategy <<\n"
|
||||
for policy in recompute_policy_list:
|
||||
n = policy[0]
|
||||
if policy[1] == self.layer_without_recompute_combination.broadcast_value:
|
||||
policy_name = self.layer_without_recompute_combination.policy_name
|
||||
elif policy[1] == self.layer_full_recompute_combination.broadcast_value:
|
||||
policy_name = self.layer_full_recompute_combination.policy_name
|
||||
else:
|
||||
policy_name = self.layer_recompute_one_combination.policy_name
|
||||
policy = policy[2:]
|
||||
nodes = []
|
||||
for i in range(module_layers_num):
|
||||
if policy[i] == self.layer_recompute_one_combination.broadcast_value:
|
||||
nodes.append(layer_names[i]["name"])
|
||||
fmt_str += "recomputeNodes=[{}], ".format(",".join(nodes))
|
||||
fmt_str += "{} {}\n".format(n, policy_name)
|
||||
print_rank_0(fmt_str)
|
||||
|
||||
def get_layers_module(self, model):
|
||||
if "name" in model and model["name"] == "layers":
|
||||
self.layers_module = model
|
||||
return True
|
||||
if "layers" not in model:
|
||||
return False
|
||||
has_transformer_layer = False
|
||||
for sub_model in model["layers"]:
|
||||
has_transformer_layer = (has_transformer_layer or self.get_layers_module(sub_model))
|
||||
if has_transformer_layer:
|
||||
self.transformer_module = model
|
||||
return False
|
||||
|
||||
|
||||
class LayerCombination:
|
||||
def __init__(self, config):
|
||||
self.name = config["name"]
|
||||
self.num = config["num"]
|
||||
self.memory = config["memory"]
|
||||
self.cost = config["cost"]
|
||||
self.broadcast_value = config["broadcast_value"]
|
||||
self.policy_name = config["policy_name"]
|
||||
|
||||
|
||||
class GoodsValue:
|
||||
def __init__(self):
|
||||
self.layer_names = []
|
||||
self.memory = 0
|
||||
self.cost = float('inf')
|
||||
|
||||
|
||||
def solve_graph(model, pp, device_memory):
|
||||
solver = GraphSolver()
|
||||
solver.get_layers_module(model)
|
||||
solver.total_recompute_cost = sys.maxsize
|
||||
# the first layer is not recomputed
|
||||
total_model = solver.layers_module['layers'][0]
|
||||
no_recompute_layer = total_model['layers']
|
||||
full_chp_per_layer = total_model['input']
|
||||
full_chp_time_per_layer = total_model['time']
|
||||
full_activation = total_model['memory']
|
||||
|
||||
num_layers = len(solver.layers_module['layers'])
|
||||
solver.total_forward_cost = full_chp_time_per_layer * num_layers
|
||||
static_memory = model['used_mem'] + solver.cal_non_transformer_memory(model)
|
||||
|
||||
config = {
|
||||
'nlayer': num_layers,
|
||||
'static_memory_layer': static_memory,
|
||||
'pp': pp,
|
||||
'device_memory': device_memory,
|
||||
'layers': no_recompute_layer,
|
||||
'chp_input': full_chp_per_layer,
|
||||
'full_activation': full_activation,
|
||||
'chp_time': full_chp_time_per_layer
|
||||
}
|
||||
print_rank_0(
|
||||
f"full input {full_chp_per_layer} full time {full_chp_time_per_layer} full activation {full_activation}")
|
||||
generate_recompute_policy(solver, config)
|
||||
|
||||
|
||||
def generate_recompute_policy(solver, config):
|
||||
num_layers = config["nlayer"]
|
||||
full_chp_per_layer = config["chp_input"]
|
||||
static_memory = config["static_memory_layer"]
|
||||
no_recompute_layer = config["layers"]
|
||||
dg = solver.dg_init(no_recompute_layer)
|
||||
stash_mem_per_layer, _ = solver.calculate_cost_mem(dg, 0)
|
||||
peek = solver.calculate_cost_peek(dg, 0, 0, 0)
|
||||
stash_mem_total = stash_mem_per_layer * num_layers
|
||||
print_rank_0(
|
||||
f"Without recompute: total mem cost: {static_memory / 1024:.1f} GiB + {stash_mem_total / 1024:.1f} GiB + "
|
||||
f"{peek / 1024:.1f} GiB, total recompute 0, speed up over all recompute 25%")
|
||||
|
||||
stash_mem_total = full_chp_per_layer * num_layers
|
||||
print_rank_0(
|
||||
f"With all recompute: total mem cost: {static_memory / 1024:.1f} GiB + {stash_mem_total / 1024:.1f} GiB + "
|
||||
f"{peek / 1024:.1f} GiB, total recompute all")
|
||||
solver.layers_combination_init(dg, 0, config)
|
||||
solver.knapsack_best(config)
|
||||
recompute_policy_new = solver.analyse_policy_to_list()
|
||||
if parallel_state.get_tensor_model_parallel_world_size() > 1:
|
||||
recompute_policy_new = solver.broadcast_recompute_policy_in_mp(recompute_policy_new)
|
||||
solver.apply_policy_to_model(recompute_policy_new)
|
||||
solver.print_list_to_policy(recompute_policy_new)
|
|
@ -1,166 +0,0 @@
|
|||
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
|
||||
|
||||
from dataclasses import dataclass
|
||||
from typing import Callable
|
||||
|
||||
import torch
|
||||
|
||||
|
||||
@dataclass
|
||||
class ModelParallelConfig:
|
||||
"""Base configuration for Megatron Core
|
||||
|
||||
Model Parallelism
|
||||
-----------------
|
||||
|
||||
tensor_model_parallel_size (int): Intra-layer model parallelism. Splits tensors across GPU ranks. Defaults to 1.
|
||||
|
||||
pipeline_model_parallel_size (int): Inter-layer model parallelism. Splits transformer layers across GPU
|
||||
ranks. Defaults to 1.
|
||||
|
||||
virtual_pipeline_model_parallel_size (int): Interleaved pipeline parallelism is used to improve performance by
|
||||
reducing the pipeline bubble. Considers a transformer block as a list of smaller transformer (virtual) blocks.
|
||||
The number of virtual blocks per pipeline model parallel rank is the virtual model parallel size. See Efficient
|
||||
Large-Scale Language Model Training on GPU Clusters Using Megatron-LM for more details. Defaults to None.
|
||||
|
||||
sequence_parallel (bool): Makes tensor parallelism more memory efficient for LLMs (20B+) by
|
||||
parallelizing layer norms and dropout sequentially. See Reducing Activation Recomputation in Large Transformer
|
||||
Models for more details. Defaults to False.
|
||||
|
||||
Initialization
|
||||
--------------
|
||||
|
||||
perform_initialization (bool, default=True): If true, weights are initialized. This option can be useful when you
|
||||
know you are going to load values from a checkpoint.
|
||||
|
||||
use_cpu_initialization: (bool, default=False): When set to False, we initialize the weights directly on the GPU.
|
||||
Transferring weights from CPU to GPU can take a significant amount of time for large models. Defaults to False.
|
||||
|
||||
Training
|
||||
--------
|
||||
|
||||
fp16 (bool): If true, train with fp16 mixed precision training. Defaults to False.
|
||||
|
||||
bf16 (bool): If true, train with bf16 mixed precision training. Defaults to False.
|
||||
|
||||
params_dtype (torch.dtype): dtype used when initializing the weights. Defaults to torch.float32
|
||||
|
||||
timers (optional, default=None): TODO
|
||||
|
||||
Optimizations
|
||||
-------------
|
||||
|
||||
gradient_accumulation_fusion (bool): If true, fuses weight gradient accumulation to GEMMs. Requires the custom CUDA
|
||||
extension fused_weight_gradient_mlp_cuda module. To use gradient_accumulation_fusion you must install APEX with
|
||||
--cpp_ext and --cuda_ext. For example: "pip install --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\"
|
||||
". Note that the extension requires CUDA>=11. Otherwise, you must turn off gradient accumulation fusion.
|
||||
Defaults to False.
|
||||
|
||||
async_tensor_model_parallel_allreduce (bool, default=True): If true, enables asynchronous execution of
|
||||
tensor-model-parallel all-reduce with weight gradient computation of a column-linear layer. Defaults to False.
|
||||
|
||||
Pipeline Parallelism
|
||||
--------------------
|
||||
|
||||
pipeline_dtype (required): dtype used in p2p communication, usually params_dtype
|
||||
|
||||
grad_scale_func (optional, default=None): If using loss scaling, this function should take the loss and return the
|
||||
scaled loss. If None, no function is called on the loss.
|
||||
|
||||
enable_autocast (bool): If true runs the forward step function inside torch.autocast context. Default is False.
|
||||
|
||||
autocast_dtype (torch.dtype): dtype to pass to torch.amp.autocast when enabled. Default is pipeline_dtype.
|
||||
|
||||
variable_seq_lengths (bool, default=False): Support for variable sequence lengths across microbatches. Setting this
|
||||
communicates the size of tensors during pipeline parallelism communication, because of this extra overhead it
|
||||
should only be set if the sequence length varies by microbatch within a global batch.
|
||||
|
||||
num_microbatches_with_partial_activation_checkpoints (int, default=None): If int, set the number of microbatches
|
||||
where not all of the layers will be checkpointed and recomputed. The rest of the microbatches within the window
|
||||
of maximum outstanding microbatches will recompute all layers (either full recompute or selective recompute). If
|
||||
None, the checkpoint and recompute will be left up to the forward_step function.
|
||||
|
||||
overlap_p2p_comm (bool, optional, default=False): When True some of the peer to peer communication for pipeline
|
||||
parallelism will overlap with computation. Must be False if batch_p2p_comm is true.
|
||||
|
||||
batch_p2p_comm (bool, default=True): Use batch_isend_irecv instead of individual isend/irecv calls. Must be False
|
||||
if overlap_p2p_comm is True.
|
||||
|
||||
batch_p2p_sync (bool, default=True): When using batch_isend_irecv, do a cuda.device.synchronize afterward to work
|
||||
around a bug in older version of PyTorch.
|
||||
|
||||
use_ring_exchange_p2p (bool, default = False): Use custom ring_exchange kernel instead of
|
||||
torch.distributed.batch_isend_irecv(). Requires custom built torch with torch.distributed.ring_exchange.
|
||||
|
||||
deallocate_pipeline_outputs (optional, default=False): If True, output data is deallocated after the tensor is sent
|
||||
to the next pipeline stage. Helps with saving memory, does nothing when pipeline parallel is not used.
|
||||
|
||||
no_sync_func (optional): Function that creates a context that suppresses asynchronous data-parallel
|
||||
communication. If the model is an instance of torch.nn.DistributedDataParallel, the default is to use
|
||||
torch.nn.DistributedDataParallel.no_sync.
|
||||
|
||||
grad_sync_func (optional): Function that launches asynchronous gradient reductions (e.g. distributed optimizer
|
||||
gradient reduce-scatters). The function should take one argument: an iterable of parameters whose gradients are
|
||||
to be synchronized.
|
||||
|
||||
param_sync_func (optional): Function that launches asynchronous parameter synchronizations (e.g. distributed
|
||||
optimizer parameter all-gathers). The function should take one argument: an iterable of parameters to be
|
||||
synchronized.
|
||||
|
||||
"""
|
||||
|
||||
# Model parallelism
|
||||
tensor_model_parallel_size: int = 1
|
||||
pipeline_model_parallel_size: int = 1
|
||||
virtual_pipeline_model_parallel_size: int = None
|
||||
sequence_parallel: bool = False
|
||||
|
||||
# Initialization
|
||||
perform_initialization: bool = True
|
||||
use_cpu_initialization: bool = False
|
||||
|
||||
# Training
|
||||
fp16: bool = False
|
||||
bf16: bool = False
|
||||
params_dtype: torch.dtype = torch.float32
|
||||
timers: Callable = None
|
||||
|
||||
# Optimizations
|
||||
gradient_accumulation_fusion: bool = False
|
||||
async_tensor_model_parallel_allreduce: bool = False
|
||||
|
||||
# Pipeline Parallel
|
||||
pipeline_dtype: torch.dtype = None
|
||||
grad_scale_func: Callable = None
|
||||
enable_autocast: bool = False
|
||||
autocast_dtype: torch.dtype = None
|
||||
variable_seq_lengths: bool = False
|
||||
num_microbatches_with_partial_activation_checkpoints: int = None
|
||||
overlap_p2p_comm: bool = False
|
||||
batch_p2p_comm: bool = True
|
||||
batch_p2p_sync: bool = True
|
||||
use_ring_exchange_p2p: bool = False
|
||||
deallocate_pipeline_outputs: bool = False
|
||||
no_sync_func: Callable = None
|
||||
grad_sync_func: Callable = None
|
||||
param_sync_func: Callable = None
|
||||
|
||||
def __post_init__(self):
|
||||
"""
|
||||
Python dataclass method that is used to modify attributes after initialization.
|
||||
"""
|
||||
if self.sequence_parallel:
|
||||
if self.tensor_model_parallel_size <= 1:
|
||||
raise ValueError("Can not use sequence paralllelism without tensor parallelism")
|
||||
if self.async_tensor_model_parallel_allreduce:
|
||||
# sequence_parallelism already does this async
|
||||
self.async_tensor_model_parallel_allreduce = False
|
||||
|
||||
if self.pipeline_model_parallel_size > 1:
|
||||
if self.pipeline_dtype is None:
|
||||
raise ValueError(
|
||||
"When using pipeline parallelism, pipeline_dtype must be specified"
|
||||
)
|
||||
|
||||
if self.autocast_dtype is None:
|
||||
self.autocast_dtype = self.params_dtype
|
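# --- Hedged usage sketch (not part of the original file), illustrative values only ---
import torch

config = ModelParallelConfig(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=2,
    sequence_parallel=True,            # requires tensor_model_parallel_size > 1
    bf16=True,
    params_dtype=torch.bfloat16,
    pipeline_dtype=torch.bfloat16,     # required once pipeline_model_parallel_size > 1
)
# __post_init__ forces async_tensor_model_parallel_allreduce off under sequence
# parallelism and falls autocast_dtype back to params_dtype when it is unset.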
|
@ -1,88 +0,0 @@
|
|||
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
|
||||
|
||||
import torch
|
||||
from torch import nn
|
||||
import torch_npu
|
||||
|
||||
from modellink.error_utils import check_divisible_by_zero
|
||||
|
||||
__all__ = ['RotaryEmbedding', 'apply_rotary_pos_emb']
|
||||
|
||||
|
||||
class RotaryEmbedding(nn.Module):
|
||||
"""
|
||||
Rotary Embedding for language model.
|
||||
|
||||
Args:
|
||||
kv_channels (int): Projection weights dimension in multi-head attention. Obtained from transformer config
|
||||
rotary_percent (float): Percent of rotary dimension to use for rotary position embeddings.
|
||||
seq_len_interpolation_factor (float, optional): scale of linearly interpolating RoPE for longer sequences. The value must be a float larger than 1.0. Defaults to None
|
||||
"""
|
||||
|
||||
|
||||
def __init__(self, kv_channels, base=10000.0, rotary_percent=1.0, seq_len_interpolation_factor=None):
|
||||
super().__init__()
|
||||
dim = kv_channels
|
||||
if rotary_percent < 1.0:
|
||||
dim = int(dim * rotary_percent)
|
||||
self.seq_len_interpolation_factor = seq_len_interpolation_factor
|
||||
exponent = torch.arange(0, dim, 2).double().to(torch.npu.current_device()) / dim
|
||||
self.inv_freq = 1.0 / (base ** exponent).float()
|
||||
|
||||
def forward(self, max_seq_len, offset=0):
|
||||
"""
|
||||
Forward pass of RoPE embedding.
|
||||
|
||||
Args:
|
||||
max_seq_len (int): Maximum size of sequence
|
||||
offset (int, optional): Offset added to the sequence positions. Defaults to 0.
|
||||
|
||||
Returns:
|
||||
Tensor: Embeddings after applying RoPE.
|
||||
"""
|
||||
seq = torch.arange(max_seq_len, device=self.inv_freq.device) + offset
|
||||
if self.seq_len_interpolation_factor is not None:
|
||||
seq = seq.type_as(self.inv_freq)
|
||||
seq *= 1 / self.seq_len_interpolation_factor
|
||||
freqs = torch.outer(seq, self.inv_freq)
|
||||
# first part even vector components, second part odd vector components,
|
||||
# 2 * dim in dimension size
|
||||
emb = torch.cat((freqs, freqs), dim=-1)
|
||||
# emb [seq_length, .., dim]
|
||||
return emb[:, None, None, :]
|
||||
|
||||
|
||||
def _rotate_half(x):
|
||||
"""
|
||||
change sign so the last dimension becomes [-odd, +even]
|
||||
"""
|
||||
x1, x2 = torch.chunk(x, 2, dim=-1)
|
||||
return torch.cat((-x2, x1), dim=-1)
|
||||
|
||||
|
||||
def apply_rotary_pos_emb(t, freqs):
|
||||
"""
|
||||
Apply rotary positional embedding to input tensor T.
|
||||
Args:
|
||||
t (Tensor): Input tensor T is of shape [seq_length, ... , dim]
|
||||
freqs (Tensor): Rotary Positional embedding tensor freq is of shape [seq_length, ..., dim]
|
||||
|
||||
Returns:
|
||||
Tensor: The input tensor after applying RoPE
|
||||
"""
|
||||
rot_dim = freqs.shape[-1]
|
||||
# ideally t_pass is empty so that rotary pos embedding is applied to the whole tensor t
|
||||
t, t_pass = t[..., :rot_dim], t[..., rot_dim:]
|
||||
|
||||
# first part is cosine component
|
||||
# second part is sine component, need to change signs with _rotate_half method
|
||||
cos_ = torch.cos(freqs).to(t.dtype)
|
||||
sin_ = torch.sin(freqs).to(t.dtype)
|
||||
t = (t * cos_) + (_rotate_half(t) * sin_)
|
||||
return torch.cat((t, t_pass), dim=-1)
|
||||
|
||||
|
||||
def apply_fused_rotary_pos_emb(t, freqs):
|
||||
cos = torch.cos(freqs)
|
||||
sin = torch.sin(freqs)
|
||||
return torch_npu.npu_rotary_mul(t, cos, sin).to(t.dtype)
|
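# --- Hedged usage sketch (not part of the original file) ---
# Assumes an Ascend NPU is available, since RotaryEmbedding places inv_freq on
# torch.npu.current_device(); the tensors follow the [seq, batch, head, dim] layout
# that apply_rotary_pos_emb broadcasts against.
import torch

rope = RotaryEmbedding(kv_channels=64)
freqs = rope(max_seq_len=128)                 # shape [128, 1, 1, 64]
query = torch.randn(128, 2, 8, 64).npu()      # [seq_len, batch, n_head, head_dim]
query = apply_rotary_pos_emb(query, freqs)    # RoPE applied to the full head dim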
|
@ -1,664 +0,0 @@
|
|||
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
|
||||
# pylint: disable=global-statement
|
||||
|
||||
"""Model and data parallel groups."""
|
||||
from typing import Optional
|
||||
import torch
|
||||
|
||||
from modellink.error_utils import (
|
||||
ensure_valid,
|
||||
ensure_var_is_none,
|
||||
ensure_var_is_not_none,
|
||||
check_equal
|
||||
)
|
||||
from .utils import GlobalMemoryBuffer
|
||||
from ..global_vars import get_args
|
||||
|
||||
# Intra-layer model parallel group that the current rank belongs to.
|
||||
_TENSOR_MODEL_PARALLEL_GROUP = None
|
||||
# Inter-layer model parallel group that the current rank belongs to.
|
||||
_PIPELINE_MODEL_PARALLEL_GROUP = None
|
||||
# Model parallel group (both intra- and pipeline) that the current rank belongs to.
|
||||
_MODEL_PARALLEL_GROUP = None
|
||||
# Embedding group.
|
||||
_EMBEDDING_GROUP = None
|
||||
# Position embedding group.
|
||||
_POSITION_EMBEDDING_GROUP = None
|
||||
# Data parallel group that the current rank belongs to.
|
||||
_DATA_PARALLEL_GROUP = None
|
||||
_DATA_PARALLEL_GROUP_GLOO = None
|
||||
# FP8 amax reduction group.
|
||||
_AMAX_REDUCTION_GROUP = None
|
||||
|
||||
_VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK = None
|
||||
_VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = None
|
||||
_PIPELINE_MODEL_PARALLEL_SPLIT_RANK = None
|
||||
_PIPELINE_PREV_GROUP = None
|
||||
_PIPELINE_NEXT_GROUP = None
|
||||
# These values enable us to change the mpu sizes on the fly.
|
||||
_MPU_TENSOR_MODEL_PARALLEL_WORLD_SIZE = None
|
||||
_MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = None
|
||||
_MPU_TENSOR_MODEL_PARALLEL_RANK = None
|
||||
_MPU_PIPELINE_MODEL_PARALLEL_RANK = None
|
||||
|
||||
# A list of ranks that have a copy of the embedding.
|
||||
_EMBEDDING_GLOBAL_RANKS = None
|
||||
|
||||
# A list of ranks that have a copy of the position embedding.
|
||||
_POSITION_EMBEDDING_GLOBAL_RANKS = None
|
||||
|
||||
# A list of global ranks for each pipeline group to ease calculation of the source
|
||||
# rank when broadcasting from the first or last pipeline stage.
|
||||
_PIPELINE_GLOBAL_RANKS = None
|
||||
|
||||
# A list of global ranks for each data parallel group to ease calculation of the source
|
||||
# rank when broadcasting weights from src to all other data parallel ranks
|
||||
_DATA_PARALLEL_GLOBAL_RANKS = None
|
||||
|
||||
# Memory buffers to avoid dynamic memory allocation
|
||||
_GLOBAL_MEMORY_BUFFER = None
|
||||
|
||||
|
||||
def initialize_model_parallel(
|
||||
tensor_model_parallel_size: int = 1,
|
||||
pipeline_model_parallel_size: int = 1,
|
||||
virtual_pipeline_model_parallel_size: Optional[int] = None,
|
||||
pipeline_model_parallel_split_rank: Optional[int] = None,
|
||||
use_fp8: bool = False,
|
||||
) -> None:
|
||||
"""
|
||||
Initialize model data parallel groups.
|
||||
|
||||
Arguments:
|
||||
tensor_model_parallel_size (int, default = 1):
|
||||
The number of GPUs to split individual tensors across.
|
||||
|
||||
pipeline_model_parallel_size (int, default = 1):
|
||||
The number of tensor parallel GPU groups to split the
|
||||
Transformer layers across. For example, if
|
||||
tensor_model_parallel_size is 4 and
|
||||
pipeline_model_parallel_size is 2, the model will be split
|
||||
into 2 groups of 4 GPUs.
|
||||
|
||||
virtual_pipeline_model_parallel_size (int, optional):
|
||||
The number of stages that each pipeline group will have,
|
||||
interleaving as necessary. If None, no interleaving is
|
||||
performed. For example, if tensor_model_parallel_size is 1,
|
||||
pipeline_model_parallel_size is 4,
|
||||
virtual_pipeline_model_parallel_size is 2, and there are
|
||||
16 transformer layers in the model, the model will be
|
||||
split into 8 stages with two layers each and each GPU
|
||||
would get 2 stages as such (layer number starting with 1):
|
||||
|
||||
GPU 0: [1, 2] [9, 10]
|
||||
GPU 1: [3, 4] [11, 12]
|
||||
GPU 2: [5, 6] [13, 14]
|
||||
GPU 3: [7, 8] [15, 16]
|
||||
|
||||
pipeline_model_parallel_split_rank (int, optional):
|
||||
For models with both an encoder and decoder, the rank in
|
||||
pipeline to switch between encoder and decoder (i.e. the
|
||||
first rank of the decoder). This allows the user to set
|
||||
the pipeline parallel size of the encoder and decoder
|
||||
independently. For example, if
|
||||
pipeline_model_parallel_size is 8 and
|
||||
pipeline_model_parallel_split_rank is 3, then ranks 0-2
|
||||
will be the encoder and ranks 3-7 will be the decoder.
|
||||
|
||||
use_fp8 (bool, default = False):
|
||||
Construct GPU groups needed for FP8 training, namely for
|
||||
amax reduction across the product of the data-parallel and
|
||||
tensor-parallel groups.
|
||||
|
||||
Let's say we have a total of 16 GPUs denoted by g0 ... g15 and we
|
||||
use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
|
||||
the model pipeline. The present function will
|
||||
create 8 tensor model-parallel groups, 4 pipeline model-parallel groups
|
||||
and 8 data-parallel groups as:
|
||||
8 data_parallel groups:
|
||||
[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]
|
||||
8 tensor model-parallel groups:
|
||||
[g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]
|
||||
4 pipeline model-parallel groups:
|
||||
[g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]
|
||||
Note that for efficiency, the caller should make sure adjacent ranks
|
||||
are on the same DGX box. For example if we are using 2 DGX-1 boxes
|
||||
with a total of 16 GPUs, rank 0 to 7 belong to the first box and
|
||||
ranks 8 to 15 belong to the second box.
|
||||
"""
|
||||
ensure_valid(not use_fp8, error_message="FP8 not supported by AscendSpeed")
|
||||
if torch.distributed.get_rank() == 0:
|
||||
print('> initializing tensor model parallel with size {}'.format(
|
||||
tensor_model_parallel_size))
|
||||
print('> initializing pipeline model parallel with size {}'.format(
|
||||
pipeline_model_parallel_size))
|
||||
# Get world size and rank. Ensure some consistencies.
|
||||
ensure_valid(torch.distributed.is_initialized())
|
||||
world_size: int = torch.distributed.get_world_size()
|
||||
|
||||
if world_size % (tensor_model_parallel_size * pipeline_model_parallel_size) != 0:
|
||||
raise RuntimeError(
|
||||
f"world_size ({world_size}) is not divisible by tensor_model_parallel_size "
|
||||
f"({tensor_model_parallel_size}) x pipeline_model_parallel_size "
|
||||
f"({pipeline_model_parallel_size})"
|
||||
)
|
||||
|
||||
data_parallel_size: int = world_size // (
|
||||
tensor_model_parallel_size * pipeline_model_parallel_size
|
||||
)
|
||||
|
||||
num_tensor_model_parallel_groups: int = world_size // tensor_model_parallel_size
|
||||
num_pipeline_model_parallel_groups: int = world_size // pipeline_model_parallel_size
|
||||
|
||||
if virtual_pipeline_model_parallel_size is not None:
|
||||
if not pipeline_model_parallel_size > 2:
|
||||
raise RuntimeError(
|
||||
"pipeline-model-parallel size should be greater than 2 with interleaved schedule"
|
||||
)
|
||||
global _VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK
|
||||
global _VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
|
||||
_VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK = 0
|
||||
_VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = virtual_pipeline_model_parallel_size
|
||||
|
||||
if pipeline_model_parallel_split_rank is not None:
|
||||
global _PIPELINE_MODEL_PARALLEL_SPLIT_RANK
|
||||
_PIPELINE_MODEL_PARALLEL_SPLIT_RANK = pipeline_model_parallel_split_rank
|
||||
|
||||
rank = torch.distributed.get_rank()
|
||||
|
||||
# Build the data-parallel groups.
|
||||
global _DATA_PARALLEL_GROUP
|
||||
global _DATA_PARALLEL_GROUP_GLOO
|
||||
global _DATA_PARALLEL_GLOBAL_RANKS
|
||||
ensure_var_is_none(_DATA_PARALLEL_GROUP, error_message='data parallel group is already initialized')
|
||||
all_data_parallel_group_ranks = []
|
||||
args = get_args()
|
||||
for i in range(pipeline_model_parallel_size):
|
||||
start_rank = i * num_pipeline_model_parallel_groups
|
||||
end_rank = (i + 1) * num_pipeline_model_parallel_groups
|
||||
for j in range(tensor_model_parallel_size):
|
||||
ranks = range(start_rank + j, end_rank, tensor_model_parallel_size)
|
||||
all_data_parallel_group_ranks.append(list(ranks))
|
||||
group = torch.distributed.new_group(ranks)
|
||||
if args.use_distributed_optimizer:
|
||||
group_gloo = torch.distributed.new_group(ranks, backend="gloo")
|
||||
if rank in ranks:
|
||||
_DATA_PARALLEL_GROUP = group
|
||||
_DATA_PARALLEL_GLOBAL_RANKS = ranks
|
||||
if args.use_distributed_optimizer:
|
||||
_DATA_PARALLEL_GROUP_GLOO = group_gloo
|
||||
|
||||
# Build the model-parallel groups.
|
||||
global _MODEL_PARALLEL_GROUP
|
||||
ensure_var_is_none(_MODEL_PARALLEL_GROUP, error_message='model parallel group is already initialized')
|
||||
for i in range(data_parallel_size):
|
||||
ranks = [
|
||||
data_parallel_group_ranks[i]
|
||||
for data_parallel_group_ranks in all_data_parallel_group_ranks
|
||||
]
|
||||
group = torch.distributed.new_group(ranks)
|
||||
if rank in ranks:
|
||||
_MODEL_PARALLEL_GROUP = group
|
||||
|
||||
# Build the tensor model-parallel groups.
|
||||
global _TENSOR_MODEL_PARALLEL_GROUP
|
||||
ensure_var_is_none(_TENSOR_MODEL_PARALLEL_GROUP, error_message='tensor model parallel' \
|
||||
' group is already initialized')
|
||||
for i in range(num_tensor_model_parallel_groups):
|
||||
ranks = range(i * tensor_model_parallel_size, (i + 1) * tensor_model_parallel_size)
|
||||
group = torch.distributed.new_group(ranks)
|
||||
if rank in ranks:
|
||||
_TENSOR_MODEL_PARALLEL_GROUP = group
|
||||
|
||||
# Build the pipeline model-parallel groups and embedding groups
|
||||
# (first and last rank in each pipeline model-parallel group).
|
||||
global _PIPELINE_MODEL_PARALLEL_GROUP
|
||||
global _PIPELINE_GLOBAL_RANKS
|
||||
global _PIPELINE_PREV_GROUP
|
||||
global _PIPELINE_NEXT_GROUP
|
||||
ensure_var_is_none(_PIPELINE_MODEL_PARALLEL_GROUP, error_message='pipeline model parallel' \
|
||||
' group is already initialized')
|
||||
global _EMBEDDING_GROUP
|
||||
global _EMBEDDING_GLOBAL_RANKS
|
||||
ensure_var_is_none(_EMBEDDING_GROUP, error_message='embedding group is already initialized')
|
||||
global _POSITION_EMBEDDING_GROUP
|
||||
global _POSITION_EMBEDDING_GLOBAL_RANKS
|
||||
ensure_var_is_none(_POSITION_EMBEDDING_GROUP, error_message='position embedding' \
|
||||
' group is already initialized')
|
||||
for i in range(num_pipeline_model_parallel_groups):
|
||||
ranks = range(i, world_size, num_pipeline_model_parallel_groups)
|
||||
group = torch.distributed.new_group(ranks)
|
||||
if rank in ranks:
|
||||
_PIPELINE_MODEL_PARALLEL_GROUP = group
|
||||
_PIPELINE_GLOBAL_RANKS = ranks
|
||||
for j in iter(range(len(ranks))):
|
||||
ranks_ = [ranks[j], ranks[(j + 1) % len(ranks)]] if world_size != 1 else [ranks[j]]
|
||||
group = torch.distributed.new_group(ranks_)
|
||||
if rank == ranks[j]:
|
||||
_PIPELINE_NEXT_GROUP = group
|
||||
if rank == ranks[(j + 1) % len(ranks)]:
|
||||
_PIPELINE_PREV_GROUP = group
|
||||
# Setup embedding group (to exchange gradients between
|
||||
# first and last stages).
|
||||
if len(ranks) > 1:
|
||||
embedding_ranks = [ranks[0], ranks[-1]]
|
||||
position_embedding_ranks = [ranks[0]]
|
||||
if pipeline_model_parallel_split_rank is not None:
|
||||
if ranks[pipeline_model_parallel_split_rank] not in embedding_ranks:
|
||||
embedding_ranks = [
|
||||
ranks[0],
|
||||
ranks[pipeline_model_parallel_split_rank],
|
||||
ranks[-1],
|
||||
]
|
||||
if ranks[pipeline_model_parallel_split_rank] not in position_embedding_ranks:
|
||||
position_embedding_ranks = [ranks[0], ranks[pipeline_model_parallel_split_rank]]
|
||||
else:
|
||||
embedding_ranks = ranks
|
||||
position_embedding_ranks = ranks
|
||||
|
||||
group = torch.distributed.new_group(embedding_ranks)
|
||||
if rank in embedding_ranks:
|
||||
_EMBEDDING_GROUP = group
|
||||
if rank in ranks:
|
||||
_EMBEDDING_GLOBAL_RANKS = embedding_ranks
|
||||
|
||||
group = torch.distributed.new_group(position_embedding_ranks)
|
||||
if rank in position_embedding_ranks:
|
||||
_POSITION_EMBEDDING_GROUP = group
|
||||
if rank in ranks:
|
||||
_POSITION_EMBEDDING_GLOBAL_RANKS = position_embedding_ranks
|
||||
|
||||
# Initialize global memory buffer
|
||||
# This isn't really "parallel state" but there isn't another good place to
|
||||
# put this. If we end up with a more generic initialization of megatron-core
|
||||
# we could stick it there
|
||||
_set_global_memory_buffer()
|
||||
|
||||
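# --- Hedged usage sketch (not part of the original file) ---
# Intended to run inside a modellink/AscendSpeed launch where torch.distributed and the
# global args are already initialized; the sizes mirror the 16-GPU example in the
# docstring above (tp=2, pp=4, leaving a data-parallel size of 2), and "hccl" is the
# backend typically used on Ascend devices.
#
# torch.distributed.init_process_group(backend="hccl", world_size=16, rank=rank)
# initialize_model_parallel(tensor_model_parallel_size=2,
#                           pipeline_model_parallel_size=4)
# assert get_tensor_model_parallel_world_size() == 2
# assert get_pipeline_model_parallel_world_size() == 4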
|
||||
def is_unitialized():
|
||||
"""Useful for code segments that may be accessed with or without mpu initialization"""
|
||||
return _DATA_PARALLEL_GROUP is None
|
||||
|
||||
|
||||
def model_parallel_is_initialized():
|
||||
"""Check if model and data parallel groups are initialized."""
|
||||
if (
|
||||
_TENSOR_MODEL_PARALLEL_GROUP is None
|
||||
or _PIPELINE_MODEL_PARALLEL_GROUP is None
|
||||
or _DATA_PARALLEL_GROUP is None
|
||||
):
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
def get_model_parallel_group():
|
||||
"""Get the model parallel group the caller rank belongs to."""
|
||||
ensure_var_is_not_none(_MODEL_PARALLEL_GROUP, error_message='model parallel group is not initialized')
|
||||
return _MODEL_PARALLEL_GROUP
|
||||
|
||||
|
||||
def get_tensor_model_parallel_group():
|
||||
"""Get the tensor model parallel group the caller rank belongs to."""
|
||||
ensure_var_is_not_none(_TENSOR_MODEL_PARALLEL_GROUP, error_message='intra_layer_model' \
|
||||
' parallel group is not initialized')
|
||||
return _TENSOR_MODEL_PARALLEL_GROUP
|
||||
|
||||
|
||||
def get_pipeline_model_parallel_group():
|
||||
"""Get the pipeline model parallel group the caller rank belongs to."""
|
||||
ensure_var_is_not_none(_PIPELINE_MODEL_PARALLEL_GROUP, error_message='pipeline_model' \
|
||||
' parallel group is not initialized')
|
||||
return _PIPELINE_MODEL_PARALLEL_GROUP
|
||||
|
||||
|
||||
def get_data_parallel_group():
|
||||
"""Get the data parallel group the caller rank belongs to."""
|
||||
ensure_var_is_not_none(_DATA_PARALLEL_GROUP, error_message='data parallel group is not initialized')
|
||||
return _DATA_PARALLEL_GROUP
|
||||
|
||||
|
||||
def get_data_parallel_group_gloo():
|
||||
"""Get the data parallel group-gloo the caller rank belongs to."""
|
||||
ensure_var_is_not_none(_DATA_PARALLEL_GROUP_GLOO, error_message='data parallel' \
|
||||
' group-gloo is not initialized')
|
||||
return _DATA_PARALLEL_GROUP_GLOO
|
||||
|
||||
|
||||
def get_embedding_group():
|
||||
"""Get the embedding group the caller rank belongs to."""
|
||||
ensure_var_is_not_none(_EMBEDDING_GROUP, error_message='embedding group is not initialized')
|
||||
return _EMBEDDING_GROUP
|
||||
|
||||
|
||||
def get_position_embedding_group():
|
||||
"""Get the position embedding group the caller rank belongs to."""
|
||||
ensure_var_is_not_none(_POSITION_EMBEDDING_GROUP, error_message='position embedding' \
|
||||
' group is not initialized')
|
||||
return _POSITION_EMBEDDING_GROUP
|
||||
|
||||
|
||||
def set_tensor_model_parallel_world_size(world_size):
|
||||
"""Set the tensor model parallel size"""
|
||||
global _MPU_TENSOR_MODEL_PARALLEL_WORLD_SIZE
|
||||
_MPU_TENSOR_MODEL_PARALLEL_WORLD_SIZE = world_size
|
||||
|
||||
|
||||
def set_pipeline_model_parallel_world_size(world_size):
|
||||
"""Set the pipeline model parallel size"""
|
||||
global _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
|
||||
_MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = world_size
|
||||
|
||||
|
||||
def get_tensor_model_parallel_world_size():
|
||||
"""Return world size for the tensor model parallel group."""
|
||||
global _MPU_TENSOR_MODEL_PARALLEL_WORLD_SIZE
|
||||
if _MPU_TENSOR_MODEL_PARALLEL_WORLD_SIZE is not None:
|
||||
return _MPU_TENSOR_MODEL_PARALLEL_WORLD_SIZE
|
||||
return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
|
||||
|
||||
|
||||
def get_model_parallel_world_size():
|
||||
check_equal(get_pipeline_model_parallel_world_size(), 1, error_info="legacy get_model_parallel_world_size" \
|
||||
" is only supported if PP is disabled")
|
||||
return get_tensor_model_parallel_world_size()
|
||||
|
||||
|
||||
def get_pipeline_model_parallel_world_size():
|
||||
"""Return world size for the pipeline model parallel group."""
|
||||
global _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
|
||||
if _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE is not None:
|
||||
return _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
|
||||
return torch.distributed.get_world_size(group=get_pipeline_model_parallel_group())
|
||||
|
||||
|
||||
def set_tensor_model_parallel_rank(rank):
|
||||
"""Set tensor model parallel rank."""
|
||||
global _MPU_TENSOR_MODEL_PARALLEL_RANK
|
||||
_MPU_TENSOR_MODEL_PARALLEL_RANK = rank
|
||||
|
||||
|
||||
def set_pipeline_model_parallel_rank(rank):
|
||||
"""Set pipeline model parallel rank."""
|
||||
global _MPU_PIPELINE_MODEL_PARALLEL_RANK
|
||||
_MPU_PIPELINE_MODEL_PARALLEL_RANK = rank
|
||||
|
||||
|
||||
def set_pipeline_model_parallel_split_rank(rank):
|
||||
"""Set pipeline model parallel split rank."""
|
||||
global _PIPELINE_MODEL_PARALLEL_SPLIT_RANK
|
||||
_PIPELINE_MODEL_PARALLEL_SPLIT_RANK = rank
|
||||
|
||||
|
||||
def get_tensor_model_parallel_rank():
|
||||
"""Return my rank for the tensor model parallel group."""
|
||||
global _MPU_TENSOR_MODEL_PARALLEL_RANK
|
||||
if _MPU_TENSOR_MODEL_PARALLEL_RANK is not None:
|
||||
return _MPU_TENSOR_MODEL_PARALLEL_RANK
|
||||
return torch.distributed.get_rank(group=get_tensor_model_parallel_group())
|
||||
|
||||
|
||||
def get_model_parallel_rank():
|
||||
check_equal(get_pipeline_model_parallel_world_size(), 1, error_info="legacy get_model_parallel_rank" \
|
||||
" is only supported if PP is disabled")
|
||||
return get_tensor_model_parallel_rank()
|
||||
|
||||
|
||||
def get_pipeline_model_parallel_rank():
|
||||
"""Return my rank for the pipeline model parallel group."""
|
||||
global _MPU_PIPELINE_MODEL_PARALLEL_RANK
|
||||
if _MPU_PIPELINE_MODEL_PARALLEL_RANK is not None:
|
||||
return _MPU_PIPELINE_MODEL_PARALLEL_RANK
|
||||
return torch.distributed.get_rank(group=get_pipeline_model_parallel_group())
|
||||
|
||||
|
||||
def get_pipeline_model_parallel_split_rank():
|
||||
"""Return pipeline model parallel split rank."""
|
||||
global _PIPELINE_MODEL_PARALLEL_SPLIT_RANK
|
||||
return _PIPELINE_MODEL_PARALLEL_SPLIT_RANK
|
||||
|
||||
|
||||
def is_pipeline_first_stage(ignore_virtual=False):
|
||||
"""Return True if in the first pipeline model-parallel stage, False otherwise."""
|
||||
if not ignore_virtual:
|
||||
if (
|
||||
get_virtual_pipeline_model_parallel_world_size() is not None
|
||||
and get_virtual_pipeline_model_parallel_rank() != 0
|
||||
):
|
||||
return False
|
||||
return get_pipeline_model_parallel_rank() == 0
|
||||
|
||||
|
||||
def is_pipeline_last_stage(ignore_virtual=False):
|
||||
"""Return True if in the last pipeline model-parallel stage, False otherwise."""
|
||||
if not ignore_virtual:
|
||||
virtual_pipeline_model_parallel_world_size = (
|
||||
get_virtual_pipeline_model_parallel_world_size()
|
||||
)
|
||||
if virtual_pipeline_model_parallel_world_size is not None \
|
||||
and get_virtual_pipeline_model_parallel_rank() != (
|
||||
virtual_pipeline_model_parallel_world_size - 1):
|
||||
return False
|
||||
return get_pipeline_model_parallel_rank() == (
|
||||
get_pipeline_model_parallel_world_size() - 1)
|
||||
|
||||
|
||||
|
||||
def is_rank_in_embedding_group(ignore_virtual=False):
|
||||
"""Return true if current rank is in embedding group, False otherwise."""
|
||||
rank = torch.distributed.get_rank()
|
||||
global _EMBEDDING_GLOBAL_RANKS
|
||||
if ignore_virtual:
|
||||
return rank in _EMBEDDING_GLOBAL_RANKS
|
||||
if rank in _EMBEDDING_GLOBAL_RANKS:
|
||||
if rank == _EMBEDDING_GLOBAL_RANKS[0]:
|
||||
return is_pipeline_first_stage(ignore_virtual=False)
|
||||
elif rank == _EMBEDDING_GLOBAL_RANKS[-1]:
|
||||
return is_pipeline_last_stage(ignore_virtual=False)
|
||||
else:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def is_rank_in_position_embedding_group():
|
||||
"""Return true if current rank is in position embedding group, False otherwise."""
|
||||
rank = torch.distributed.get_rank()
|
||||
global _POSITION_EMBEDDING_GLOBAL_RANKS
|
||||
return rank in _POSITION_EMBEDDING_GLOBAL_RANKS
|
||||
|
||||
|
||||
def is_pipeline_stage_before_split(rank=None):
|
||||
"""
|
||||
Return True if pipeline stage executes encoder block for a model
|
||||
with both encoder and decoder.
|
||||
"""
|
||||
if get_pipeline_model_parallel_world_size() == 1:
|
||||
return True
|
||||
if rank is None:
|
||||
rank = get_pipeline_model_parallel_rank()
|
||||
global _PIPELINE_MODEL_PARALLEL_SPLIT_RANK
|
||||
if _PIPELINE_MODEL_PARALLEL_SPLIT_RANK is None:
|
||||
return True
|
||||
if rank < _PIPELINE_MODEL_PARALLEL_SPLIT_RANK:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def is_pipeline_stage_after_split(rank=None):
|
||||
"""
|
||||
Return True if pipeline stage executes decoder block for a model
|
||||
with both encoder and decoder.
|
||||
"""
|
||||
if get_pipeline_model_parallel_world_size() == 1:
|
||||
return True
|
||||
if rank is None:
|
||||
rank = get_pipeline_model_parallel_rank()
|
||||
global _PIPELINE_MODEL_PARALLEL_SPLIT_RANK
|
||||
if _PIPELINE_MODEL_PARALLEL_SPLIT_RANK is None:
|
||||
return True
|
||||
if rank >= _PIPELINE_MODEL_PARALLEL_SPLIT_RANK:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def is_pipeline_stage_at_split():
|
||||
"""
|
||||
Return True if pipeline stage executes encoder block and next
|
||||
stage executes decoder block for a model with both encoder and
|
||||
decoder.
|
||||
"""
|
||||
rank = get_pipeline_model_parallel_rank()
|
||||
return is_pipeline_stage_before_split(rank) and is_pipeline_stage_after_split(rank + 1)
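# --- Illustrative sketch (not part of the original file): toy replay of the
# split logic above, assuming 4 pipeline stages and a split rank of 2, so
# stages 0-1 run the encoder and stages 2-3 run the decoder. Only stage 1,
# the last encoder stage, is "at the split".
def _sketch_stage_roles(pp_world_size=4, split_rank=2):
    roles = {}
    for stage in range(pp_world_size):
        is_encoder = stage < split_rank                      # "before split"
        at_split = is_encoder and (stage + 1) >= split_rank  # next stage is decoder
        roles[stage] = ('encoder' if is_encoder else 'decoder', at_split)
    return roles
# _sketch_stage_roles() == {0: ('encoder', False), 1: ('encoder', True),
#                           2: ('decoder', False), 3: ('decoder', False)}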
|
||||
|
||||
|
||||
|
||||
def get_virtual_pipeline_model_parallel_rank():
|
||||
"""Return the virtual pipeline-parallel rank."""
|
||||
global _VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK
|
||||
return _VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK
|
||||
|
||||
|
||||
def set_virtual_pipeline_model_parallel_rank(rank):
|
||||
"""Set the virtual pipeline-parallel rank."""
|
||||
global _VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK
|
||||
_VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK = rank
|
||||
|
||||
|
||||
def get_virtual_pipeline_model_parallel_world_size():
|
||||
"""Return the virtual pipeline-parallel world size."""
|
||||
global _VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
|
||||
return _VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
|
||||
|
||||
|
||||
def set_virtual_pipeline_model_parallel_world_size(world_size):
|
||||
"""Set the virtual pipeline-parallel world size"""
|
||||
global _VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
|
||||
_VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = world_size
|
||||
|
||||
|
||||
def get_tensor_model_parallel_src_rank():
|
||||
"""
|
||||
Calculate the global rank corresponding to the first local rank
|
||||
in the tensor model parallel group.
|
||||
"""
|
||||
global_rank = torch.distributed.get_rank()
|
||||
local_world_size = get_tensor_model_parallel_world_size()
|
||||
return (global_rank // local_world_size) * local_world_size
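# --- Illustrative worked example (not part of the original file): with a
# tensor-parallel world size of 4, global ranks {4, 5, 6, 7} form one group,
# so each of them resolves to source rank 4.
def _sketch_tp_src_rank(global_rank, tp_world_size=4):
    return (global_rank // tp_world_size) * tp_world_size
# _sketch_tp_src_rank(6) == 4; _sketch_tp_src_rank(3) == 0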
|
||||
|
||||
|
||||
def get_data_parallel_src_rank():
|
||||
"""
|
||||
Calculate the global rank corresponding to the first local rank
|
||||
in the data parallel group.
|
||||
"""
|
||||
ensure_var_is_not_none(_DATA_PARALLEL_GLOBAL_RANKS, error_message="Data parallel group is not initialized")
|
||||
return _DATA_PARALLEL_GLOBAL_RANKS[0]
|
||||
|
||||
|
||||
def get_pipeline_model_parallel_first_rank():
|
||||
"""
|
||||
Return the global rank of the first process in the pipeline for the
|
||||
current tensor parallel group
|
||||
"""
|
||||
ensure_var_is_not_none(_PIPELINE_GLOBAL_RANKS, error_message="Pipeline parallel group is not initialized")
|
||||
return _PIPELINE_GLOBAL_RANKS[0]
|
||||
|
||||
|
||||
def get_pipeline_model_parallel_last_rank():
|
||||
"""
|
||||
Return the global rank of the last process in the pipeline for the
|
||||
current tensor parallel group
|
||||
"""
|
||||
ensure_var_is_not_none(_PIPELINE_GLOBAL_RANKS, error_message="Pipeline parallel group is not initialized")
|
||||
last_rank_local = get_pipeline_model_parallel_world_size() - 1
|
||||
return _PIPELINE_GLOBAL_RANKS[last_rank_local]
|
||||
|
||||
|
||||
def get_pipeline_model_parallel_next_rank():
|
||||
"""Return the global rank that follows the caller in the pipeline"""
|
||||
ensure_var_is_not_none(_PIPELINE_GLOBAL_RANKS, error_message="Pipeline parallel group is not initialized")
|
||||
rank_in_pipeline = get_pipeline_model_parallel_rank()
|
||||
world_size = get_pipeline_model_parallel_world_size()
|
||||
return _PIPELINE_GLOBAL_RANKS[(rank_in_pipeline + 1) % world_size]
|
||||
|
||||
|
||||
def get_pipeline_model_parallel_prev_rank():
|
||||
"""Return the global rank that preceeds the caller in the pipeline"""
|
||||
ensure_var_is_not_none(_PIPELINE_GLOBAL_RANKS, error_message="Pipeline parallel group is not initialized")
|
||||
rank_in_pipeline = get_pipeline_model_parallel_rank()
|
||||
world_size = get_pipeline_model_parallel_world_size()
|
||||
return _PIPELINE_GLOBAL_RANKS[(rank_in_pipeline - 1) % world_size]
|
||||
|
||||
|
||||
def get_pipeline_model_parallel_prev_rank_group():
|
||||
ensure_var_is_not_none(_PIPELINE_PREV_GROUP)
|
||||
return _PIPELINE_PREV_GROUP
|
||||
|
||||
|
||||
def get_pipeline_model_parallel_next_rank_group():
|
||||
ensure_var_is_not_none(_PIPELINE_NEXT_GROUP)
|
||||
return _PIPELINE_NEXT_GROUP
|
||||
|
||||
|
||||
def get_data_parallel_world_size():
|
||||
"""Return world size for the data parallel group."""
|
||||
return torch.distributed.get_world_size(group=get_data_parallel_group())
|
||||
|
||||
|
||||
def get_data_parallel_rank():
|
||||
"""Return my rank for the data parallel group."""
|
||||
return torch.distributed.get_rank(group=get_data_parallel_group())
|
||||
|
||||
|
||||
def _set_global_memory_buffer():
|
||||
"""Initialize global buffer"""
|
||||
global _GLOBAL_MEMORY_BUFFER
|
||||
ensure_var_is_none(_GLOBAL_MEMORY_BUFFER, error_message='global memory buffer is already initialized')
|
||||
_GLOBAL_MEMORY_BUFFER = GlobalMemoryBuffer()
|
||||
|
||||
|
||||
def get_global_memory_buffer():
|
||||
"""Return the global GlobalMemoryBuffer object"""
|
||||
ensure_var_is_not_none(_GLOBAL_MEMORY_BUFFER, error_message='global memory buffer is not initialized')
|
||||
return _GLOBAL_MEMORY_BUFFER
|
||||
|
||||
|
||||
def destroy_global_memory_buffer():
|
||||
"""Sets the global memory buffer to None"""
|
||||
global _GLOBAL_MEMORY_BUFFER
|
||||
_GLOBAL_MEMORY_BUFFER = None
|
||||
|
||||
|
||||
def destroy_model_parallel():
|
||||
"""Set the groups to none."""
|
||||
global _MODEL_PARALLEL_GROUP
|
||||
_MODEL_PARALLEL_GROUP = None
|
||||
global _TENSOR_MODEL_PARALLEL_GROUP
|
||||
_TENSOR_MODEL_PARALLEL_GROUP = None
|
||||
global _PIPELINE_MODEL_PARALLEL_GROUP
|
||||
_PIPELINE_MODEL_PARALLEL_GROUP = None
|
||||
global _DATA_PARALLEL_GROUP
|
||||
_DATA_PARALLEL_GROUP = None
|
||||
global _PIPELINE_NEXT_GROUP
|
||||
_PIPELINE_NEXT_GROUP = None
|
||||
global _PIPELINE_PREV_GROUP
|
||||
_PIPELINE_PREV_GROUP = None
|
||||
global _EMBEDDING_GROUP
|
||||
_EMBEDDING_GROUP = None
|
||||
global _POSITION_EMBEDDING_GROUP
|
||||
_POSITION_EMBEDDING_GROUP = None
|
||||
global _VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK
|
||||
_VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK = None
|
||||
global _VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
|
||||
_VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = None
|
||||
global _MPU_TENSOR_MODEL_PARALLEL_WORLD_SIZE
|
||||
_MPU_TENSOR_MODEL_PARALLEL_WORLD_SIZE = None
|
||||
global _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
|
||||
_MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = None
|
||||
global _MPU_TENSOR_MODEL_PARALLEL_RANK
|
||||
_MPU_TENSOR_MODEL_PARALLEL_RANK = None
|
||||
global _MPU_PIPELINE_MODEL_PARALLEL_RANK
|
||||
_MPU_PIPELINE_MODEL_PARALLEL_RANK = None
|
||||
global _GLOBAL_MEMORY_BUFFER
|
||||
_GLOBAL_MEMORY_BUFFER = None
|
||||
|
||||
|
|
@ -1,2 +0,0 @@
|
|||
# -*- coding: UTF-8 -*-
|
||||
|
|
@ -1,687 +0,0 @@
|
|||
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
|
||||
import operator
|
||||
from functools import reduce
|
||||
from typing import List, Optional, Tuple, Union
|
||||
|
||||
import torch
|
||||
from deepspeed.accelerator import get_accelerator
|
||||
import torch.distributed as dist
|
||||
from modellink import core
|
||||
from modellink.core import tensor_parallel
|
||||
from modellink import get_args
|
||||
from modellink.core import ModelParallelConfig
|
||||
from modellink.core.parallel_state import (
|
||||
get_pipeline_model_parallel_group,
|
||||
get_pipeline_model_parallel_next_rank,
|
||||
get_pipeline_model_parallel_prev_rank,
|
||||
get_pipeline_model_parallel_rank,
|
||||
get_tensor_model_parallel_world_size,
|
||||
get_pipeline_model_parallel_prev_rank_group,
|
||||
get_pipeline_model_parallel_next_rank_group
|
||||
)
|
||||
|
||||
# Types
|
||||
Shape = Union[List[int], torch.Size]
|
||||
|
||||
|
||||
|
||||
def _communicate(
|
||||
*,
|
||||
tensor_send_next: Optional[torch.Tensor],
|
||||
tensor_send_prev: Optional[torch.Tensor],
|
||||
recv_prev: bool,
|
||||
recv_next: bool,
|
||||
tensor_shape: Shape,
|
||||
config: ModelParallelConfig,
|
||||
wait_on_reqs: bool = True
|
||||
) -> Tuple[torch.Tensor, torch.Tensor, Optional[list]]:
|
||||
"""Communicate tensors between stages. Used as helper method in other
|
||||
communication methods that are used in megatron/schedules.py.
|
||||
|
||||
Arguments:
|
||||
tensor_send_next (torch.Tensor, optional):
|
||||
Tensor to send to next rank (no tensor sent if None)
|
||||
|
||||
tensor_send_prev (torch.Tensor, optional):
|
||||
Tensor to send to prev rank (no tensor sent if None)
|
||||
|
||||
recv_prev (boolean, required):
|
||||
whether tensor should be received from previous rank.
|
||||
|
||||
recv_next (boolean, required):
|
||||
whether tensor should be received from next rank.
|
||||
|
||||
tensor_shape (List[int] or torch.Size, required):
|
||||
shape of tensor to receive (this method assumes that all
|
||||
tensors sent and received in a single function call are
|
||||
the same shape).
|
||||
|
||||
wait_on_reqs (boolean, optional, default=True):
|
||||
For non-batched p2p communication, wait on each request
|
||||
before returning.
|
||||
|
||||
Returns:
|
||||
tuple containing
|
||||
|
||||
- tensor_recv_prev: torch.Tensor if recv_prev is True, None otherwise.
|
||||
- tensor_recv_next: torch.Tensor if recv_next is True, None otherwise.
- reqs: list of outstanding communication requests, or None once they have been waited on.
|
||||
|
||||
"""
|
||||
|
||||
# Create placeholder tensors for receive in forward and backward directions
|
||||
# if needed.
|
||||
tensor_recv_prev = None
|
||||
tensor_recv_next = None
|
||||
args = get_args()
|
||||
|
||||
tensor_shape = tensor_shape if args.optimized_pipeline and (recv_prev or recv_next) \
|
||||
else (args.seq_length, args.micro_batch_size, args.hidden_size)
|
||||
|
||||
if args.sequence_parallel:
|
||||
seq_length = args.seq_length // get_tensor_model_parallel_world_size()
|
||||
tensor_shape = (seq_length, args.micro_batch_size, args.hidden_size)
|
||||
|
||||
if not config.variable_seq_lengths:
|
||||
recv_prev_shape = tensor_shape
|
||||
recv_next_shape = tensor_shape
|
||||
else:
|
||||
recv_prev_shape, recv_next_shape = _communicate_shapes(
|
||||
tensor_send_next, tensor_send_prev, recv_prev, recv_next, config
|
||||
)
|
||||
recv_prev_shape_origin = recv_prev_shape
|
||||
recv_next_shape_origin = recv_next_shape
|
||||
if args.scatter_gather_tensors_in_pipeline and not config.sequence_parallel:
|
||||
recv_prev_shape = reduce(operator.mul, recv_prev_shape, 1) // \
|
||||
get_tensor_model_parallel_world_size()
|
||||
recv_next_shape = reduce(operator.mul, recv_next_shape, 1) // \
|
||||
get_tensor_model_parallel_world_size()
|
||||
|
||||
if recv_prev:
|
||||
if config.pipeline_dtype is None:
|
||||
raise RuntimeError("pipeline_dtype must be provided if recv_prev is True")
|
||||
if tensor_shape is None:
|
||||
raise RuntimeError(
|
||||
"tensor_shape must be specified if recv_prev is True. "
|
||||
"Common tensor_shape is (seq_length, micro_batch_size, hidden_size)"
|
||||
)
|
||||
tensor_recv_prev = torch.empty(
|
||||
recv_prev_shape,
|
||||
requires_grad=True,
|
||||
device=get_accelerator().current_device(),
|
||||
dtype=config.pipeline_dtype,
|
||||
)
|
||||
if recv_next:
|
||||
if config.pipeline_dtype is None:
|
||||
raise RuntimeError("dtype must be provided if recv_next is True")
|
||||
if tensor_shape is None:
|
||||
raise RuntimeError(
|
||||
"tensor_shape must be specified if recv_next is True. "
|
||||
"Common tensor_shape is (seq_length, micro_batch_size, hidden_size)"
|
||||
)
|
||||
tensor_recv_next = torch.empty(
|
||||
recv_next_shape,
|
||||
requires_grad=True,
|
||||
device=get_accelerator().current_device(),
|
||||
dtype=config.pipeline_dtype,
|
||||
)
|
||||
|
||||
# Split tensor into smaller chunks if using scatter-gather optimization.
|
||||
if args.scatter_gather_tensors_in_pipeline and not config.sequence_parallel:
|
||||
if tensor_send_next is not None:
|
||||
tensor_send_next = tensor_parallel.split_tensor_into_1d_equal_chunks(tensor_send_next)
|
||||
|
||||
if tensor_send_prev is not None:
|
||||
tensor_send_prev = tensor_parallel.split_tensor_into_1d_equal_chunks(tensor_send_prev)
|
||||
|
||||
# Send tensors in both the forward and backward directions as appropriate.
|
||||
if config.use_ring_exchange_p2p:
|
||||
def _ring_exchange_wrapper(**kwargs):
|
||||
torch.distributed.ring_exchange(**kwargs)
|
||||
return []
|
||||
|
||||
p2p_func = _ring_exchange_wrapper
|
||||
elif config.batch_p2p_comm:
|
||||
if not wait_on_reqs:
|
||||
raise Exception("Wait_on_reqs should be true")
|
||||
p2p_func = _batched_p2p_ops
|
||||
else:
|
||||
p2p_func = _p2p_ops
|
||||
|
||||
reqs = p2p_func(
|
||||
tensor_send_prev=tensor_send_prev,
|
||||
tensor_recv_prev=tensor_recv_prev,
|
||||
tensor_send_next=tensor_send_next,
|
||||
tensor_recv_next=tensor_recv_next,
|
||||
group=get_pipeline_model_parallel_group(),
|
||||
)
|
||||
|
||||
if wait_on_reqs and len(reqs) > 0:
|
||||
for req in reqs:
|
||||
req.wait()
|
||||
reqs = None
|
||||
|
||||
# To protect against race condition when using batch_isend_irecv().
|
||||
# User should assert that we have a modern enough PyTorch to not need this
|
||||
get_accelerator().synchronize()
|
||||
|
||||
# If using scatter-gather optimization, gather smaller chunks.
|
||||
if args.scatter_gather_tensors_in_pipeline and not config.sequence_parallel:
|
||||
if recv_prev:
|
||||
tensor_recv_prev = tensor_parallel.gather_split_1d_tensor(
|
||||
tensor_recv_prev).view(recv_prev_shape_origin).requires_grad_()
|
||||
|
||||
if recv_next:
|
||||
tensor_recv_next = tensor_parallel.gather_split_1d_tensor(
|
||||
tensor_recv_next).view(recv_next_shape_origin).requires_grad_()
|
||||
|
||||
return tensor_recv_prev, tensor_recv_next, reqs
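# --- Illustrative sketch (not part of the original file): a minimal
# forward-receive call through _communicate, assuming the process groups above
# are initialized and `config` is a ModelParallelConfig with pipeline_dtype
# set. It mirrors what recv_forward() further below already does.
def _sketch_recv_forward(tensor_shape, config):
    input_tensor, _, _ = _communicate(
        tensor_send_next=None,
        tensor_send_prev=None,
        recv_prev=True,
        recv_next=False,
        tensor_shape=tensor_shape,
        config=config,
    )
    return input_tensor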
|
||||
|
||||
|
||||
def async_communicate(tensor_send_next, tensor_send_prev, recv_prev, recv_next):
|
||||
args = get_args()
|
||||
|
||||
# Create placeholder tensors for receive in forward and backward directions
|
||||
# if needed.
|
||||
tensor_recv_prev = None
|
||||
tensor_recv_next = None
|
||||
|
||||
tensor_shape = (args.seq_length, args.micro_batch_size, args.hidden_size)
|
||||
|
||||
if args.sequence_parallel:
|
||||
seq_length = args.seq_length // get_tensor_model_parallel_world_size()
|
||||
tensor_shape = (seq_length, args.micro_batch_size, args.hidden_size)
|
||||
|
||||
if args.scatter_gather_tensors_in_pipeline and not args.sequence_parallel:
|
||||
tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1) // \
|
||||
get_tensor_model_parallel_world_size()
|
||||
else:
|
||||
tensor_chunk_shape = tensor_shape
|
||||
dtype = args.params_dtype
|
||||
if args.fp32_residual_connection:
|
||||
dtype = torch.float
|
||||
if recv_prev:
|
||||
tensor_recv_prev = torch.empty(tensor_chunk_shape,
|
||||
requires_grad=True,
|
||||
device=get_accelerator().current_device_name(),
|
||||
dtype=dtype)
|
||||
if recv_next:
|
||||
tensor_recv_next = torch.empty(tensor_chunk_shape,
|
||||
requires_grad=True,
|
||||
device=get_accelerator().current_device_name(),
|
||||
dtype=dtype)
|
||||
|
||||
# Split tensor into smaller chunks if using scatter-gather optimization.
|
||||
if args.scatter_gather_tensors_in_pipeline and not args.sequence_parallel:
|
||||
if tensor_send_next is not None:
|
||||
tensor_send_next = tensor_parallel.split_tensor_into_1d_equal_chunks(tensor_send_next)
|
||||
|
||||
if tensor_send_prev is not None:
|
||||
tensor_send_prev = tensor_parallel.split_tensor_into_1d_equal_chunks(tensor_send_prev)
|
||||
|
||||
ops = []
|
||||
if tensor_send_prev is not None:
|
||||
torch.distributed.isend(tensor_send_prev,
|
||||
get_pipeline_model_parallel_prev_rank(),
|
||||
group=get_pipeline_model_parallel_prev_rank_group())
|
||||
if tensor_recv_prev is not None:
|
||||
ops.append(torch.distributed.irecv(tensor_recv_prev,
|
||||
get_pipeline_model_parallel_prev_rank(),
|
||||
group=get_pipeline_model_parallel_prev_rank_group()))
|
||||
if tensor_send_next is not None:
|
||||
torch.distributed.isend(tensor_send_next,
|
||||
get_pipeline_model_parallel_next_rank(),
|
||||
group=get_pipeline_model_parallel_next_rank_group())
|
||||
if tensor_recv_next is not None:
|
||||
ops.append(torch.distributed.irecv(tensor_recv_next,
|
||||
get_pipeline_model_parallel_next_rank(),
|
||||
group=get_pipeline_model_parallel_next_rank_group()))
|
||||
return tensor_recv_prev, tensor_recv_next, ops
|
||||
|
||||
|
||||
def recv_gather(tensor_recv):
|
||||
args = get_args()
|
||||
tensor_shape = (args.seq_length, args.micro_batch_size, args.hidden_size)
|
||||
|
||||
if args.scatter_gather_tensors_in_pipeline and not args.sequence_parallel:
|
||||
tensor_recv = tensor_parallel.gather_split_1d_tensor(
|
||||
tensor_recv).view(tensor_shape).requires_grad_()
|
||||
|
||||
return tensor_recv
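# --- Illustrative sketch (not part of the original file): typical pairing of
# async_communicate() and recv_gather(), assuming get_args() and the pipeline
# groups are initialized. The receive is posted, other work may overlap, and
# the scattered chunks are re-gathered once the handles complete.
def _sketch_async_recv_forward():
    tensor_recv_prev, _, ops = async_communicate(
        tensor_send_next=None, tensor_send_prev=None,
        recv_prev=True, recv_next=False)
    # ... overlapping computation could run here ...
    for op in ops:
        op.wait()
    return recv_gather(tensor_recv_prev)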
|
||||
|
||||
|
||||
def recv_forward(tensor_shape: Shape, config: ModelParallelConfig) -> torch.Tensor:
|
||||
""" Receive tensor from previous rank in pipeline (forward receive).
|
||||
|
||||
|
||||
See _communicate for argument details.
|
||||
"""
|
||||
|
||||
if core.parallel_state.is_pipeline_first_stage():
|
||||
input_tensor = None
|
||||
else:
|
||||
if config.timers is not None:
|
||||
config.timers('forward-recv', log_level=2).start()
|
||||
input_tensor, _, _ = _communicate(
|
||||
tensor_send_next=None,
|
||||
tensor_send_prev=None,
|
||||
recv_prev=True,
|
||||
recv_next=False,
|
||||
tensor_shape=tensor_shape,
|
||||
config=config,
|
||||
)
|
||||
if config.timers is not None:
|
||||
config.timers('forward-recv').stop()
|
||||
return input_tensor
|
||||
|
||||
|
||||
def recv_backward(tensor_shape: Shape, config: ModelParallelConfig) -> torch.Tensor:
|
||||
"""Receive tensor from next rank in pipeline (backward receive).
|
||||
|
||||
See _communicate for argument details.
|
||||
"""
|
||||
if core.parallel_state.is_pipeline_last_stage():
|
||||
output_tensor_grad = None
|
||||
else:
|
||||
if config.timers is not None:
|
||||
config.timers('backward-recv', log_level=2).start()
|
||||
_, output_tensor_grad, _ = _communicate(
|
||||
tensor_send_next=None,
|
||||
tensor_send_prev=None,
|
||||
recv_prev=False,
|
||||
recv_next=True,
|
||||
tensor_shape=tensor_shape,
|
||||
config=config,
|
||||
)
|
||||
if config.timers is not None:
|
||||
config.timers('backward-recv').stop()
|
||||
return output_tensor_grad
|
||||
|
||||
|
||||
def send_forward(output_tensor: torch.Tensor, config: ModelParallelConfig) -> None:
|
||||
"""Send tensor to next rank in pipeline (forward send).
|
||||
|
||||
See _communicate for argument details.
|
||||
"""
|
||||
|
||||
if not core.parallel_state.is_pipeline_last_stage():
|
||||
if config.timers is not None:
|
||||
config.timers('forward-send', log_level=2).start()
|
||||
_communicate(
|
||||
tensor_send_next=output_tensor,
|
||||
tensor_send_prev=None,
|
||||
recv_prev=False,
|
||||
recv_next=False,
|
||||
tensor_shape=None,
|
||||
config=config,
|
||||
)
|
||||
if config.timers is not None:
|
||||
config.timers('forward-send').stop()
|
||||
|
||||
|
||||
def send_backward(input_tensor_grad: torch.Tensor, config: ModelParallelConfig) -> None:
|
||||
"""Send tensor to previous rank in pipeline (backward send).
|
||||
|
||||
See _communicate for argument details.
|
||||
"""
|
||||
if not core.parallel_state.is_pipeline_first_stage():
|
||||
if config.timers is not None:
|
||||
config.timers('backward-send', log_level=2).start()
|
||||
_communicate(
|
||||
tensor_send_next=None,
|
||||
tensor_send_prev=input_tensor_grad,
|
||||
recv_prev=False,
|
||||
recv_next=False,
|
||||
tensor_shape=None,
|
||||
config=config,
|
||||
)
|
||||
if config.timers is not None:
|
||||
config.timers('backward-send').stop()
|
||||
|
||||
|
||||
def send_forward_recv_backward(
|
||||
output_tensor: torch.Tensor, tensor_shape: Shape, config: ModelParallelConfig
|
||||
) -> torch.Tensor:
|
||||
"""Batched send and recv with next rank in pipeline.
|
||||
|
||||
See _communicate for argument details.
|
||||
"""
|
||||
if core.parallel_state.is_pipeline_last_stage():
|
||||
output_tensor_grad = None
|
||||
else:
|
||||
if config.timers is not None:
|
||||
config.timers('forward-send-backward-recv', log_level=2).start()
|
||||
_, output_tensor_grad, _ = _communicate(
|
||||
tensor_send_next=output_tensor,
|
||||
tensor_send_prev=None,
|
||||
recv_prev=False,
|
||||
recv_next=True,
|
||||
tensor_shape=tensor_shape,
|
||||
config=config,
|
||||
)
|
||||
if config.timers is not None:
|
||||
config.timers('forward-send-backward-recv').stop()
|
||||
return output_tensor_grad
|
||||
|
||||
|
||||
def send_backward_recv_forward(
|
||||
input_tensor_grad: torch.Tensor, tensor_shape: Shape, config: ModelParallelConfig
|
||||
) -> torch.Tensor:
|
||||
"""Batched send and recv with previous rank in pipeline.
|
||||
|
||||
See _communicate for argument details.
|
||||
"""
|
||||
if core.parallel_state.is_pipeline_first_stage():
|
||||
input_tensor = None
|
||||
else:
|
||||
if config.timers is not None:
|
||||
config.timers('backward-send-forward-recv', log_level=2).start()
|
||||
input_tensor, _, _ = _communicate(
|
||||
tensor_send_next=None,
|
||||
tensor_send_prev=input_tensor_grad,
|
||||
recv_prev=True,
|
||||
recv_next=False,
|
||||
tensor_shape=tensor_shape,
|
||||
config=config,
|
||||
)
|
||||
if config.timers is not None:
|
||||
config.timers('backward-send-forward-recv').stop()
|
||||
return input_tensor
|
||||
|
||||
|
||||
def send_forward_recv_forward(
|
||||
output_tensor: torch.Tensor,
|
||||
recv_prev: bool,
|
||||
tensor_shape: Shape,
|
||||
config: ModelParallelConfig,
|
||||
overlap_p2p_comm: bool = False,
|
||||
) -> torch.Tensor:
|
||||
"""Batched recv from previous rank and send to next rank in pipeline.
|
||||
|
||||
See _communicate for argument details.
|
||||
"""
|
||||
if config.timers is not None:
|
||||
config.timers('forward-send-forward-recv', log_level=2).start()
|
||||
input_tensor, _, wait_handles = _communicate(
|
||||
tensor_send_next=output_tensor,
|
||||
tensor_send_prev=None,
|
||||
recv_prev=recv_prev,
|
||||
recv_next=False,
|
||||
tensor_shape=tensor_shape,
|
||||
wait_on_reqs=(not overlap_p2p_comm),
|
||||
config=config,
|
||||
)
|
||||
if config.timers is not None:
|
||||
config.timers('forward-send-forward-recv').stop()
|
||||
if overlap_p2p_comm:
|
||||
return input_tensor, wait_handles
|
||||
return input_tensor
|
||||
|
||||
|
||||
def send_backward_recv_backward(
|
||||
input_tensor_grad: torch.Tensor,
|
||||
recv_next: bool,
|
||||
tensor_shape: Shape,
|
||||
config: ModelParallelConfig,
|
||||
overlap_p2p_comm: bool = False,
|
||||
) -> torch.Tensor:
|
||||
"""Batched recv from next rank and send to previous rank in pipeline.
|
||||
|
||||
See _communicate for argument details.
|
||||
"""
|
||||
if config.timers is not None:
|
||||
config.timers('backward-send-backward-recv', log_level=2).start()
|
||||
_, output_tensor_grad, wait_handles = _communicate(
|
||||
tensor_send_next=None,
|
||||
tensor_send_prev=input_tensor_grad,
|
||||
recv_prev=False,
|
||||
recv_next=recv_next,
|
||||
tensor_shape=tensor_shape,
|
||||
wait_on_reqs=(not overlap_p2p_comm),
|
||||
config=config,
|
||||
)
|
||||
if config.timers is not None:
|
||||
config.timers('backward-send-backward-recv').stop()
|
||||
if overlap_p2p_comm:
|
||||
return output_tensor_grad, wait_handles
|
||||
return output_tensor_grad
|
||||
|
||||
|
||||
def send_forward_backward_recv_forward_backward(
|
||||
output_tensor: torch.Tensor,
|
||||
input_tensor_grad: torch.Tensor,
|
||||
recv_prev: bool,
|
||||
recv_next: bool,
|
||||
tensor_shape: Shape,
|
||||
config: ModelParallelConfig,
|
||||
) -> torch.Tensor:
|
||||
"""Batched send and recv with previous and next ranks in pipeline.
|
||||
|
||||
See _communicate for argument details.
|
||||
"""
|
||||
if config.timers is not None:
|
||||
config.timers('forward-backward-send-forward-backward-recv', log_level=2).start()
|
||||
input_tensor, output_tensor_grad, _ = _communicate(
|
||||
tensor_send_next=output_tensor,
|
||||
tensor_send_prev=input_tensor_grad,
|
||||
recv_prev=recv_prev,
|
||||
recv_next=recv_next,
|
||||
tensor_shape=tensor_shape,
|
||||
config=config,
|
||||
)
|
||||
if config.timers is not None:
|
||||
config.timers('forward-backward-send-forward-backward-recv').stop()
|
||||
return input_tensor, output_tensor_grad
|
||||
|
||||
|
||||
def _communicate_shapes(tensor_send_next, tensor_send_prev, recv_prev, recv_next, config):
|
||||
"""Communicate tensor shapes between stages. Used to communicate
|
||||
tensor shapes before the actual tensor communication happens.
|
||||
This is required when the sequence lengths across micro batches
|
||||
are not uniform.
|
||||
|
||||
Takes the following arguments:
|
||||
tensor_send_next: tensor to send to next rank (no tensor sent if
|
||||
set to None).
|
||||
tensor_send_prev: tensor to send to prev rank (no tensor sent if
|
||||
set to None).
|
||||
recv_prev: boolean for whether tensor should be received from
|
||||
previous rank.
|
||||
recv_next: boolean for whether tensor should be received from
|
||||
next rank.
|
||||
Returns:
|
||||
(recv_prev_shape, recv_next_shape)
|
||||
"""
|
||||
|
||||
recv_prev_shape_tensor = None
|
||||
recv_next_shape_tensor = None
|
||||
send_prev_shape_tensor = None
|
||||
send_next_shape_tensor = None
|
||||
if recv_prev:
|
||||
recv_prev_shape_tensor = torch.empty((3),
|
||||
device=get_accelerator().current_device(),
|
||||
dtype=torch.int64)
|
||||
if recv_next:
|
||||
recv_next_shape_tensor = torch.empty((3),
|
||||
device=get_accelerator().current_device(),
|
||||
dtype=torch.int64)
|
||||
if tensor_send_prev is not None:
|
||||
send_prev_shape_tensor = torch.tensor(tensor_send_prev.size(),
|
||||
device=get_accelerator().current_device(),
|
||||
dtype=torch.int64)
|
||||
if tensor_send_next is not None:
|
||||
send_next_shape_tensor = torch.tensor(tensor_send_next.size(),
|
||||
device=get_accelerator().current_device(),
|
||||
dtype=torch.int64)
|
||||
|
||||
if config.use_ring_exchange_p2p:
|
||||
torch.distributed.ring_exchange(
|
||||
tensor_send_prev=send_prev_shape_tensor,
|
||||
tensor_recv_prev=recv_prev_shape_tensor,
|
||||
tensor_send_next=send_next_shape_tensor,
|
||||
tensor_recv_next=recv_next_shape_tensor,
|
||||
group=get_pipeline_model_parallel_group(),
|
||||
)
|
||||
else:
|
||||
ops = []
|
||||
if send_prev_shape_tensor is not None:
|
||||
send_prev_op = torch.distributed.P2POp(
|
||||
torch.distributed.isend,
|
||||
send_prev_shape_tensor,
|
||||
get_pipeline_model_parallel_prev_rank(),
|
||||
)
|
||||
ops.append(send_prev_op)
|
||||
if recv_prev_shape_tensor is not None:
|
||||
recv_prev_op = torch.distributed.P2POp(
|
||||
torch.distributed.irecv,
|
||||
recv_prev_shape_tensor,
|
||||
get_pipeline_model_parallel_prev_rank(),
|
||||
)
|
||||
ops.append(recv_prev_op)
|
||||
if recv_next_shape_tensor is not None:
|
||||
recv_next_op = torch.distributed.P2POp(
|
||||
torch.distributed.irecv,
|
||||
recv_next_shape_tensor,
|
||||
get_pipeline_model_parallel_next_rank(),
|
||||
)
|
||||
ops.append(recv_next_op)
|
||||
if send_next_shape_tensor is not None:
|
||||
send_next_op = torch.distributed.P2POp(
|
||||
torch.distributed.isend,
|
||||
send_next_shape_tensor,
|
||||
get_pipeline_model_parallel_next_rank(),
|
||||
)
|
||||
ops.append(send_next_op)
|
||||
|
||||
if len(ops) > 0:
|
||||
reqs = torch.distributed.batch_isend_irecv(ops)
|
||||
for req in reqs:
|
||||
req.wait()
|
||||
|
||||
# To protect against race condition when using batch_isend_irecv().
|
||||
# should take this out once the bug with batch_isend_irecv is resolved.
|
||||
get_accelerator().synchronize()
|
||||
|
||||
recv_prev_shape = [0, 0, 0]
|
||||
if recv_prev_shape_tensor is not None:
|
||||
recv_prev_shape = recv_prev_shape_tensor.tolist()
|
||||
|
||||
recv_next_shape = [0, 0, 0]
|
||||
if recv_next_shape_tensor is not None:
|
||||
recv_next_shape = recv_next_shape_tensor.tolist()
|
||||
|
||||
return recv_prev_shape, recv_next_shape
|
||||
|
||||
|
||||
def _batched_p2p_ops(
|
||||
*,
|
||||
tensor_send_prev: Optional[torch.Tensor],
|
||||
tensor_recv_prev: Optional[torch.Tensor],
|
||||
tensor_send_next: Optional[torch.Tensor],
|
||||
tensor_recv_next: Optional[torch.Tensor],
|
||||
group: torch.distributed.ProcessGroup
|
||||
):
|
||||
ops = []
|
||||
if tensor_send_prev is not None:
|
||||
send_prev_op = torch.distributed.P2POp(
|
||||
torch.distributed.isend,
|
||||
tensor_send_prev,
|
||||
get_pipeline_model_parallel_prev_rank(),
|
||||
group)
|
||||
ops.append(send_prev_op)
|
||||
if tensor_recv_prev is not None:
|
||||
recv_prev_op = torch.distributed.P2POp(
|
||||
torch.distributed.irecv,
|
||||
tensor_recv_prev,
|
||||
get_pipeline_model_parallel_prev_rank(),
|
||||
group,
|
||||
)
|
||||
ops.append(recv_prev_op)
|
||||
if tensor_send_next is not None:
|
||||
send_next_op = torch.distributed.P2POp(
|
||||
torch.distributed.isend,
|
||||
tensor_send_next,
|
||||
get_pipeline_model_parallel_next_rank(),
|
||||
group,
|
||||
)
|
||||
ops.append(send_next_op)
|
||||
if tensor_recv_next is not None:
|
||||
recv_next_op = torch.distributed.P2POp(
|
||||
torch.distributed.irecv,
|
||||
tensor_recv_next,
|
||||
get_pipeline_model_parallel_next_rank(),
|
||||
group,
|
||||
)
|
||||
ops.append(recv_next_op)
|
||||
|
||||
if get_pipeline_model_parallel_rank() % 2 == 1:
|
||||
ops.reverse()
|
||||
|
||||
if len(ops) > 0:
|
||||
reqs = torch.distributed.batch_isend_irecv(ops)
|
||||
else:
|
||||
reqs = []
|
||||
return reqs
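# --- Illustrative note (not part of the original file): the ops.reverse()
# above alternates the posting order by pipeline-rank parity, so even ranks
# post (send_prev, recv_prev, send_next, recv_next) while odd ranks post the
# reverse. Presumably this lets neighbouring stages pair their sends with the
# other side's receives inside batch_isend_irecv instead of both sides posting
# the same order and risking a stall.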
|
||||
|
||||
|
||||
def _p2p_ops(
|
||||
*,
|
||||
tensor_send_prev: Optional[torch.Tensor],
|
||||
tensor_recv_prev: Optional[torch.Tensor],
|
||||
tensor_send_next: Optional[torch.Tensor],
|
||||
tensor_recv_next: Optional[torch.Tensor],
|
||||
group: torch.distributed.ProcessGroup
|
||||
):
|
||||
reqs = []
|
||||
rank = get_pipeline_model_parallel_rank()
|
||||
if rank % 2 == 0:
|
||||
if tensor_send_prev is not None:
|
||||
send_prev_req = torch.distributed.isend(
|
||||
tensor=tensor_send_prev, dst=get_pipeline_model_parallel_prev_rank(), group=group,
|
||||
)
|
||||
reqs.append(send_prev_req)
|
||||
|
||||
if tensor_recv_prev is not None:
|
||||
recv_prev_req = torch.distributed.irecv(
|
||||
tensor=tensor_recv_prev, src=get_pipeline_model_parallel_prev_rank(), group=group,
|
||||
)
|
||||
reqs.append(recv_prev_req)
|
||||
|
||||
if tensor_recv_next is not None:
|
||||
recv_next_req = torch.distributed.irecv(
|
||||
tensor=tensor_recv_next, src=get_pipeline_model_parallel_next_rank(), group=group,
|
||||
)
|
||||
reqs.append(recv_next_req)
|
||||
|
||||
if tensor_send_next is not None:
|
||||
send_next_req = torch.distributed.isend(
|
||||
tensor=tensor_send_next, dst=get_pipeline_model_parallel_next_rank(), group=group,
|
||||
)
|
||||
reqs.append(send_next_req)
|
||||
|
||||
|
||||
else:
|
||||
if tensor_send_prev is not None:
|
||||
send_prev_req = torch.distributed.isend(
|
||||
tensor=tensor_send_prev, dst=get_pipeline_model_parallel_prev_rank(), group=group,
|
||||
)
|
||||
reqs.append(send_prev_req)
|
||||
|
||||
if tensor_recv_prev is not None:
|
||||
recv_prev_req = torch.distributed.irecv(
|
||||
tensor=tensor_recv_prev, src=get_pipeline_model_parallel_prev_rank(), group=group,
|
||||
)
|
||||
reqs.append(recv_prev_req)
|
||||
|
||||
if tensor_recv_next is not None:
|
||||
recv_next_req = torch.distributed.irecv(
|
||||
tensor=tensor_recv_next, src=get_pipeline_model_parallel_next_rank(), group=group,
|
||||
)
|
||||
reqs.append(recv_next_req)
|
||||
|
||||
if tensor_send_next is not None:
|
||||
send_next_req = torch.distributed.isend(
|
||||
tensor=tensor_send_next, dst=get_pipeline_model_parallel_next_rank(), group=group,
|
||||
)
|
||||
reqs.append(send_next_req)
|
||||
|
||||
return reqs
|
File diff suppressed because it is too large
|
@ -1,65 +0,0 @@
|
|||
from .cross_entropy import vocab_parallel_cross_entropy
|
||||
from .data import broadcast_data
|
||||
from .mappings import (
|
||||
copy_to_tensor_model_parallel_region,
|
||||
gather_from_sequence_parallel_region,
|
||||
gather_from_tensor_model_parallel_region,
|
||||
scatter_to_sequence_parallel_region,
|
||||
scatter_to_tensor_model_parallel_region,
|
||||
)
|
||||
from .layers import (
|
||||
ColumnParallelLinear,
|
||||
RowParallelLinear,
|
||||
VocabParallelEmbedding,
|
||||
copy_tensor_model_parallel_attributes,
|
||||
linear_with_grad_accumulation_and_async_allreduce,
|
||||
param_is_not_tensor_parallel_duplicate,
|
||||
set_defaults_if_not_set_tensor_model_parallel_attributes,
|
||||
set_tensor_model_parallel_attributes,
|
||||
)
|
||||
from .random import (
|
||||
checkpoint,
|
||||
get_cuda_rng_tracker,
|
||||
model_parallel_cuda_manual_seed,
|
||||
reset_checkpointed_activations_memory_buffer,
|
||||
init_checkpointed_activations_memory_buffer,
|
||||
)
|
||||
from .utils import (
|
||||
gather_split_1d_tensor,
|
||||
split_tensor_along_last_dim,
|
||||
split_tensor_into_1d_equal_chunks,
|
||||
VocabUtility
|
||||
)
|
||||
__all__ = [
|
||||
# cross_entropy.py
|
||||
"vocab_parallel_cross_entropy",
|
||||
# data.py
|
||||
"broadcast_data",
|
||||
# layers.py
|
||||
"ColumnParallelLinear",
|
||||
"RowParallelLinear",
|
||||
"VocabParallelEmbedding",
|
||||
"set_tensor_model_parallel_attributes",
|
||||
"set_defaults_if_not_set_tensor_model_parallel_attributes",
|
||||
"copy_tensor_model_parallel_attributes",
|
||||
"param_is_not_tensor_parallel_duplicate",
|
||||
"linear_with_grad_accumulation_and_async_allreduce",
|
||||
# mappings.py
|
||||
"copy_to_tensor_model_parallel_region",
|
||||
"gather_from_tensor_model_parallel_region",
|
||||
"gather_from_sequence_parallel_region",
|
||||
# "reduce_from_tensor_model_parallel_region",
|
||||
"scatter_to_tensor_model_parallel_region",
|
||||
"scatter_to_sequence_parallel_region",
|
||||
# random.py
|
||||
"checkpoint",
|
||||
"get_cuda_rng_tracker",
|
||||
"model_parallel_cuda_manual_seed",
|
||||
"reset_checkpointed_activations_memory_buffer",
|
||||
"init_checkpointed_activations_memory_buffer",
|
||||
# utils.py
|
||||
"split_tensor_along_last_dim",
|
||||
"split_tensor_into_1d_equal_chunks",
|
||||
"gather_split_1d_tensor",
|
||||
"VocabUtility",
|
||||
]
|
|
@ -1,157 +0,0 @@
|
|||
# coding=utf-8
|
||||
# Copyright (c) 2020, HUAWEI CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
import torch
|
||||
|
||||
from modellink.core.parallel_state import (
|
||||
get_tensor_model_parallel_group,
|
||||
get_tensor_model_parallel_rank,
|
||||
get_tensor_model_parallel_world_size
|
||||
)
|
||||
|
||||
from modellink.core.tensor_parallel.utils import VocabUtility
|
||||
from modellink.error_utils import ensure_valid
|
||||
|
||||
|
||||
class _VocabParallelCrossEntropy(torch.autograd.Function):
|
||||
|
||||
@staticmethod
|
||||
def forward(ctx, vocab_parallel_logits, target, label_smoothing=0.0):
|
||||
# Maximum value along vocab dimension across all GPUs.
|
||||
logits_max = torch.max(vocab_parallel_logits, dim=-1)[0]
|
||||
torch.distributed.all_reduce(logits_max,
|
||||
op=torch.distributed.ReduceOp.MAX,
|
||||
group=get_tensor_model_parallel_group())
|
||||
|
||||
# Subtract the maximum value.
|
||||
vocab_parallel_logits.sub_(logits_max.unsqueeze(dim=-1))
|
||||
|
||||
# Get the partition's vocab indices
|
||||
get_vocab_range = VocabUtility.vocab_range_from_per_partition_vocab_size
|
||||
partition_vocab_size = vocab_parallel_logits.size()[-1]
|
||||
rank = get_tensor_model_parallel_rank()
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
vocab_start_index, vocab_end_index = get_vocab_range(
|
||||
partition_vocab_size, rank, world_size)
|
||||
|
||||
# Create a mask of valid vocab ids (1 means it needs to be masked).
|
||||
target_mask = (target < vocab_start_index) | (target >= vocab_end_index)
|
||||
masked_target = target.clone() - vocab_start_index
|
||||
masked_target *= ~target_mask
|
||||
|
||||
# Get predicted-logits = logits[target].
|
||||
# For Simplicity, we convert logits to a 2-D tensor with size
|
||||
# [*, partition-vocab-size] and target to a 1-D tensor of size [*].
|
||||
logits_2d = vocab_parallel_logits.view(-1, partition_vocab_size)
|
||||
masked_target_1d = masked_target.view(-1)
|
||||
arange_1d = torch.arange(start=0, end=logits_2d.size()[0],
|
||||
device=logits_2d.device)
|
||||
predicted_logits_1d = logits_2d[arange_1d, masked_target_1d.long()]
|
||||
predicted_logits_1d = predicted_logits_1d.clone().contiguous()
|
||||
predicted_logits = predicted_logits_1d.view_as(target)
|
||||
predicted_logits *= ~target_mask
|
||||
# All reduce is needed to get the chunks from other GPUs.
|
||||
torch.distributed.all_reduce(predicted_logits,
|
||||
op=torch.distributed.ReduceOp.SUM,
|
||||
group=get_tensor_model_parallel_group())
|
||||
|
||||
# Sum of exponential of logits along vocab dimension across all GPUs.
|
||||
exp_logits = vocab_parallel_logits
|
||||
torch.exp(vocab_parallel_logits, out=exp_logits)
|
||||
sum_exp_logits = exp_logits.sum(dim=-1)
|
||||
torch.distributed.all_reduce(sum_exp_logits,
|
||||
op=torch.distributed.ReduceOp.SUM,
|
||||
group=get_tensor_model_parallel_group())
|
||||
|
||||
# Loss = log(sum(exp(logits))) - predicted-logit.
|
||||
loss = torch.log(sum_exp_logits) - predicted_logits
|
||||
|
||||
# Store softmax, target-mask and masked-target for backward pass.
|
||||
exp_logits.div_(sum_exp_logits.unsqueeze(dim=-1))
|
||||
|
||||
vocab_size = exp_logits.size(-1)
|
||||
if label_smoothing > 0:
|
||||
"""
|
||||
We'd like to assign 1 / (K - 1) probability mass to every index that is not the ground truth.
|
||||
= (1 - alpha) * y_gt + alpha * mean(y_{i for i != gt})
|
||||
= (1 - alpha) * y_gt + (alpha / (K - 1)) * \sum_{i != gt} y_i
|
||||
= ((K - 1) * (1 - alpha) / (K - 1)) * y_gt + (alpha / (K - 1)) * \sum_{i != gt} y_i
|
||||
= (K * (1 - alpha) - 1) / (K - 1)) * y_gt + (alpha / (K - 1)) * \sum_{i} y_i
|
||||
= (1 - (alpha * K) / (K - 1)) * y_gt + ( (alpha * K) / (K - 1) ) * \sum_{i} y_i / K
|
||||
"""
|
||||
ensure_valid(1.0 > label_smoothing > 0.0)
|
||||
smoothing = label_smoothing * vocab_size / (vocab_size - 1)
|
||||
|
||||
# Exp logits at this point are normalized probabilities. So we can just take the log to get log-probs.
|
||||
log_probs = torch.log(exp_logits)
|
||||
mean_log_probs = log_probs.mean(dim=-1)
|
||||
loss = (1.0 - smoothing) * loss - smoothing * mean_log_probs
|
||||
|
||||
ctx.label_smoothing, ctx.vocab_size = label_smoothing, vocab_size
|
||||
|
||||
# Store softmax, target-mask and masked-target for backward pass.
|
||||
ctx.save_for_backward(exp_logits, target_mask, masked_target_1d)
|
||||
|
||||
return loss
|
||||
|
||||
@staticmethod
|
||||
def backward(ctx, grad_output):
|
||||
|
||||
# Retrieve tensors from the forward path.
|
||||
softmax, target_mask, masked_target_1d = ctx.saved_tensors
|
||||
label_smoothing, vocab_size = ctx.label_smoothing, ctx.vocab_size
|
||||
|
||||
# All the inputs have softmax as their gradient.
|
||||
grad_input = softmax
|
||||
# For simplicity, work with the 2D gradient.
|
||||
partition_vocab_size = softmax.size()[-1]
|
||||
grad_2d = grad_input.view(-1, partition_vocab_size)
|
||||
|
||||
# Add the gradient from matching classes.
|
||||
arange_1d = torch.arange(start=0, end=grad_2d.size()[0],
|
||||
device=grad_2d.device).long()
|
||||
masked_target_1d = masked_target_1d.long()
|
||||
|
||||
softmax_update = 1.0 - target_mask.view(-1).float()
|
||||
|
||||
if label_smoothing > 0:
|
||||
smoothing = label_smoothing * vocab_size / (vocab_size - 1)
|
||||
grad_2d[arange_1d, masked_target_1d] -= (1.0 - smoothing) * softmax_update
|
||||
average_grad = 1 / vocab_size
|
||||
grad_2d[arange_1d, :] -= smoothing * average_grad
|
||||
else:
|
||||
grad_2d[arange_1d, masked_target_1d] -= softmax_update
|
||||
|
||||
# Finally elementwise multiplication with the output gradients.
|
||||
grad_input.mul_(grad_output.unsqueeze(dim=-1))
|
||||
|
||||
return grad_input, None, None
|
||||
|
||||
|
||||
def vocab_parallel_cross_entropy(vocab_parallel_logits, target, label_smoothing=0.0):
|
||||
"""
|
||||
Performs cross entropy loss when logits are split across tensor parallel ranks
|
||||
|
||||
Arguments:
|
||||
vocab_parallel_logits: logits split across tensor parallel ranks
|
||||
dimension is [sequence_length, batch_size, hidden_size]
|
||||
|
||||
target: correct vocab ids of dimension [sequence_length, micro_batch_size]
|
||||
|
||||
label_smoothing: smoothing factor, must be in range [0.0, 1.0)
|
||||
default is no smoothing (=0.0)
|
||||
"""
|
||||
return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target, label_smoothing)
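# --- Illustrative sketch (not part of the original file): minimal usage,
# assuming tensor parallelism is initialized and every rank holds its own
# vocab shard of the logits, with shapes as in the docstring above.
def _sketch_parallel_loss(vocab_parallel_logits, target):
    # vocab_parallel_logits: [seq_len, micro_batch, vocab_size // tp_world_size] per rank
    # target: [seq_len, micro_batch] holding global vocab ids
    loss = vocab_parallel_cross_entropy(vocab_parallel_logits, target, label_smoothing=0.0)
    return loss.float().mean()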
|
|
@ -1,121 +0,0 @@
|
|||
# coding=utf-8
|
||||
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import torch
|
||||
|
||||
from modellink.core.parallel_state import (
|
||||
get_tensor_model_parallel_group,
|
||||
get_tensor_model_parallel_rank,
|
||||
get_tensor_model_parallel_src_rank,
|
||||
)
|
||||
from modellink.error_utils import check_exist
|
||||
|
||||
_MAX_DATA_DIM = 5
|
||||
|
||||
|
||||
def _check_data_types(keys, data, target_dtype):
|
||||
"""Check that all the keys have the same target data type."""
|
||||
for key in keys:
|
||||
if data[key].dtype != target_dtype:
|
||||
raise Exception('{} has data type {} which is different than {}'.format(key, data[key].dtype, target_dtype))
|
||||
|
||||
|
||||
def _build_key_size_numel_dictionaries(keys, data):
|
||||
"""Build the size on rank 0 and broadcast."""
|
||||
max_dim = _MAX_DATA_DIM
|
||||
sizes = [0 for _ in range(max_dim) for _ in keys]
|
||||
|
||||
# Pack the sizes on rank zero.
|
||||
if get_tensor_model_parallel_rank() == 0:
|
||||
offset = 0
|
||||
for key in keys:
|
||||
if data[key].dim() >= max_dim:
|
||||
raise Exception('you should increase MAX_DATA_DIM.')
|
||||
size = data[key].size()
|
||||
for i, s in enumerate(size):
|
||||
sizes[i + offset] = s
|
||||
offset += max_dim
|
||||
|
||||
# Move to GPU and broadcast.
|
||||
sizes_cuda = torch.LongTensor(sizes).cuda()
|
||||
torch.distributed.broadcast(sizes_cuda, get_tensor_model_parallel_src_rank(),
|
||||
group=get_tensor_model_parallel_group())
|
||||
|
||||
# Move back to cpu and unpack.
|
||||
sizes_cpu = sizes_cuda.cpu()
|
||||
key_size = {}
|
||||
key_numel = {}
|
||||
total_numel = 0
|
||||
offset = 0
|
||||
for key in keys:
|
||||
i = 0
|
||||
size = []
|
||||
numel = 1
|
||||
while sizes_cpu[offset + i] > 0:
|
||||
this_size = sizes_cpu[offset + i]
|
||||
size.append(this_size)
|
||||
numel *= this_size
|
||||
i += 1
|
||||
key_size[key] = size
|
||||
key_numel[key] = numel
|
||||
total_numel += numel
|
||||
offset += max_dim
|
||||
|
||||
return key_size, key_numel, total_numel
|
||||
|
||||
|
||||
def broadcast_data(keys, data, datatype):
|
||||
"""Broadcast data from rank zero of each model parallel group to the
|
||||
members of the same model parallel group.
|
||||
|
||||
Arguments:
|
||||
keys: list of keys in the data dictionary to be broadcast
|
||||
data: data dictionary of string keys and cpu tensor values.
|
||||
datatype: torch data type of all tensors in data associated
|
||||
with keys.
|
||||
"""
|
||||
# Build (key, size) and (key, number of elements) dictionaries along
|
||||
# with the total number of elements on all ranks.
|
||||
key_size, key_numel, total_numel = _build_key_size_numel_dictionaries(keys,
|
||||
data)
|
||||
|
||||
# Pack on rank zero.
|
||||
if get_tensor_model_parallel_rank() == 0:
|
||||
# Check that all keys have the same data type.
|
||||
_check_data_types(keys, data, datatype)
|
||||
# Flatten the data associated with the keys
|
||||
flatten_data = torch.cat(
|
||||
[data[key].contiguous().view(-1) for key in keys], dim=0).to(torch.cuda.current_device())
|
||||
else:
|
||||
flatten_data = torch.empty(total_numel,
|
||||
device=torch.cuda.current_device(),
|
||||
dtype=datatype)
|
||||
|
||||
# Broadcast
|
||||
torch.distributed.broadcast(flatten_data, get_tensor_model_parallel_src_rank(),
|
||||
group=get_tensor_model_parallel_group())
|
||||
|
||||
# Unpack
|
||||
output = {}
|
||||
offset = 0
|
||||
for key in keys:
|
||||
if check_exist(key, key_size):
|
||||
size = key_size[key]
|
||||
if check_exist(key, key_numel):
|
||||
numel = key_numel[key]
|
||||
output[key] = flatten_data.narrow(0, offset, numel).view(size)
|
||||
offset += numel
|
||||
|
||||
return output
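# --- Illustrative sketch (not part of the original file): typical usage,
# assuming the parallel groups are initialized. Only tensor-parallel rank 0
# reads real data; other ranks pass None and receive the broadcast tensors.
# The keys and the data_iterator argument are hypothetical, for illustration.
def _sketch_broadcast_batch(data_iterator):
    keys = ['input_ids', 'labels']
    data = next(data_iterator) if get_tensor_model_parallel_rank() == 0 else None
    batch = broadcast_data(keys, data, torch.int64)
    return batch['input_ids'], batch['labels']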
|
|
@ -1,878 +0,0 @@
|
|||
# coding=utf-8
|
||||
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
|
||||
|
||||
import os
|
||||
from typing import Optional, Callable
|
||||
import warnings
|
||||
|
||||
import torch
|
||||
import torch.nn.functional as F
|
||||
import torch.nn.init as init
|
||||
from torch.cuda.amp import custom_fwd, custom_bwd
|
||||
from torch.nn.parameter import Parameter
|
||||
|
||||
from modellink import get_args
|
||||
from modellink.error_utils import ensure_valid
|
||||
from modellink.core import parallel_state
|
||||
from modellink.core.parallel_state import get_tensor_model_parallel_rank
|
||||
from modellink.core.parallel_state import get_tensor_model_parallel_group
|
||||
from modellink.core.parallel_state import get_tensor_model_parallel_world_size
|
||||
from modellink.core.parallel_state import get_global_memory_buffer
|
||||
from modellink.core.model_parallel_config import ModelParallelConfig
|
||||
from modellink.core.tensor_parallel.mappings import copy_to_tensor_model_parallel_region
|
||||
from modellink.core.tensor_parallel.mappings import gather_from_tensor_model_parallel_region
|
||||
from modellink.core.tensor_parallel.mappings import reduce_from_tensor_model_parallel_region
|
||||
from modellink.core.tensor_parallel.mappings import scatter_to_tensor_model_parallel_region
|
||||
from modellink.core.tensor_parallel.mappings import reduce_scatter_to_sequence_parallel_region
|
||||
# Do not delete the following line of code
|
||||
from modellink.core.tensor_parallel.random import get_cuda_rng_tracker
|
||||
from modellink.core.utils import divide
|
||||
from modellink.core.tensor_parallel.utils import VocabUtility
|
||||
from modellink.model.fused_layer_norm import MixedFusedLayerNorm as LayerNorm
|
||||
from modellink.arguments import core_transformer_config_from_args
|
||||
|
||||
|
||||
_grad_accum_fusion_available = True
|
||||
try:
|
||||
import fused_weight_gradient_mlp_cuda
|
||||
except ImportError:
|
||||
_grad_accum_fusion_available = False
|
||||
|
||||
_MODEL_PARALLEL_ATTRIBUTE_DEFAULTS = {
|
||||
'tensor_model_parallel': False,
|
||||
'partition_dim': -1,
|
||||
'partition_stride': 1
|
||||
}
|
||||
|
||||
|
||||
def param_is_not_tensor_parallel_duplicate(param):
|
||||
return (hasattr(param, 'tensor_model_parallel') and
|
||||
param.tensor_model_parallel) or (
|
||||
get_tensor_model_parallel_rank() == 0)
|
||||
|
||||
|
||||
def set_tensor_model_parallel_attributes(tensor, is_parallel, dim, stride):
|
||||
# Make sure the attributes are not set.
|
||||
for attribute in _MODEL_PARALLEL_ATTRIBUTE_DEFAULTS:
|
||||
if hasattr(tensor, attribute):
|
||||
raise Exception("Make sure the attributes are not set.")
|
||||
# Set the attributes.
|
||||
setattr(tensor, 'tensor_model_parallel', is_parallel)
|
||||
setattr(tensor, 'partition_dim', dim)
|
||||
setattr(tensor, 'partition_stride', stride)
|
||||
|
||||
|
||||
def set_defaults_if_not_set_tensor_model_parallel_attributes(tensor):
|
||||
def maybe_set(attribute, value):
|
||||
if not hasattr(tensor, attribute):
|
||||
setattr(tensor, attribute, value)
|
||||
for attribute in _MODEL_PARALLEL_ATTRIBUTE_DEFAULTS:
|
||||
maybe_set(attribute, _MODEL_PARALLEL_ATTRIBUTE_DEFAULTS[attribute])
|
||||
|
||||
|
||||
def copy_tensor_model_parallel_attributes(destination_tensor, source_tensor):
|
||||
def maybe_copy(attribute):
|
||||
if hasattr(source_tensor, attribute):
|
||||
setattr(destination_tensor, attribute,
|
||||
getattr(source_tensor, attribute))
|
||||
for attribute in _MODEL_PARALLEL_ATTRIBUTE_DEFAULTS:
|
||||
maybe_copy(attribute)
|
||||
|
||||
|
||||
def _initialize_affine_weight_gpu(weight, init_method,
|
||||
partition_dim, stride=1):
|
||||
"""Initialize affine weight for model parallel on GPU."""
|
||||
|
||||
set_tensor_model_parallel_attributes(tensor=weight,
|
||||
is_parallel=True,
|
||||
dim=partition_dim,
|
||||
stride=stride)
|
||||
|
||||
with get_cuda_rng_tracker().fork():
|
||||
init_method(weight)
|
||||
|
||||
|
||||
def _initialize_affine_weight_cpu(weight, output_size, input_size,
|
||||
per_partition_size, partition_dim,
|
||||
init_method, stride=1,
|
||||
return_master_weight=False, *, params_dtype=torch.float32):
|
||||
"""Initialize affine weight for model parallel.
|
||||
|
||||
Build the master weight on all processes and scatter
|
||||
the relevant chunk."""
|
||||
|
||||
set_tensor_model_parallel_attributes(tensor=weight,
|
||||
is_parallel=True,
|
||||
dim=partition_dim,
|
||||
stride=stride)
|
||||
|
||||
# Initialize master weight
|
||||
master_weight = torch.empty(output_size, input_size,
|
||||
dtype=torch.float,
|
||||
requires_grad=False)
|
||||
init_method(master_weight)
|
||||
master_weight = master_weight.to(dtype=params_dtype)
|
||||
|
||||
# Split and copy
|
||||
per_partition_per_stride_size = divide(per_partition_size, stride)
|
||||
weight_list = torch.split(master_weight, per_partition_per_stride_size,
|
||||
dim=partition_dim)
|
||||
rank = get_tensor_model_parallel_rank()
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
my_weight_list = weight_list[rank::world_size]
|
||||
|
||||
with torch.no_grad():
|
||||
torch.cat(my_weight_list, dim=partition_dim, out=weight)
|
||||
if return_master_weight:
|
||||
return master_weight
|
||||
return None
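# --- Illustrative sketch (not part of the original file): the scatter above on
# toy sizes. With output_size=8, world_size=2 and stride=1, torch.split yields
# two 4-row chunks and rank r keeps weight_list[r::2], so rank 0 gets rows 0-3
# and rank 1 gets rows 4-7; with stride=2 each rank keeps two interleaved
# 2-row chunks instead.
def _sketch_cpu_weight_split(output_size=8, input_size=3, world_size=2, stride=1):
    master = torch.arange(output_size * input_size, dtype=torch.float).view(output_size, input_size)
    per_partition = output_size // world_size
    chunks = torch.split(master, per_partition // stride, dim=0)
    return [torch.cat(chunks[rank::world_size], dim=0) for rank in range(world_size)]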
|
||||
|
||||
|
||||
class VocabParallelEmbedding(torch.nn.Module):
|
||||
"""Embedding parallelized in the vocabulary dimension.
|
||||
|
||||
This is mainly adapted from torch.nn.Embedding and all the default
|
||||
values are kept.
|
||||
Arguments:
|
||||
num_embeddings: vocabulary size.
|
||||
embedding_dim: size of hidden state.
|
||||
init_method: method to initialize weights.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
num_embeddings: int,
|
||||
embedding_dim: int,
|
||||
*,
|
||||
init_method: Callable,
|
||||
config: ModelParallelConfig,
|
||||
):
|
||||
super(VocabParallelEmbedding, self).__init__()
|
||||
# Keep the input dimensions.
|
||||
self.num_embeddings = num_embeddings
|
||||
self.embedding_dim = embedding_dim
|
||||
# Set the defaults for compatibility.
|
||||
self.padding_idx = None
|
||||
self.max_norm = None
|
||||
self.norm_type = 2.0
|
||||
self.scale_grad_by_freq = False
|
||||
self.sparse = False
|
||||
self._weight = None
|
||||
self.tensor_model_parallel_size = get_tensor_model_parallel_world_size()
|
||||
# Divide the weight matrix along the vocabulary dimension.
|
||||
self.vocab_start_index, self.vocab_end_index = \
|
||||
VocabUtility.vocab_range_from_global_vocab_size(
|
||||
self.num_embeddings, get_tensor_model_parallel_rank(),
|
||||
self.tensor_model_parallel_size)
|
||||
self.num_embeddings_per_partition = self.vocab_end_index - \
|
||||
self.vocab_start_index
|
||||
args = get_args()
|
||||
if parallel_state.is_pipeline_first_stage() and args.embed_layernorm:
|
||||
self.norm = LayerNorm(embedding_dim)
|
||||
# Allocate weights and initialize.
|
||||
if config.use_cpu_initialization:
|
||||
self.weight = Parameter(
|
||||
torch.empty(
|
||||
self.num_embeddings_per_partition, self.embedding_dim, dtype=config.params_dtype
|
||||
)
|
||||
)
|
||||
if config.perform_initialization:
|
||||
_initialize_affine_weight_cpu(
|
||||
self.weight,
|
||||
self.num_embeddings,
|
||||
self.embedding_dim,
|
||||
self.num_embeddings_per_partition,
|
||||
0,
|
||||
init_method,
|
||||
params_dtype=config.params_dtype,
|
||||
)
|
||||
else:
|
||||
self.weight = Parameter(
|
||||
torch.empty(
|
||||
self.num_embeddings_per_partition,
|
||||
self.embedding_dim,
|
||||
device=torch.cuda.current_device(),
|
||||
dtype=config.params_dtype,
|
||||
)
|
||||
)
|
||||
|
||||
if config.perform_initialization:
|
||||
_initialize_affine_weight_gpu(self.weight, init_method, partition_dim=0, stride=1)
|
||||
|
||||
|
||||
def forward(self, input_):
|
||||
if self.tensor_model_parallel_size > 1:
|
||||
# Build the mask.
|
||||
input_mask = (input_ < self.vocab_start_index) | \
|
||||
(input_ >= self.vocab_end_index)
|
||||
# Mask the input.
|
||||
masked_input = input_.clone() - self.vocab_start_index
|
||||
input_mask = ~input_mask
|
||||
masked_input *= input_mask.long()
|
||||
else:
|
||||
masked_input = input_
|
||||
# Get the embeddings.
|
||||
output_parallel = F.embedding(masked_input, self.weight,
|
||||
self.padding_idx, self.max_norm,
|
||||
self.norm_type, self.scale_grad_by_freq,
|
||||
self.sparse)
|
||||
# Mask the output embedding.
|
||||
if self.tensor_model_parallel_size > 1:
|
||||
output_parallel *= input_mask[..., None].to(output_parallel.dtype)
|
||||
# Reduce across all the model parallel GPUs.
|
||||
output = reduce_from_tensor_model_parallel_region(output_parallel)
|
||||
if hasattr(self, 'norm'):
|
||||
output = self.norm(output)
|
||||
|
||||
return output
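# Hedged illustration (not part of the original file): the masking done in forward()
# for a 32000-token vocabulary split over 4 tensor-parallel ranks. Out-of-range
# tokens are clamped to local row 0 and their embeddings zeroed, and the final
# all-reduce sums the partial results so every rank ends up with the full output.
def _example_vocab_mask(token_id=17000, rank=2, vocab_size=32000, world_size=4):
    per_rank = vocab_size // world_size                   # 8000 rows per rank
    start, end = rank * per_rank, (rank + 1) * per_rank   # rank 2 owns [16000, 24000)
    in_range = start <= token_id < end
    local_index = token_id - start if in_range else 0     # masked tokens hit row 0
    return in_range, local_index
# _example_vocab_mask() -> (True, 1000)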
|
||||
|
||||
|
||||
class SequenceParallelPositionEmbedding(torch.nn.Module):
|
||||
"""Embedding parallelized in the sequence dimension.
|
||||
|
||||
Arguments:
|
||||
sequence_length: max sequence length.
|
||||
embedding_dim: size of hidden state.
|
||||
"""
|
||||
|
||||
def __init__(self, sequence_length, embedding_dim):
|
||||
super(SequenceParallelPositionEmbedding, self).__init__()
|
||||
sequence_parallel_size = get_tensor_model_parallel_world_size()
|
||||
ensure_valid(sequence_length % sequence_parallel_size == 0)
|
||||
local_sequence_length = sequence_length // sequence_parallel_size
|
||||
self.offset = local_sequence_length * get_tensor_model_parallel_rank()
|
||||
self.local_embeddings = torch.nn.Embedding(
|
||||
local_sequence_length, embedding_dim)
|
||||
|
||||
def forward(self, position_ids):
|
||||
return self.local_embeddings(position_ids - self.offset)
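# Hedged illustration (not part of the original file): each rank only stores
# sequence_length // world_size positions, so incoming position_ids are shifted by
# the rank-local offset before the embedding lookup above.
def _example_position_range(sequence_length=4096, world_size=8, rank=3):
    local_len = sequence_length // world_size   # 512 positions per rank
    offset = local_len * rank                   # rank 3 starts at position 1536
    return offset, offset + local_len
# _example_position_range() -> (1536, 2048)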
|
||||
|
||||
|
||||
class LinearWithGradAccumulationAndAsyncCommunication(torch.autograd.Function):
|
||||
"""See linear_with_grad_accumulation_and_async_allreduce"""
|
||||
|
||||
@staticmethod
|
||||
@custom_fwd
|
||||
def forward(
|
||||
ctx,
|
||||
inputs,
|
||||
weight,
|
||||
bias,
|
||||
gradient_accumulation_fusion,
|
||||
async_grad_allreduce,
|
||||
sequence_parallel,
|
||||
):
|
||||
ctx.save_for_backward(inputs, weight)
|
||||
ctx.use_bias = bias is not None
|
||||
ctx.gradient_accumulation_fusion = gradient_accumulation_fusion
|
||||
ctx.async_grad_allreduce = async_grad_allreduce
|
||||
ctx.sequence_parallel = sequence_parallel
|
||||
|
||||
if sequence_parallel:
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
dim_size = list(inputs.size())
|
||||
dim_size[0] = dim_size[0] * world_size
|
||||
|
||||
all_gather_buffer = get_global_memory_buffer().get_tensor(dim_size, inputs.dtype, "mpu")
|
||||
torch.distributed._all_gather_base(
|
||||
all_gather_buffer, inputs, group=get_tensor_model_parallel_group()
|
||||
)
|
||||
total_input = all_gather_buffer
|
||||
else:
|
||||
total_input = inputs
|
||||
output = torch.matmul(total_input, weight.t())
|
||||
if bias is not None:
|
||||
output = output + bias
|
||||
return output
|
||||
|
||||
@staticmethod
|
||||
@custom_bwd
|
||||
def backward(ctx, grad_output):
|
||||
inputs, weight = ctx.saved_tensors
|
||||
use_bias = ctx.use_bias
|
||||
|
||||
if ctx.sequence_parallel:
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
dim_size = list(inputs.size())
|
||||
dim_size[0] = dim_size[0] * world_size
|
||||
|
||||
all_gather_buffer = get_global_memory_buffer().get_tensor(dim_size, inputs.dtype, "mpu")
|
||||
handle = torch.distributed._all_gather_base(
|
||||
all_gather_buffer, inputs, group=get_tensor_model_parallel_group(), async_op=True
|
||||
)
|
||||
|
||||
# Here we rely on CUDA_DEVICE_MAX_CONNECTIONS=1 to ensure that the
|
||||
# gather is scheduled before the input gradient computation
|
||||
total_input = all_gather_buffer
|
||||
else:
|
||||
total_input = inputs
|
||||
grad_input = grad_output.matmul(weight)
|
||||
|
||||
if ctx.sequence_parallel:
|
||||
handle.wait()
|
||||
|
||||
# Doing gather + slicing during the NeMo forward pass can make this tensor not be contiguous.
|
||||
# PyTorch only checks if the tensor is contiguous.
|
||||
grad_output = grad_output.contiguous()
|
||||
# Convert the tensor shapes to 2D for execution compatibility
|
||||
grad_output = grad_output.view(
|
||||
grad_output.shape[0] * grad_output.shape[1], grad_output.shape[2]
|
||||
)
|
||||
total_input = total_input.view(
|
||||
total_input.shape[0] * total_input.shape[1], total_input.shape[2]
|
||||
)
|
||||
|
||||
if ctx.async_grad_allreduce:
|
||||
# Asynchronous all-reduce
|
||||
handle = torch.distributed.all_reduce(
|
||||
grad_input, group=get_tensor_model_parallel_group(), async_op=True
|
||||
)
|
||||
# Here we rely on CUDA_DEVICE_MAX_CONNECTIONS=1 to ensure that the
|
||||
# all-reduce is scheduled before the weight gradient computation
|
||||
|
||||
if ctx.sequence_parallel:
|
||||
if ctx.async_grad_allreduce:
|
||||
raise Exception("async_grad_allreduce must be False")
|
||||
dim_size = list(inputs.size())
|
||||
sub_grad_input = torch.empty(
|
||||
dim_size, dtype=inputs.dtype, device=torch.cuda.current_device(), requires_grad=False
|
||||
)
|
||||
# reduce_scatter
|
||||
handle = torch.distributed._reduce_scatter_base(
|
||||
sub_grad_input, grad_input, group=get_tensor_model_parallel_group(), async_op=True
|
||||
)
|
||||
# Here we rely on CUDA_DEVICE_MAX_CONNECTIONS=1 to ensure that the
|
||||
# reduce scatter is scheduled before the weight gradient computation
|
||||
|
||||
if ctx.gradient_accumulation_fusion:
|
||||
if hasattr(weight, 'main_grad'):
|
||||
if weight.main_grad.dtype == torch.float32:
|
||||
fused_weight_gradient_mlp_cuda.wgrad_gemm_accum_fp32(
|
||||
total_input, grad_output, weight.main_grad
|
||||
)
|
||||
elif weight.main_grad.dtype in (torch.float16, torch.bfloat16):
|
||||
fused_weight_gradient_mlp_cuda.wgrad_gemm_accum_fp16(
|
||||
total_input, grad_output, weight.main_grad
|
||||
)
|
||||
else:
|
||||
raise RuntimeError("Unsupported gradient type for gradient accumulation fusion")
|
||||
grad_weight = None
|
||||
else:
|
||||
grad_weight = grad_output.t().matmul(total_input)
|
||||
grad_bias = grad_output.sum(dim=0) if use_bias else None
|
||||
|
||||
if ctx.sequence_parallel:
|
||||
handle.wait()
|
||||
return sub_grad_input, grad_weight, grad_bias, None, None, None
|
||||
|
||||
if ctx.async_grad_allreduce:
|
||||
handle.wait()
|
||||
|
||||
return grad_input, grad_weight, grad_bias, None, None, None
|
||||
|
||||
|
||||
def linear_with_grad_accumulation_and_async_allreduce(
|
||||
inputs: torch.Tensor,
|
||||
weight: torch.Tensor,
|
||||
bias: Optional[torch.Tensor],
|
||||
gradient_accumulation_fusion: bool,
|
||||
async_grad_allreduce: bool,
|
||||
sequence_parallel: bool,
|
||||
) -> torch.Tensor:
|
||||
"""Linear layer execution with asynchronous communication and
|
||||
gradient accumulation fusion in backprop.
|
||||
|
||||
This has the option to accumulate the result of backprop
|
||||
calculation into an existing gradient buffer, preventing the need
|
||||
to do an additional addition kernel after the gradient
|
||||
calculation.
|
||||
|
||||
Additionally, the tensor parallel all reduce of the input
|
||||
gradients can be done asynchronously with the calculation of
|
||||
the weight gradients.
|
||||
|
||||
In the case of sequence parallelism, the reduce scatter of the
|
||||
input gradients is done asynchronously with the calculation of the
|
||||
weight gradients.
|
||||
|
||||
Use of this module requires that the environment variable
|
||||
CUDA_DEVICE_MAX_CONNECTIONS=1. There are a few collective
|
||||
operations, noted in the code, that should be scheduled before
|
||||
compute kernels to overlap the communication with the computation,
|
||||
which is necessary for a speedup but not for correctness so that
|
||||
ordering isn't imposed by the scheduler. Setting
|
||||
CUDA_DEVICE_MAX_CONNECTIONS=1 forces the kernels to be scheduled
|
||||
in the order they are called.
|
||||
|
||||
Arguments:
|
||||
|
||||
input (torch.Tensor required): input like torch.nn.functional.linear
|
||||
|
||||
weight (torch.Tensor required): weight like torch.nn.functional.linear
|
||||
|
||||
bias (torch.Tensor optional): bias like torch.nn.functional.linear
|
||||
|
||||
gradient_accumulation_fusion (bool required): Perform the gradient
|
||||
accumulation fusion, requires the custom CUDA extension
|
||||
fused_weight_gradient_mlp_cuda module. To use
|
||||
gradient_accumulation_fusion you must install APEX with
|
||||
--cpp_ext and --cuda_ext. For example: "pip install
|
||||
--global-option=\"--cpp_ext\" --global-option=\"--cuda_ext .\"
|
||||
" Note that the extension requires CUDA>=11. Otherwise, you
|
||||
must turn off gradient accumulation fusion."
|
||||
|
||||
async_grad_allreduce (bool required): Do the allreduce of input
|
||||
gradients asynchronously with the computation of weight
|
||||
gradients. If sequence_parallel is True, this must be
|
||||
False, as no all reduce is performed.
|
||||
|
||||
sequence_parallel (bool required): Indicates that sequence
|
||||
parallelism is used and thus in the forward pass the input is
|
||||
all gathered, and the backward pass the input gradients are
|
||||
reduce scattered.
|
||||
"""
|
||||
args = [
|
||||
inputs,
|
||||
weight,
|
||||
bias,
|
||||
gradient_accumulation_fusion,
|
||||
async_grad_allreduce,
|
||||
sequence_parallel,
|
||||
]
|
||||
|
||||
if not linear_with_grad_accumulation_and_async_allreduce.warned:
|
||||
if os.environ.get('CUDA_DEVICE_MAX_CONNECTIONS') != "1":
|
||||
if sequence_parallel:
|
||||
warnings.warn(
|
||||
"When using sequence parallelism it is recommended to set the "
|
||||
"environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1 for "
|
||||
"maximum speedup"
|
||||
)
|
||||
linear_with_grad_accumulation_and_async_allreduce.warned = True
|
||||
|
||||
if async_grad_allreduce:
|
||||
warnings.warn(
|
||||
"When using async grad allreduce it is recommended to set the "
|
||||
"environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1 for "
|
||||
"maximum speedup"
|
||||
)
|
||||
linear_with_grad_accumulation_and_async_allreduce.warned = True
|
||||
|
||||
return LinearWithGradAccumulationAndAsyncCommunication.apply(*args)
|
||||
|
||||
|
||||
linear_with_grad_accumulation_and_async_allreduce.warned = False
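# Hedged usage sketch (not part of the original file): how a parallel linear layer
# typically invokes the wrapper above. Export CUDA_DEVICE_MAX_CONNECTIONS=1 before
# launching so the collectives overlap with the matmuls as the docstring explains.
def _example_forward_call(input_parallel, weight):
    return linear_with_grad_accumulation_and_async_allreduce(
        inputs=input_parallel,                 # [sequence, batch, hidden]
        weight=weight,                         # [output_per_partition, hidden]
        bias=None,
        gradient_accumulation_fusion=False,
        async_grad_allreduce=True,             # must be False when sequence_parallel=True
        sequence_parallel=False,
    )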
|
||||
|
||||
|
||||
class ColumnParallelLinear(torch.nn.Module):
|
||||
"""Linear layer with column parallelism.
|
||||
|
||||
The linear layer is defined as Y = XA + b. A is parallelized along
|
||||
its second dimension as A = [A_1, ..., A_p].
|
||||
|
||||
Arguments:
|
||||
input_size: first dimension of matrix A.
|
||||
output_size: second dimension of matrix A.
|
||||
|
||||
Keyword Arguments
|
||||
bias: If true, add bias
|
||||
gather_output: If true, call all-gather on output and make Y available
|
||||
to all GPUs, otherwise, every GPU will have its output
|
||||
which is Y_i = XA_i
|
||||
init_method: method to initialize weights. Note that bias is always set
|
||||
to zero.
|
||||
stride: For the strided linear layers.
|
||||
keep_master_weight_for_test: This was added for testing and should be
|
||||
set to False. It returns the master weights
|
||||
used for initialization.
|
||||
skip_bias_add: If True, do not add the bias term, instead
|
||||
return it to be added by the caller. This
|
||||
enables performance optimizations where bias can
|
||||
be fused with other elementwise operations.
|
||||
|
||||
skip_weight_param_allocation: If True, weight parameter is not allocated and must be passed
|
||||
as a keyword argument `weight` during the forward pass. Note
|
||||
that this does not affect bias, which will be allocated if
|
||||
bias is True. Defaults to False.
|
||||
|
||||
config: ModelParallelConfig object
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
input_size,
|
||||
output_size,
|
||||
*,
|
||||
config: ModelParallelConfig = None,
|
||||
init_method: Callable = init.xavier_normal_,
|
||||
bias=True,
|
||||
gather_output=False,
|
||||
stride=1,
|
||||
keep_master_weight_for_test=False,
|
||||
skip_bias_add=False,
|
||||
skip_weight_param_allocation: bool = False,
|
||||
moe=False, enable_expert_tensor_parallelism=False,
|
||||
dtype=None
|
||||
):
|
||||
torch.nn.Module.__init__(self)
|
||||
# Keep input parameters
|
||||
self.input_size = input_size
|
||||
self.output_size = output_size
|
||||
self.gather_output = gather_output
|
||||
# Divide the weight matrix along the last dimension.
|
||||
if moe and (not enable_expert_tensor_parallelism):
|
||||
world_size = 1
|
||||
self.is_expert_without_slicing = True
|
||||
else:
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
self.is_expert_without_slicing = False
|
||||
|
||||
self.output_size_per_partition = divide(output_size, world_size)
|
||||
self.skip_bias_add = skip_bias_add
|
||||
if config is None:
|
||||
config = core_transformer_config_from_args(get_args())
|
||||
self.config = config
|
||||
dtype = config.params_dtype if dtype is None else dtype
|
||||
# Parameters.
|
||||
# Note: torch.nn.functional.linear performs XA^T + b and as a result
|
||||
# we allocate the transpose.
|
||||
# Initialize weight.
|
||||
if not skip_weight_param_allocation:
|
||||
if config.use_cpu_initialization:
|
||||
self.weight = Parameter(
|
||||
torch.empty(
|
||||
self.output_size_per_partition, self.input_size, dtype=dtype
|
||||
)
|
||||
)
|
||||
if config.perform_initialization:
|
||||
self.master_weight = _initialize_affine_weight_cpu(
|
||||
self.weight,
|
||||
self.output_size,
|
||||
self.input_size,
|
||||
self.output_size_per_partition,
|
||||
0,
|
||||
init_method,
|
||||
stride=stride,
|
||||
return_master_weight=keep_master_weight_for_test,
|
||||
)
|
||||
else:
|
||||
self.weight = Parameter(
|
||||
torch.empty(
|
||||
self.output_size_per_partition,
|
||||
self.input_size,
|
||||
device=torch.cuda.current_device(),
|
||||
dtype=dtype,
|
||||
)
|
||||
)
|
||||
if config.perform_initialization:
|
||||
_initialize_affine_weight_gpu(
|
||||
self.weight, init_method, partition_dim=0, stride=stride
|
||||
)
|
||||
else:
|
||||
self.weight = None
|
||||
|
||||
if bias:
|
||||
if config.use_cpu_initialization:
|
||||
self.bias = Parameter(
|
||||
torch.empty(self.output_size_per_partition, dtype=dtype)
|
||||
)
|
||||
else:
|
||||
self.bias = Parameter(
|
||||
torch.empty(
|
||||
self.output_size_per_partition,
|
||||
device=torch.cuda.current_device(),
|
||||
dtype=dtype,
|
||||
)
|
||||
)
|
||||
set_tensor_model_parallel_attributes(self.bias, True, 0, stride)
|
||||
if config.perform_initialization:
|
||||
# Always initialize bias to zero.
|
||||
with torch.no_grad():
|
||||
self.bias.zero_()
|
||||
else:
|
||||
self.register_parameter('bias', None)
|
||||
|
||||
self.async_tensor_model_parallel_allreduce = (
|
||||
config.async_tensor_model_parallel_allreduce and world_size > 1
|
||||
)
|
||||
|
||||
self.sequence_parallel = config.sequence_parallel
|
||||
if self.sequence_parallel and world_size <= 1:
|
||||
warnings.warn(
|
||||
f"`sequence_parallel` is set to `True`, but tensor model parallel size is {world_size}. "
|
||||
f"Disabling sequence parallel."
|
||||
)
|
||||
self.sequence_parallel = False
|
||||
|
||||
if config.gradient_accumulation_fusion and not _grad_accum_fusion_available:
|
||||
raise RuntimeError(
|
||||
"ColumnParallelLinear was called with gradient_accumulation_fusion set "
|
||||
"to True but the custom CUDA extension fused_weight_gradient_mlp_cuda "
|
||||
"module is not found. To use gradient_accumulation_fusion you must "
|
||||
"install APEX with --cpp_ext and --cuda_ext. For example: "
|
||||
"pip install --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext .\" "
|
||||
"Note that the extension requires CUDA>=11. Otherwise, you must turn off "
|
||||
"gradient accumulation fusion."
|
||||
)
|
||||
self.gradient_accumulation_fusion = config.gradient_accumulation_fusion
|
||||
|
||||
if self.async_tensor_model_parallel_allreduce and self.sequence_parallel:
|
||||
raise RuntimeError(
|
||||
"`async_tensor_model_parallel_allreduce` and `sequence_parallel` "
|
||||
"cannot be enabled at the same time."
|
||||
)
|
||||
|
||||
self._forward_impl = linear_with_grad_accumulation_and_async_allreduce
|
||||
|
||||
def forward(self, input_: torch.Tensor, weight: Optional[torch.Tensor] = None):
|
||||
"""Forward of ColumnParallelLinear
|
||||
|
||||
Args:
|
||||
input_: 3D tensor whose order of dimension is [sequence, batch, hidden]
|
||||
|
||||
weight (optional): weight tensor to use, compulsory when
|
||||
skip_weight_param_allocation is True.
|
||||
|
||||
Returns:
|
||||
- output
|
||||
- bias
|
||||
|
||||
"""
|
||||
if weight is None:
|
||||
if self.weight is None:
|
||||
raise RuntimeError(
|
||||
"weight was not supplied to ColumnParallelLinear forward pass "
|
||||
"and skip_weight_param_allocation is True."
|
||||
)
|
||||
weight = self.weight
|
||||
else:
|
||||
# Check the weight passed in is the correct shape
|
||||
expected_shape = (self.output_size_per_partition, self.input_size)
|
||||
if weight.shape != expected_shape:
|
||||
raise RuntimeError(
|
||||
f"supplied weight's shape is {tuple(weight.shape)}, "
|
||||
f"not {expected_shape} as expected"
|
||||
)
|
||||
|
||||
bias = self.bias if not self.skip_bias_add else None
|
||||
# non-expert only tensor parallelism
|
||||
if self.async_tensor_model_parallel_allreduce or self.sequence_parallel or self.is_expert_without_slicing:
|
||||
input_parallel = input_
|
||||
else:
|
||||
input_parallel = copy_to_tensor_model_parallel_region(input_)
|
||||
# Matrix multiply.
|
||||
output_parallel = self._forward_impl(
|
||||
inputs=input_parallel,
|
||||
weight=weight,
|
||||
bias=bias,
|
||||
gradient_accumulation_fusion=self.gradient_accumulation_fusion,
|
||||
async_grad_allreduce=self.async_tensor_model_parallel_allreduce,
|
||||
sequence_parallel=self.sequence_parallel,
|
||||
)
|
||||
if self.gather_output and not self.is_expert_without_slicing:
|
||||
# All-gather across the partitions.
|
||||
if self.sequence_parallel:
|
||||
raise Exception("sequence parallel must be false")
|
||||
output = gather_from_tensor_model_parallel_region(output_parallel)
|
||||
else:
|
||||
output = output_parallel
|
||||
output_bias = self.bias if self.skip_bias_add else None
|
||||
return output, output_bias
|
||||
|
||||
def extra_repr(self) -> str:
|
||||
return f'input_size={self.input_size}, output_size={self.output_size}, bias={self.bias is not None}'
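# Hedged illustration (not part of the original file): output-shape arithmetic for
# ColumnParallelLinear. Each rank holds output_size // world_size columns of A, so
# with gather_output=False the hidden dimension of the per-rank output is the
# partitioned size rather than the full output_size.
def _example_column_parallel_shapes(seq=1024, batch=4, output_size=16384, world_size=8):
    out_per_rank = output_size // world_size     # 2048 columns on this rank
    local_shape = (seq, batch, out_per_rank)     # shape without the all-gather
    gathered_shape = (seq, batch, output_size)   # shape when gather_output=True
    return local_shape, gathered_shape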
|
||||
|
||||
|
||||
class RowParallelLinear(torch.nn.Module):
|
||||
"""Linear layer with row parallelism.
|
||||
|
||||
The linear layer is defined as Y = XA + b. A is parallelized along
|
||||
its first dimension and X along its second dimension as:
|
||||
- -
|
||||
| A_1 |
|
||||
| . |
|
||||
A = | . | X = [X_1, ..., X_p]
|
||||
| . |
|
||||
| A_p |
|
||||
- -
|
||||
Arguments:
|
||||
input_size: first dimension of matrix A.
|
||||
output_size: second dimension of matrix A.
|
||||
|
||||
Keyword Arguments:
|
||||
bias: If true, add bias. Note that bias is not parallelized.
|
||||
input_is_parallel: If true, we assume that the input is already
|
||||
split across the GPUs and we do not split
|
||||
again.
|
||||
init_method: method to initialize weights. Note that bias is always set
|
||||
to zero.
|
||||
stride: For the strided linear layers.
|
||||
keep_master_weight_for_test: This was added for testing and should be
|
||||
set to False. It returns the master weights
|
||||
used for initialization.
|
||||
skip_bias_add: If True, do not add the bias term, instead
|
||||
return it to be added by the caller. This
|
||||
enables performance optimizations where bias can
|
||||
be fused with other elementwise operations.
|
||||
config: ModelParallelConfig object
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
input_size: int,
|
||||
output_size: int,
|
||||
*,
|
||||
config: ModelParallelConfig = None,
|
||||
init_method: Callable = init.xavier_normal_,
|
||||
bias: bool = True,
|
||||
input_is_parallel: bool = False,
|
||||
stride: int = 1,
|
||||
keep_master_weight_for_test: bool = False,
|
||||
skip_bias_add: bool = False,
|
||||
moe=False, enable_expert_tensor_parallelism=False,
|
||||
dtype=None
|
||||
):
|
||||
torch.nn.Module.__init__(self)
|
||||
|
||||
# Keep input parameters
|
||||
self.input_size = input_size
|
||||
self.output_size = output_size
|
||||
self.input_is_parallel = input_is_parallel
|
||||
# Divide the weight matrix along the last dimension.
|
||||
|
||||
if moe and (not enable_expert_tensor_parallelism):
|
||||
world_size = 1
|
||||
else:
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
|
||||
self.is_expert_without_slicing = moe and world_size == 1
|
||||
|
||||
self.input_size_per_partition = divide(input_size, world_size)
|
||||
self.skip_bias_add = skip_bias_add
|
||||
if config is None:
|
||||
config = core_transformer_config_from_args(get_args())
|
||||
self.config = config
|
||||
self.gradient_accumulation_fusion = config.gradient_accumulation_fusion if config else False
|
||||
self.sequence_parallel = config.sequence_parallel if config else False
|
||||
if self.sequence_parallel and not self.input_is_parallel:
|
||||
raise RuntimeError("To enable `sequence_parallel`, `input_is_parallel` must be `True`")
|
||||
|
||||
# Parameters.
|
||||
# Note: torch.nn.functional.linear performs XA^T + b and as a result
|
||||
# we allocate the transpose.
|
||||
# Initialize weight.
|
||||
dtype = config.params_dtype if dtype is None else dtype
|
||||
if config.use_cpu_initialization:
|
||||
self.weight = Parameter(
|
||||
torch.empty(
|
||||
self.output_size, self.input_size_per_partition, dtype=dtype
|
||||
)
|
||||
)
|
||||
if config.perform_initialization:
|
||||
self.master_weight = _initialize_affine_weight_cpu(
|
||||
self.weight,
|
||||
self.output_size,
|
||||
self.input_size,
|
||||
self.input_size_per_partition,
|
||||
1,
|
||||
init_method,
|
||||
stride=stride,
|
||||
return_master_weight=keep_master_weight_for_test,
|
||||
params_dtype=dtype,
|
||||
)
|
||||
else:
|
||||
self.weight = Parameter(
|
||||
torch.empty(
|
||||
self.output_size,
|
||||
self.input_size_per_partition,
|
||||
device=torch.cuda.current_device(),
|
||||
dtype=dtype,
|
||||
)
|
||||
)
|
||||
if config.perform_initialization:
|
||||
_initialize_affine_weight_gpu(
|
||||
self.weight, init_method, partition_dim=1, stride=stride
|
||||
)
|
||||
if bias:
|
||||
if config.use_cpu_initialization:
|
||||
self.bias = Parameter(torch.empty(self.output_size, dtype=dtype))
|
||||
else:
|
||||
self.bias = Parameter(
|
||||
torch.empty(
|
||||
self.output_size,
|
||||
device=torch.cuda.current_device(),
|
||||
dtype=dtype,
|
||||
)
|
||||
)
|
||||
setattr(self.bias, 'sequence_parallel', self.sequence_parallel)
|
||||
|
||||
if config.perform_initialization:
|
||||
# Always initialize bias to zero.
|
||||
with torch.no_grad():
|
||||
self.bias.zero_()
|
||||
else:
|
||||
self.register_parameter('bias', None)
|
||||
|
||||
self._forward_impl = linear_with_grad_accumulation_and_async_allreduce
|
||||
|
||||
def forward(self, input_):
|
||||
"""Forward of RowParallelLinear
|
||||
|
||||
Args:
|
||||
input_: 3D tensor whose order of dimension is [sequence, batch, hidden]
|
||||
|
||||
Returns:
|
||||
- output
|
||||
- bias
|
||||
"""
|
||||
# Set up backprop all-reduce.
|
||||
if self.input_is_parallel or self.is_expert_without_slicing:
|
||||
input_parallel = input_
|
||||
else:
|
||||
if self.sequence_parallel:
|
||||
raise Exception("sequence parallel must be false")
|
||||
input_parallel = scatter_to_tensor_model_parallel_region(input_)
|
||||
# Matrix multiply.
|
||||
output_parallel = self._forward_impl(
|
||||
inputs=input_parallel,
|
||||
weight=self.weight,
|
||||
bias=None,
|
||||
gradient_accumulation_fusion=self.gradient_accumulation_fusion,
|
||||
async_grad_allreduce=False,
|
||||
sequence_parallel=False,
|
||||
)
|
||||
|
||||
# All-reduce across all the partitions.
|
||||
if self.sequence_parallel:
|
||||
output_ = reduce_scatter_to_sequence_parallel_region(output_parallel)
|
||||
elif self.is_expert_without_slicing: # non-expert only tensor-parallelism
|
||||
output_ = output_parallel
|
||||
else:
|
||||
output_ = reduce_from_tensor_model_parallel_region(output_parallel)
|
||||
if not self.skip_bias_add:
|
||||
output = output_ + self.bias if self.bias is not None else output_
|
||||
output_bias = None
|
||||
else:
|
||||
output = output_
|
||||
output_bias = self.bias
|
||||
return output, output_bias
|
||||
|
||||
def extra_repr(self) -> str:
|
||||
return f'input_size={self.input_size}, output_size={self.output_size}, bias={self.bias is not None}'
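# Hedged sketch (not part of the original file): the usual pairing of the two layer
# types above inside a tensor-parallel MLP. The column-parallel layer keeps its
# output split (gather_output=False) and the row-parallel layer consumes it directly
# (input_is_parallel=True), so a single all-reduce per forward pass suffices.
def _example_parallel_mlp(config, hidden_size=4096, ffn_hidden_size=16384):
    dense_h_to_4h = ColumnParallelLinear(
        hidden_size, ffn_hidden_size, config=config, gather_output=False)
    dense_4h_to_h = RowParallelLinear(
        ffn_hidden_size, hidden_size, config=config, input_is_parallel=True)
    return dense_h_to_4h, dense_4h_to_h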
@ -1,295 +0,0 @@
# coding=utf-8
|
||||
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import torch
|
||||
|
||||
from modellink.core.parallel_state import (
|
||||
get_tensor_model_parallel_group,
|
||||
get_tensor_model_parallel_world_size,
|
||||
get_tensor_model_parallel_rank
|
||||
)
|
||||
from modellink.core.tensor_parallel.utils import split_tensor_along_last_dim
|
||||
from modellink.error_utils import check_divisible
|
||||
|
||||
|
||||
def _reduce(input_):
|
||||
"""All-reduce the the input tensor across model parallel group."""
|
||||
|
||||
# Bypass the function if we are using only 1 GPU.
|
||||
if get_tensor_model_parallel_world_size() == 1:
|
||||
return input_
|
||||
|
||||
# All-reduce.
|
||||
torch.distributed.all_reduce(input_, group=get_tensor_model_parallel_group())
|
||||
|
||||
return input_
|
||||
|
||||
|
||||
def _split_along_last_dim(input_):
|
||||
"""Split the tensor along its last dimension and keep the
|
||||
corresponding slice."""
|
||||
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
# Bypass the function if we are using only 1 GPU.
|
||||
if world_size == 1:
|
||||
return input_
|
||||
|
||||
# Split along last dimension.
|
||||
input_list = split_tensor_along_last_dim(input_, world_size)
|
||||
|
||||
# Note: torch.split does not create contiguous tensors by default.
|
||||
rank = get_tensor_model_parallel_rank()
|
||||
output = input_list[rank].contiguous()
|
||||
|
||||
return output
|
||||
|
||||
|
||||
def _gather_along_last_dim(input_):
|
||||
"""Gather tensors and concatinate along the last dimension."""
|
||||
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
# Bypass the function if we are using only 1 GPU.
|
||||
if world_size == 1:
|
||||
return input_
|
||||
|
||||
# Size and dimension.
|
||||
last_dim = input_.dim() - 1
|
||||
rank = get_tensor_model_parallel_rank()
|
||||
|
||||
tensor_list = [torch.empty_like(input_) for _ in range(world_size)]
|
||||
tensor_list[rank] = input_
|
||||
torch.distributed.all_gather(tensor_list, input_, group=get_tensor_model_parallel_group())
|
||||
|
||||
# Note: torch.cat already creates a contiguous tensor.
|
||||
output = torch.cat(tensor_list, dim=last_dim).contiguous()
|
||||
|
||||
return output
|
||||
|
||||
|
||||
def _gather_along_first_dim(input_):
|
||||
"""Gather tensors and concatinate along the first dimension."""
|
||||
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
# Bypass the function if we are using only 1 GPU.
|
||||
if world_size == 1:
|
||||
return input_
|
||||
|
||||
dim_size = list(input_.size())
|
||||
dim_size[0] = dim_size[0] * world_size
|
||||
|
||||
output = torch.empty(dim_size, dtype=input_.dtype,
|
||||
device=torch.cuda.current_device())
|
||||
torch.distributed._all_gather_base(output, input_.contiguous(),
|
||||
group=get_tensor_model_parallel_group())
|
||||
|
||||
return output
|
||||
|
||||
|
||||
def _reduce_scatter_along_first_dim(input_):
|
||||
"""Reduce-scatter the input tensor across model parallel group."""
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
# Bypass the function if we are using only 1 GPU.
|
||||
if world_size == 1:
|
||||
return input_
|
||||
|
||||
dim_size = list(input_.size())
|
||||
check_divisible(dim_size[0], world_size,
|
||||
error_info="{} % {}, First dimension of the tensor should be divisible by tensor parallel size")
|
||||
|
||||
dim_size[0] = dim_size[0] // world_size
|
||||
|
||||
output = torch.empty(dim_size, dtype=input_.dtype,
|
||||
device=torch.cuda.current_device())
|
||||
|
||||
torch.distributed._reduce_scatter_base(output, input_.contiguous(),
|
||||
group=get_tensor_model_parallel_group())
|
||||
return output
|
||||
|
||||
|
||||
def _split_along_first_dim(input_):
|
||||
"""Split the tensor along its first dimension and keep the
|
||||
corresponding slice."""
|
||||
|
||||
world_size = get_tensor_model_parallel_world_size()
|
||||
# Bypass the function if we are using only 1 GPU.
|
||||
if world_size == 1:
|
||||
return input_
|
||||
|
||||
# Split along first dimension.
|
||||
dim_size = input_.size()[0]
|
||||
check_divisible(dim_size, world_size,
|
||||
error_info="{} % {}, First dimension of the tensor should be divisible by tensor parallel size")
|
||||
local_dim_size = dim_size // world_size
|
||||
rank = get_tensor_model_parallel_rank()
|
||||
dim_offset = rank * local_dim_size
|
||||
|
||||
output = input_[dim_offset:dim_offset + local_dim_size].contiguous()
|
||||
|
||||
return output
|
||||
|
||||
|
||||
class _ScatterToSequenceParallelRegion(torch.autograd.Function):
|
||||
"""Split the input and keep only the corresponding chuck to the rank."""
|
||||
|
||||
@staticmethod
|
||||
def symbolic(graph, input_):
|
||||
return _split_along_first_dim(input_)
|
||||
|
||||
@staticmethod
|
||||
def forward(ctx, input_):
|
||||
return _split_along_first_dim(input_)
|
||||
|
||||
@staticmethod
|
||||
def backward(ctx, grad_output):
|
||||
return _gather_along_first_dim(grad_output)
|
||||
|
||||
|
||||
class _GatherFromSequenceParallelRegion(torch.autograd.Function):
|
||||
"""Gather the input from sequence parallel region and concatinate."""
|
||||
|
||||
@staticmethod
|
||||
def symbolic(graph, input_, tensor_parallel_output_grad=True):
|
||||
return _gather_along_first_dim(input_)
|
||||
|
||||
@staticmethod
|
||||
def forward(ctx, input_, tensor_parallel_output_grad=True):
|
||||
ctx.tensor_parallel_output_grad = tensor_parallel_output_grad
|
||||
return _gather_along_first_dim(input_)
|
||||
|
||||
@staticmethod
|
||||
def backward(ctx, grad_output):
|
||||
tensor_parallel_output_grad = ctx.tensor_parallel_output_grad
|
||||
|
||||
# If the computation graph after the gather operation is
# in the tensor parallel mode, output gradients need to be
# reduce-scattered, whereas if the computation is duplicated,
# output gradients need to be scattered.
|
||||
if tensor_parallel_output_grad:
|
||||
return _reduce_scatter_along_first_dim(grad_output), None
|
||||
else:
|
||||
return _split_along_first_dim(grad_output), None
|
||||
|
||||
|
||||
class _CopyToModelParallelRegion(torch.autograd.Function):
|
||||
"""Pass the input to the model parallel region."""
|
||||
|
||||
@staticmethod
|
||||
def symbolic(graph, input_):
|
||||
return input_
|
||||
|
||||
@staticmethod
|
||||
def forward(ctx, input_):
|
||||
return input_
|
||||
|
||||
@staticmethod
|
||||
def backward(ctx, grad_output):
|
||||
return _reduce(grad_output)
|
||||
|
||||
|
||||
class _ReduceFromModelParallelRegion(torch.autograd.Function):
|
||||
"""All-reduce the input from the model parallel region."""
|
||||
|
||||
@staticmethod
|
||||
def symbolic(graph, input_):
|
||||
return _reduce(input_)
|
||||
|
||||
@staticmethod
|
||||
def forward(ctx, input_):
|
||||
return _reduce(input_)
|
||||
|
||||
@staticmethod
|
||||
def backward(ctx, grad_output):
|
||||
return grad_output
|
||||
|
||||
|
||||
class _ScatterToModelParallelRegion(torch.autograd.Function):
|
||||
"""Split the input and keep only the corresponding chuck to the rank."""
|
||||
|
||||
@staticmethod
|
||||
def symbolic(graph, input_):
|
||||
return _split_along_last_dim(input_)
|
||||
|
||||
@staticmethod
|
||||
def forward(ctx, input_):
|
||||
return _split_along_last_dim(input_)
|
||||
|
||||
@staticmethod
|
||||
def backward(ctx, grad_output):
|
||||
return _gather_along_last_dim(grad_output)
|
||||
|
||||
|
||||
class _ReduceScatterToSequenceParallelRegion(torch.autograd.Function):
|
||||
"""Reduce scatter the input from the model parallel region."""
|
||||
|
||||
@staticmethod
|
||||
def symbolic(graph, input_):
|
||||
return _reduce_scatter_along_first_dim(input_)
|
||||
|
||||
@staticmethod
|
||||
def forward(ctx, input_):
|
||||
return _reduce_scatter_along_first_dim(input_)
|
||||
|
||||
@staticmethod
|
||||
def backward(ctx, grad_output):
|
||||
return _gather_along_first_dim(grad_output)
|
||||
|
||||
|
||||
|
||||
class _GatherFromModelParallelRegion(torch.autograd.Function):
|
||||
"""Gather the input from model parallel region and concatinate."""
|
||||
|
||||
@staticmethod
|
||||
def symbolic(graph, input_):
|
||||
return _gather_along_last_dim(input_)
|
||||
|
||||
@staticmethod
|
||||
def forward(ctx, input_):
|
||||
return _gather_along_last_dim(input_)
|
||||
|
||||
@staticmethod
|
||||
def backward(ctx, grad_output):
|
||||
return _split_along_last_dim(grad_output)
|
||||
|
||||
|
||||
# -----------------
|
||||
# Helper functions.
|
||||
# -----------------
|
||||
|
||||
def copy_to_tensor_model_parallel_region(input_):
|
||||
return _CopyToModelParallelRegion.apply(input_)
|
||||
|
||||
|
||||
def reduce_from_tensor_model_parallel_region(input_):
|
||||
return _ReduceFromModelParallelRegion.apply(input_)
|
||||
|
||||
|
||||
def scatter_to_tensor_model_parallel_region(input_):
|
||||
return _ScatterToModelParallelRegion.apply(input_)
|
||||
|
||||
|
||||
def gather_from_sequence_parallel_region(input_, tensor_parallel_output_grad=True):
|
||||
return _GatherFromSequenceParallelRegion.apply(input_, tensor_parallel_output_grad)
|
||||
|
||||
|
||||
def gather_from_tensor_model_parallel_region(input_):
|
||||
return _GatherFromModelParallelRegion.apply(input_)
|
||||
|
||||
|
||||
def reduce_scatter_to_sequence_parallel_region(input_):
|
||||
return _ReduceScatterToSequenceParallelRegion.apply(input_)
|
||||
|
||||
|
||||
def scatter_to_sequence_parallel_region(input_):
|
||||
return _ScatterToSequenceParallelRegion.apply(input_)
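# Hedged summary (not part of the original file) of how the autograd functions above
# pair up: every forward collective has its conjugate in backward.
#   copy_to_tensor_model_parallel_region        -> all-reduce of the gradient
#   reduce_from_tensor_model_parallel_region    -> identity in backward
#   scatter_to_tensor_model_parallel_region     -> all-gather (last dim) in backward
#   gather_from_tensor_model_parallel_region    -> split (last dim) in backward
#   reduce_scatter_to_sequence_parallel_region  -> all-gather (first dim) in backward
def _example_roundtrip(hidden_states):
    # With a tensor model parallel world size of 1 every mapping is a no-op, which
    # is what makes single-device debugging of the parallel layers possible.
    y = copy_to_tensor_model_parallel_region(hidden_states)
    return reduce_from_tensor_model_parallel_region(y)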
@ -1,309 +0,0 @@
# coding=utf-8
|
||||
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License.
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
# Parts of the code here are adapted from PyTorch
|
||||
|
||||
import contextlib
|
||||
|
||||
import torch
|
||||
from torch import _C
|
||||
from torch.utils.checkpoint import detach_variable
|
||||
from torch.cuda import device as device_ctx_manager
|
||||
from torch.cuda import _lazy_call
|
||||
from modellink import get_args
|
||||
from modellink.memory import allocate_mem_buff
|
||||
|
||||
from modellink.core.parallel_state import get_data_parallel_rank
|
||||
from modellink.core.parallel_state import get_tensor_model_parallel_group
|
||||
from modellink.core.parallel_state import get_tensor_model_parallel_rank
|
||||
from modellink.core.parallel_state import get_tensor_model_parallel_world_size
|
||||
from modellink.core.utils import safely_set_viewless_tensor_data
|
||||
from modellink.core.tensor_parallel.utils import (
|
||||
split_tensor_into_1d_equal_chunks,
|
||||
gather_split_1d_tensor
|
||||
)
|
||||
from modellink.error_utils import check_divisible
|
||||
from modellink.error_utils import ensure_var_is_none
|
||||
|
||||
# Default name for the model parallel rng tracker.
|
||||
_MODEL_PARALLEL_RNG_TRACKER_NAME = 'model-parallel-rng'
|
||||
|
||||
# Whether to apply model parallelism to checkpointed hidden states.
|
||||
_CHECKPOINTED_ACTIVATIONS_MEMORY_BUFFER = None
|
||||
|
||||
|
||||
def init_checkpointed_activations_memory_buffer():
|
||||
"""Initializ the memory buffer for the checkpointed activations."""
|
||||
args = get_args()
|
||||
|
||||
per_layer = args.micro_batch_size * args.max_position_embeddings * \
|
||||
args.hidden_size // args.tensor_model_parallel_size
|
||||
check_divisible(args.num_layers, args.checkpoint_num_layers,
|
||||
error_info='{} % {}, number of layers is not divisible by checkpoint-num-layers')
|
||||
num_checkpointer_layers = args.num_layers // args.checkpoint_num_layers
|
||||
numel = per_layer * num_checkpointer_layers
|
||||
dtype = torch.half
|
||||
if not args.fp16:
|
||||
dtype = torch.float
|
||||
|
||||
global _CHECKPOINTED_ACTIVATIONS_MEMORY_BUFFER
|
||||
ensure_var_is_none(_CHECKPOINTED_ACTIVATIONS_MEMORY_BUFFER,
|
||||
error_message='checkpointed activations memory buffer is already allocated.')
|
||||
_CHECKPOINTED_ACTIVATIONS_MEMORY_BUFFER = allocate_mem_buff(
|
||||
'checkpointed activations', numel, dtype, track_usage=False)
|
||||
|
||||
|
||||
def reset_checkpointed_activations_memory_buffer():
|
||||
"""Reset the memory used for checkpointing."""
|
||||
if _CHECKPOINTED_ACTIVATIONS_MEMORY_BUFFER is not None:
|
||||
_CHECKPOINTED_ACTIVATIONS_MEMORY_BUFFER.reset()
|
||||
|
||||
|
||||
def _set_cuda_rng_state(new_state, device=-1):
|
||||
"""Sets the random number generator state of the current GPU.
|
||||
|
||||
Arguments:
|
||||
new_state (torch.ByteTensor): The desired state
|
||||
This function is adapted from PyTorch repo (torch.cuda.set_rng_state)
|
||||
with a single change: the input state is not cloned. Cloning caused
|
||||
major performance issues for +4 GPU cases.
|
||||
"""
|
||||
if hasattr(_C, '_cuda_setRNGState') and callable(_C._cuda_setRNGState):
|
||||
# older PyTorch
|
||||
def cb():
|
||||
with device_ctx_manager(device):
|
||||
_C._cuda_setRNGState(new_state)
|
||||
else:
|
||||
# newer PyTorch
|
||||
if device == -1:
|
||||
device = torch.device('cuda')
|
||||
elif isinstance(device, str):
|
||||
device = torch.device(device)
|
||||
elif isinstance(device, int):
|
||||
device = torch.device('cuda', device)
|
||||
|
||||
def cb():
|
||||
idx = device.index
|
||||
if idx is None:
|
||||
idx = torch.cuda.current_device()
|
||||
|
||||
default_generator = torch.cuda.default_generator[idx]
|
||||
default_generator.set_state(new_state)
|
||||
|
||||
_lazy_call(cb)
|
||||
|
||||
|
||||
class CudaRNGStatesTracker:
|
||||
"""Tracker for the cuda RNG states.
|
||||
|
||||
Using the `add` method, a cuda rng state is initialized based on
|
||||
the input `seed` and is assigned to `name`. Later, by forking the
|
||||
rng state, we can perform operations and return to our starting
|
||||
cuda state.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
# Map from a string name to the cuda rng state.
|
||||
self.states_ = {}
|
||||
# Seeds are just for bookkeeping and to ensure no seed is set twice.
|
||||
self.seeds_ = set()
|
||||
|
||||
def reset(self):
|
||||
"""Set to the initial state (no tracker)."""
|
||||
self.states_ = {}
|
||||
self.seeds_ = set()
|
||||
|
||||
def get_states(self):
|
||||
"""Get rng states. Copy the dictionary so we have direct
|
||||
pointers to the states, not just a pointer to the dictionary.
|
||||
"""
|
||||
states = {}
|
||||
for name in self.states_:
|
||||
states[name] = self.states_[name]
|
||||
return states
|
||||
|
||||
def set_states(self, states):
|
||||
"""Set the rng states. For efficiency purposes, we do not check
|
||||
the size of seed for compatibility.
|
||||
"""
|
||||
self.states_ = states
|
||||
|
||||
def add(self, name, seed):
|
||||
"""Track the rng state."""
|
||||
# Check seed is not already used.
|
||||
if seed in self.seeds_:
|
||||
raise Exception('seed {} already exists'.format(seed))
|
||||
self.seeds_.add(seed)
|
||||
# Check that state is not already defined.
|
||||
if name in self.states_:
|
||||
raise Exception('cuda rng state {} already exists'.format(name))
|
||||
# Get the current rng state.
|
||||
orig_rng_state = torch.cuda.get_rng_state()
|
||||
# Set the new state and store it.
|
||||
torch.cuda.manual_seed(seed)
|
||||
self.states_[name] = torch.cuda.get_rng_state()
|
||||
# Reset rng state to what it was.
|
||||
_set_cuda_rng_state(orig_rng_state)
|
||||
|
||||
@contextlib.contextmanager
|
||||
def fork(self, name=_MODEL_PARALLEL_RNG_TRACKER_NAME):
|
||||
"""Fork the cuda rng state, perform operations, and exit with
|
||||
the original state.
|
||||
"""
|
||||
# Check if we have added the state
|
||||
if name not in self.states_:
|
||||
raise Exception('cuda rng state {} is not added'.format(name))
|
||||
# Store current rng state.
|
||||
orig_cuda_rng_state = torch.cuda.get_rng_state()
|
||||
# Set rng state to the desired one
|
||||
_set_cuda_rng_state(self.states_[name])
|
||||
# Do the stuff we wanted to do.
|
||||
try:
|
||||
yield
|
||||
finally:
|
||||
# Update the current rng state for later use.
|
||||
self.states_[name] = torch.cuda.get_rng_state()
|
||||
# And set the state to the original state we started with.
|
||||
_set_cuda_rng_state(orig_cuda_rng_state)
|
||||
|
||||
|
||||
# RNG tracker object.
|
||||
_CUDA_RNG_STATE_TRACKER = CudaRNGStatesTracker()
|
||||
|
||||
|
||||
def get_cuda_rng_tracker():
|
||||
"""Get cuda rng tracker."""
|
||||
return _CUDA_RNG_STATE_TRACKER
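# Hedged usage sketch (not part of the original file): dropout inside a
# tensor-parallel region is run under the forked tracker state so that every
# data-parallel replica of the same shard draws an identical mask, while the
# default (unforked) state stays untouched for the non-parallel regions.
def _example_dropout_under_tracker(dropout_module, hidden_states):
    with get_cuda_rng_tracker().fork():
        return dropout_module(hidden_states)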
|
||||
|
||||
|
||||
def model_parallel_cuda_manual_seed(seed):
|
||||
"""Initialize model parallel cuda seed.
|
||||
|
||||
This function should be called after the model parallel is
|
||||
initialized. Also, no torch.cuda.manual_seed should be called
|
||||
after this function. Basically, this is a replacement for that
|
||||
function.
|
||||
Two sets of RNG states are tracked:
|
||||
default state: This is for data parallelism and is the same among a
|
||||
set of model parallel GPUs but different across
|
||||
different model parallel groups. This is used for
|
||||
example for dropout in the non-tensor-model-parallel regions.
|
||||
tensor-model-parallel state: This state is different among a set of model
|
||||
parallel GPUs, but the same across data parallel
|
||||
groups. This is used for example for dropout in
|
||||
model parallel regions.
|
||||
"""
|
||||
# 2718 is just for fun and any POSITIVE value will work.
|
||||
offset = seed + 2718
|
||||
tensor_model_parallel_seed = offset + get_tensor_model_parallel_rank()
|
||||
# Data parallel gets the original seed.
|
||||
data_parallel_seed = seed
|
||||
|
||||
if torch.distributed.get_rank() == 0:
|
||||
print('> initializing model parallel cuda seeds on global rank {}, '
|
||||
'model parallel rank {}, and data parallel rank {} with '
|
||||
'model parallel seed: {} and data parallel seed: {}'.format(
|
||||
torch.distributed.get_rank(), get_tensor_model_parallel_rank(),
|
||||
get_data_parallel_rank(), tensor_model_parallel_seed,
|
||||
data_parallel_seed), flush=True)
|
||||
_CUDA_RNG_STATE_TRACKER.reset()
|
||||
# Set the default state.
|
||||
torch.cuda.manual_seed(data_parallel_seed)
|
||||
# and model parallel state.
|
||||
_CUDA_RNG_STATE_TRACKER.add(_MODEL_PARALLEL_RNG_TRACKER_NAME,
|
||||
tensor_model_parallel_seed)
|
||||
|
||||
|
||||
class CheckpointFunction(torch.autograd.Function):
|
||||
"""This function is adapted from torch.utils.checkpoint with
|
||||
two main changes:
|
||||
1) torch.cuda.set_rng_state is replaced with `_set_cuda_rng_state`
|
||||
2) the states in the model parallel tracker are also properly
|
||||
tracked/set/reset.
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def forward(ctx, run_function, distribute_saved_activations, *args):
|
||||
ctx.run_function = run_function
|
||||
ctx.distribute_saved_activations = distribute_saved_activations
|
||||
|
||||
# Copy the rng states.
|
||||
ctx.fwd_cpu_rng_state = torch.get_rng_state()
|
||||
ctx.fwd_cuda_rng_state = torch.cuda.get_rng_state()
|
||||
ctx.fwd_cuda_rng_state_tracker = get_cuda_rng_tracker().get_states()
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = run_function(*args)
|
||||
|
||||
# Divide hidden states across model parallel group and only keep
|
||||
# the chunk corresponding to the current rank.
|
||||
if distribute_saved_activations:
|
||||
ctx.input_0_shape = args[0].data.shape
|
||||
safely_set_viewless_tensor_data(
|
||||
args[0], split_tensor_into_1d_equal_chunks(args[0].data, new_buffer=True)
|
||||
)
|
||||
|
||||
# Store everything.
|
||||
ctx.save_for_backward(*args)
|
||||
|
||||
return outputs
|
||||
|
||||
@staticmethod
|
||||
def backward(ctx, *args):
|
||||
if not torch.autograd._is_checkpoint_valid():
|
||||
raise RuntimeError("Checkpointing is not compatible with .grad(), "
|
||||
"please use .backward() if possible")
|
||||
inputs = ctx.saved_tensors
|
||||
if ctx.distribute_saved_activations:
|
||||
safely_set_viewless_tensor_data(
|
||||
inputs[0], gather_split_1d_tensor(inputs[0].data).view(ctx.input_0_shape)
|
||||
)
|
||||
|
||||
# Store the current states.
|
||||
bwd_cpu_rng_state = torch.get_rng_state()
|
||||
bwd_cuda_rng_state = torch.cuda.get_rng_state()
|
||||
bwd_cuda_rng_state_tracker = get_cuda_rng_tracker().get_states()
|
||||
|
||||
# Set the states to what it used to be before the forward pass.
|
||||
torch.set_rng_state(ctx.fwd_cpu_rng_state)
|
||||
_set_cuda_rng_state(ctx.fwd_cuda_rng_state)
|
||||
get_cuda_rng_tracker().set_states(ctx.fwd_cuda_rng_state_tracker)
|
||||
|
||||
# Compute the forward pass.
|
||||
detached_inputs = detach_variable(inputs)
|
||||
with torch.enable_grad():
|
||||
outputs = ctx.run_function(*detached_inputs)
|
||||
|
||||
# Set the states back to what it was at the start of this function.
|
||||
torch.set_rng_state(bwd_cpu_rng_state)
|
||||
_set_cuda_rng_state(bwd_cuda_rng_state)
|
||||
get_cuda_rng_tracker().set_states(bwd_cuda_rng_state_tracker)
|
||||
|
||||
if isinstance(outputs, torch.Tensor):
|
||||
outputs = (outputs,)
|
||||
elif len(outputs) == 2 and isinstance(outputs[1], torch.Tensor) and \
|
||||
torch.equal(outputs[1], torch.tensor(0, dtype=outputs[1].dtype).to(torch.cuda.current_device())):
|
||||
# a hacky solution to overcome issue when running old script examples/pretrain_gpt_distributed.sh
|
||||
outputs = (outputs[0],)
|
||||
torch.autograd.backward(outputs, args)
|
||||
grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else inp
|
||||
for inp in detached_inputs)
|
||||
return (None, None) + grads
|
||||
|
||||
|
||||
def checkpoint(function, distribute_saved_activations, *args):
|
||||
"""Checkpoint a model or part of the model.
|
||||
This has been directly copied from torch.utils.checkpoint.
|
||||
"""
|
||||
return CheckpointFunction.apply(function, distribute_saved_activations, *args)
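# Hedged usage sketch (not part of the original file): checkpointing a stack of
# transformer layers. Only the inputs are stored; the layers are re-run inside
# CheckpointFunction.backward with the saved RNG states restored, so dropout masks
# match the original forward pass exactly.
def _example_checkpointed_block(layers, hidden_states, attention_mask):
    def custom_forward(hidden, mask):
        for layer in layers:
            hidden = layer(hidden, mask)
        return hidden
    return checkpoint(custom_forward, False, hidden_states, attention_mask)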
@ -1,109 +0,0 @@
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
|
||||
|
||||
from typing import List, Sequence
|
||||
|
||||
import torch
|
||||
from modellink.core import parallel_state
|
||||
from modellink.core.utils import divide
|
||||
|
||||
|
||||
def split_tensor_along_last_dim(
|
||||
tensor: torch.Tensor, num_partitions: int, contiguous_split_chunks: bool = False,
|
||||
) -> List[torch.Tensor]:
|
||||
""" Split a tensor along its last dimension.
|
||||
|
||||
Arguments:
|
||||
tensor: input tensor.
|
||||
num_partitions: number of partitions to split the tensor
|
||||
contiguous_split_chunks: If True, make each chunk contiguous
|
||||
in memory.
|
||||
|
||||
Returns:
|
||||
A list of Tensors
|
||||
"""
|
||||
# Get the size and dimension.
|
||||
last_dim = tensor.dim() - 1
|
||||
last_dim_size = divide(tensor.size()[last_dim], num_partitions)
|
||||
# Split.
|
||||
tensor_list = torch.split(tensor, last_dim_size, dim=last_dim)
|
||||
# Note: torch.split does not create contiguous tensors by default.
|
||||
if contiguous_split_chunks:
|
||||
return tuple(chunk.contiguous() for chunk in tensor_list)
|
||||
|
||||
return tensor_list
|
||||
|
||||
|
||||
def split_tensor_into_1d_equal_chunks(tensor, new_buffer=False):
|
||||
""" Break a tensor into equal 1D chunks across tensor parallel ranks.
|
||||
|
||||
Returns a Tensor or View with this rank's portion of the data.
|
||||
|
||||
Arguments:
|
||||
tensor: The tensor to split
|
||||
|
||||
Keyword Arguments:
|
||||
new_buffer (bool): If True, returns a new Tensor.
|
||||
If False, returns a view into the existing Tensor.
|
||||
Default is False
|
||||
|
||||
"""
|
||||
partition_size = torch.numel(tensor) // parallel_state.get_tensor_model_parallel_world_size()
|
||||
start_index = partition_size * parallel_state.get_tensor_model_parallel_rank()
|
||||
end_index = start_index + partition_size
|
||||
if new_buffer:
|
||||
data = torch.empty(
|
||||
partition_size,
|
||||
dtype=tensor.dtype,
|
||||
device=torch.cuda.current_device(),
|
||||
requires_grad=False,
|
||||
)
|
||||
data.copy_(tensor.view(-1)[start_index:end_index])
|
||||
else:
|
||||
data = tensor.view(-1)[start_index:end_index]
|
||||
return data
|
||||
|
||||
|
||||
def gather_split_1d_tensor(tensor):
|
||||
""" Opposite of split_tensor_into_1d_equal_chunks. Gather values from tensor
|
||||
model parallel ranks.
|
||||
|
||||
Returns a new Tensor with the gathered data.
|
||||
|
||||
Arguments:
|
||||
tensor: A Tensor or view of this rank's portion of the data.
|
||||
"""
|
||||
numel_gathered = torch.numel(tensor) * parallel_state.get_tensor_model_parallel_world_size()
|
||||
gathered = torch.empty(
|
||||
numel_gathered, dtype=tensor.dtype, device=torch.cuda.current_device(), requires_grad=False
|
||||
)
|
||||
# This API is experimental in pytorch (as of Feb 2022) and
|
||||
# this might break in future pytorch releases. We chose this API
|
||||
# as opposed to torch.distributed.all_gather for efficiency reasons.
|
||||
# This API calls directly NCCL all-gather versus the former does
|
||||
# internal copies and can potentially cause slow down.
|
||||
torch.distributed._all_gather_base(
|
||||
gathered, tensor, group=parallel_state.get_tensor_model_parallel_group()
|
||||
)
|
||||
return gathered
|
||||
|
||||
|
||||
class VocabUtility:
|
||||
""" Split the vocabulary into `world_size` chunks and return the first
|
||||
and last index of the vocabulary belonging to the `rank`
|
||||
partition. Note that indices are in [first, last).
|
||||
|
||||
"""
|
||||
@staticmethod
|
||||
def vocab_range_from_per_partition_vocab_size(
|
||||
per_partition_vocab_size: int, rank, world_size: int
|
||||
) -> Sequence[int]:
|
||||
index_f = rank * per_partition_vocab_size
|
||||
index_l = index_f + per_partition_vocab_size
|
||||
return index_f, index_l
|
||||
|
||||
@staticmethod
|
||||
def vocab_range_from_global_vocab_size(global_vocab_size: int, rank: int, world_size: int) -> Sequence[int]:
|
||||
per_partition_vocab_size = divide(global_vocab_size, world_size)
|
||||
return VocabUtility.vocab_range_from_per_partition_vocab_size(
|
||||
per_partition_vocab_size, rank, world_size
|
||||
)
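# Hedged illustration (not part of the original file): a padded vocabulary of 32000
# entries split over 4 tensor-parallel ranks gives each rank an 8000-entry slice.
def _example_vocab_range():
    # rank 3 owns the half-open interval [24000, 32000)
    return VocabUtility.vocab_range_from_global_vocab_size(32000, rank=3, world_size=4)
# _example_vocab_range() -> (24000, 32000)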
@ -1,3 +0,0 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
|
||||
|
||||
from .transformer_config import TransformerConfig
@ -1,52 +0,0 @@
# coding=utf-8
|
||||
# Copyright Information:
|
||||
# Huawei Technologies Co., Ltd. All Rights Reserved © 2020
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import torch
|
||||
import torch_npu
|
||||
|
||||
|
||||
class FlashSelfAttention(torch.nn.Module):
|
||||
"""Implement the scaled dot product attention with softmax.
|
||||
Arguments
|
||||
---------
|
||||
softmax_scale: The temperature to use for the softmax attention.
|
||||
(default: 1/sqrt(d_keys) where d_keys is computed at
|
||||
runtime)
|
||||
attention_dropout: The dropout rate to apply to the attention
|
||||
(default: 0.0)
|
||||
"""
|
||||
|
||||
def __init__(self, causal=False, softmax_scale=1., attention_dropout=0.):
|
||||
super().__init__()
|
||||
self.causal = causal
|
||||
self.softmax_scale = softmax_scale
|
||||
self.dropout_p = attention_dropout
|
||||
|
||||
def forward(self, qkvn, pse, attention_mask):
|
||||
q, k, v, n = qkvn
|
||||
|
||||
if self.causal:
|
||||
output = torch_npu.npu_fusion_attention(
|
||||
q, k, v, n, "SBH",
|
||||
pse=pse,
|
||||
padding_mask=None,
|
||||
atten_mask=attention_mask,
|
||||
scale=self.softmax_scale,
|
||||
pre_tockens=k.shape[0], # seq_len
|
||||
next_tockens=0, # 0
|
||||
keep_prob=1 - self.dropout_p,
|
||||
)[0]
|
||||
return output
|
||||
raise Exception("the attention type {} is not support!".format(self.attention_type))
@ -1,120 +0,0 @@
# coding=utf-8
|
||||
# Copyright (c) 2023, HUAWEI CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import logging
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
try:
|
||||
import torch_npu
|
||||
except ImportError:
|
||||
logging.warning("Import torch_npu Error.")
|
||||
|
||||
|
||||
class TriangleAttention(nn.Module):
|
||||
"""
|
||||
The triangle attention reduces the attention calculation of the mask
|
||||
part by dividing the q, k, and v matrices into blocks
|
||||
|
||||
Arguments:
|
||||
block_size: The size of the inverted triangle block, the default is 512,
|
||||
the smaller the block_size, the more redundant computation is saved,
|
||||
but the more small operators are launched
|
||||
masked_softmax_func: masked softmax function to be applied.
|
||||
dropout_func: dropout function to be applied.
|
||||
"""
|
||||
|
||||
def __init__(self, block_size=512, masked_softmax_func=None, dropout_func=None):
|
||||
super(TriangleAttention, self).__init__()
|
||||
self.block_size = block_size
|
||||
self.mask_tmp_initialed = False
|
||||
self.mask_tmp_groups = []
|
||||
if masked_softmax_func is not None:
|
||||
self.scaled_masked_softmax = masked_softmax_func
|
||||
else:
|
||||
self.scaled_masked_softmax = torch_npu.npu_scaled_masked_softmax
|
||||
if dropout_func:
|
||||
self.dropout = True
|
||||
self.attn_dropout = dropout_func
|
||||
else:
|
||||
self.dropout = False
|
||||
|
||||
def compute_attn(self, q_layer, k_layer, v_layer, mask_tmp):
|
||||
# [b, hn, q_size, hd] * [b, hn, hd, kv_size] -> [b, hn, q_size, kv_size]
|
||||
cur_sim = torch.matmul(q_layer, k_layer)
|
||||
|
||||
attention_probs = self.scaled_masked_softmax(cur_sim, mask_tmp)
|
||||
|
||||
# attention dropout
|
||||
if self.dropout:
|
||||
attention_probs = self.attn_dropout(attention_probs)
|
||||
|
||||
# [b, hn, q_size, kv_size] * [b, hn, kv_size, hd] -> [b, hn, q_size, hd]
|
||||
context_layer_tmp = torch.matmul(attention_probs, v_layer)
|
||||
return context_layer_tmp
|
||||
|
||||
def forward(self, query_layer, key_layer, value_layer, attention_mask):
|
||||
# input shape: [b, hn, sq, hd]
|
||||
bsz, head_num, sequence_len, head_dim = key_layer.shape
|
||||
sparse_groups = sequence_len // self.block_size
|
||||
# Determine whether sequence_len is evenly divisible by block_size
|
||||
flag = sequence_len == self.block_size * sparse_groups
|
||||
key_layer = key_layer.transpose(2, 3).contiguous()
|
||||
if flag:
|
||||
q_tmp_layers = torch.chunk(query_layer, sparse_groups, 2)
|
||||
k_tmp_layers = torch.chunk(key_layer, sparse_groups, 3)
|
||||
v_tmp_layers = torch.chunk(value_layer, sparse_groups, 2)
|
||||
else:
|
||||
seq_tmp = self.block_size * sparse_groups
|
||||
q_last = query_layer[:, :, seq_tmp:, :].contiguous()
|
||||
mask_last = attention_mask[:, :, seq_tmp:, :].contiguous()
|
||||
q_tmp_layers = torch.chunk(query_layer[:, :, :seq_tmp, :], sparse_groups, 2)
|
||||
k_tmp_layers = torch.chunk(key_layer[:, :, :, :seq_tmp], sparse_groups, 3)
|
||||
v_tmp_layers = torch.chunk(value_layer[:, :, :seq_tmp, :], sparse_groups, 2)
|
||||
context_list_tmp, k_tmp, v_tmp = [], (), ()
|
||||
for i in range(sparse_groups):
|
||||
# compute slice shape of q k v for each loop
|
||||
q_begin, q_end = i * self.block_size, (i + 1) * self.block_size
|
||||
kv_begin, kv_end = 0, (i + 1) * self.block_size
|
||||
q_tmp = q_tmp_layers[i]
|
||||
# slice k and v
|
||||
if i == 0:
|
||||
k_tmp = k_tmp_layers[i].contiguous()
|
||||
v_tmp = v_tmp_layers[i].contiguous()
|
||||
else:
|
||||
k_tmp = torch.cat((k_tmp, k_tmp_layers[i]), -1).contiguous()
|
||||
v_tmp = torch.cat((v_tmp, v_tmp_layers[i]), -2).contiguous()
|
||||
|
||||
if not self.mask_tmp_initialed:
|
||||
mask_tmp = attention_mask[:, :, q_begin:q_end, kv_begin:kv_end]
|
||||
self.mask_tmp_groups.append(mask_tmp.contiguous())
|
||||
else:
|
||||
mask_tmp = self.mask_tmp_groups[i]
|
||||
|
||||
context_layer_tmp = self.compute_attn(q_tmp, k_tmp, v_tmp, mask_tmp)
|
||||
context_list_tmp.append(context_layer_tmp)
|
||||
|
||||
if not flag:
|
||||
# circumstances that cannot be divisible
|
||||
context_layer_tmp = self.compute_attn(q_last, key_layer, value_layer, mask_last)
|
||||
context_list_tmp.append(context_layer_tmp)
|
||||
context_layer = torch.cat(context_list_tmp, 2)
|
||||
self.mask_tmp_initialed = True
|
||||
new_context_layer_shape = (sequence_len, bsz, head_num * head_dim)
|
||||
context_layer = torch_npu.npu_confusion_transpose(context_layer, [2, 0, 1, 3], [*new_context_layer_shape], True)
|
||||
# =========================
|
||||
# Context layer. [sq, b, hp]
|
||||
# =========================
|
||||
return context_layer
|
||||
|
|
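The block-splitting loop above is easier to follow with an NPU-free sketch: under a causal mask, query block i only needs key/value blocks 0..i, so the work forms a lower triangle of blocks. The snippet below reproduces that idea with plain PyTorch softmax (it does not use npu_scaled_masked_softmax or npu_confusion_transpose, and the shapes and block size are illustrative) and checks it against full causal attention.

```python
import torch

def blockwise_causal_attention(q, k, v, block_size):
    # q, k, v: [b, heads, seq, head_dim]; seq assumed divisible by block_size.
    b, h, seq, d = q.shape
    groups = seq // block_size
    outputs = []
    for i in range(groups):
        q_blk = q[:, :, i * block_size:(i + 1) * block_size, :]
        k_blk = k[:, :, :(i + 1) * block_size, :]   # keys up to this block
        v_blk = v[:, :, :(i + 1) * block_size, :]
        scores = torch.matmul(q_blk, k_blk.transpose(-2, -1)) / d ** 0.5
        # causal mask only matters inside the current diagonal block
        q_pos = torch.arange(i * block_size, (i + 1) * block_size)
        k_pos = torch.arange((i + 1) * block_size)
        mask = k_pos[None, :] > q_pos[:, None]
        scores = scores.masked_fill(mask, float("-inf"))
        outputs.append(torch.matmul(torch.softmax(scores, dim=-1), v_blk))
    return torch.cat(outputs, dim=2)

q, k, v = (torch.randn(1, 2, 8, 4) for _ in range(3))
blocked = blockwise_causal_attention(q, k, v, block_size=4)

# Full attention with the same causal mask, for comparison.
scores = torch.matmul(q, k.transpose(-2, -1)) / 4 ** 0.5
full_mask = torch.triu(torch.ones(8, 8, dtype=torch.bool), diagonal=1)
full = torch.matmul(torch.softmax(scores.masked_fill(full_mask, float("-inf")), -1), v)
print(torch.allclose(blocked, full, atol=1e-6))  # expect: True
```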
@ -1,238 +0,0 @@
|
|||
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
|
||||
|
||||
from dataclasses import dataclass
|
||||
from typing import Callable
|
||||
|
||||
import torch.nn.functional as F
|
||||
|
||||
from modellink.core import ModelParallelConfig
|
||||
from modellink.core.utils import init_method_normal, scaled_init_method_normal
|
||||
|
||||
|
||||
@dataclass
|
||||
class TransformerConfig(ModelParallelConfig):
|
||||
"""Configuration object for megatron-core transformers.
|
||||
|
||||
Attributes:
|
||||
|
||||
# model architecture
|
||||
num_layers (int): Number of transformer layers in a transformer block.
|
||||
hidden_size (int): Transformer hidden size.
|
||||
ffn_hidden_size (int): Transformer Feed-Forward Network hidden size.
|
||||
This is set to 4*hidden_size if not provided. Defaults to None.
|
||||
num_attention_heads (int): Number of transformer attention heads.
|
||||
kv_channels (int): Projection weights dimension in multi-head attention.
|
||||
This is set to hidden_size // num_attention_heads if not provided.
|
||||
Defaults to None.
|
||||
num_query_groups (int): Number of query groups for group query attention. If None, normal attention is used.
|
||||
|
||||
hidden_dropout (float): Dropout probability for transformer hidden state. Defaults to 0.1.
|
||||
attention_dropout (float): Post attention dropout probability. Defaults to 0.1.
|
||||
fp32_residual_connection (bool): If true, move residual connections to fp32.
|
||||
apply_residual_connection_post_layernorm (bool): If true, uses the original BERT residual connection ordering.
|
||||
Defaults to False.
|
||||
layernorm_epsilon (float): Layernorm epsilon. Defaults to 1e-5.
|
||||
|
||||
layernorm_zero_centered_gamma (bool): if set to 'True', the LayerNorm is adjusted to center the gamma values
|
||||
around 0. This improves numerical stability. Defaults to False.
|
||||
|
||||
add_bias_linear (bool): Include a bias term in all linear layers (QKV projections, after core attention, and two
|
||||
in MLP layer). Default is True.
|
||||
|
||||
gated_linear_unit (bool): Use a gated linear unit for the first linear layer in the MLP. Defaults to False.
|
||||
|
||||
activation_func (Callable): Activation function to use for the non-linearity in the MLP. Defaults to F.gelu.
|
||||
|
||||
# initialization
|
||||
init_method (Callable): Method to initialize weights. Note that bias is always set to
|
||||
zero. Should be a function that takes a single Tensor and
|
||||
initializes it. Defaults to
|
||||
megatron.core.utils.init_method_normal(init_method_std) which is
|
||||
torch.nn.init.normal_ with mean=0.0 and std=init_method_std.
|
||||
|
||||
output_layer_init_method (Callable): Method to initialize weights of the output layer of
|
||||
both attention and MLP blocks. Defaults to
|
||||
megatron.core.utils.scaled_init_method_normal(init_method_std)
|
||||
which is torch.nn.init.normal_ with mean=0.0 and
|
||||
std=init_method_std / math.sqrt(2.0 * num_layers).
|
||||
|
||||
init_method_std (float): Standard deviation of the zero mean normal for the default
|
||||
initialization method, not used if init_method and
|
||||
output_layer_init_method are provided. Defaults to 0.02.
|
||||
|
||||
# mixed-precision
|
||||
apply_query_key_layer_scaling (bool): If true, scale Q * K^T by 1 / layer-number. Defaults to True.
|
||||
attention_softmax_in_fp32 (bool): If true, run attention masking and softmax in fp32.
|
||||
This should be true if apply_query_key_layer_scaling is true.
|
||||
|
||||
# fusion
|
||||
bias_gelu_fusion (bool): If true, fuses the bias addition and the gelu activation. Defaults to False.
|
||||
masked_softmax_fusion (bool): If true, uses softmax fusion.
|
||||
persist_layer_norm (bool): If true, uses the persistent fused layer norm kernel.
|
||||
This kernel only supports a fixed set of hidden sizes.
|
||||
Defaults to False.
|
||||
bias_dropout_fusion (bool): If true, uses bias dropout fusion.
|
||||
|
||||
# activation recomputation
|
||||
|
||||
recompute_granularity (str): Activation recompute granularity. These memory intensive activations
|
||||
are relatively cheap to recompute, which makes activation checkpointing more efficient
|
||||
for large LLMs (20B+). 'full' will checkpoint the entire transformer layer.
|
||||
|
||||
recompute_method (str): uniform will uniformly divide the total number of transformer layers in a transformer
|
||||
block and recompute the input activation of each divided chunk at the specified
|
||||
granularity. block will recompute the input activations for only a set number of
|
||||
transformer layers per pipeline stage. The rest of the layers in the pipeline stage
|
||||
will not have any activations recomputed. Must be 'uniform' or 'block'. Defaults to
|
||||
None.
|
||||
|
||||
recompute_num_layers (int): When recompute_method is uniform, recompute_num_layers is the number of transformer
|
||||
layers in each uniformly divided recompute unit. When recompute_method is block,
|
||||
recompute_num_layers is the number of transformer layers to recompute within each
|
||||
pipeline stage. Defaults to None.
|
||||
|
||||
distribute_saved_activations (bool): If true, distribute recomputed activations across the model parallel
|
||||
group. Defaults to None.
|
||||
|
||||
|
||||
fp8 (bool): Enables the use of FP8 precision through Transformer Engine.
|
||||
|
||||
fp8_e4m3 (bool): Enables the use of FP8 tensors in e4m3 format for both forward and backward passes.
|
||||
|
||||
fp8_margin (int): Margin used in the scaling factor computation.
|
||||
|
||||
fp8_interval (int): Controls how often the scaling factor is recomputed.
|
||||
|
||||
fp8_amax_history_len (int): The length of the amax history window used for scaling factor computation.
|
||||
|
||||
fp8_amax_compute_algo (str): Algorithm used for choosing the `amax` value for the scaling factor computation.
|
||||
There are 2 predefined choices: `max` chooses the largest `amax` in the history
|
||||
window, while `most_recent` always chooses the most recently seen value.
|
||||
|
||||
"""
|
||||
|
||||
# model architecture
|
||||
num_layers: int = 0
|
||||
hidden_size: int = 0
|
||||
num_attention_heads: int = 0
|
||||
num_query_groups: int = None
|
||||
|
||||
ffn_hidden_size: int = None
|
||||
kv_channels: int = None
|
||||
hidden_dropout: float = 0.1
|
||||
attention_dropout: float = 0.1
|
||||
fp32_residual_connection: bool = False
|
||||
apply_residual_connection_post_layernorm: bool = False
|
||||
layernorm_epsilon: float = 1e-5
|
||||
layernorm_zero_centered_gamma: bool = False
|
||||
add_bias_linear: bool = True
|
||||
gated_linear_unit: bool = False
|
||||
activation_func: Callable = F.gelu
|
||||
|
||||
# initialization
|
||||
init_method: Callable = None
|
||||
output_layer_init_method: Callable = None
|
||||
init_method_std: float = 0.02
|
||||
|
||||
# mixed-precision
|
||||
apply_query_key_layer_scaling: bool = True
|
||||
attention_softmax_in_fp32: bool = True
|
||||
|
||||
# communication
|
||||
|
||||
# fusion
|
||||
bias_gelu_fusion: bool = False # this should be bias_activation_fusion ?
|
||||
masked_softmax_fusion: bool = False
|
||||
persist_layer_norm: bool = False
|
||||
bias_dropout_fusion: bool = False # this should be bias_dropout_add_fusion?
|
||||
|
||||
# activation recomputation
|
||||
recompute_granularity: str = None
|
||||
recompute_method: str = None
|
||||
recompute_num_layers: int = None
|
||||
distribute_saved_activations: bool = None
|
||||
|
||||
# fp8 related
|
||||
fp8: bool = False
|
||||
fp8_e4m3: bool = False
|
||||
fp8_margin: int = 0
|
||||
fp8_interval: int = 1
|
||||
fp8_amax_history_len: int = 1
|
||||
fp8_amax_compute_algo: str = "most_recent"
|
||||
|
||||
use_flash_attn: bool = False
|
||||
|
||||
def __post_init__(self):
|
||||
"""
|
||||
Python dataclass method that is used to modify attributes after initialization.
|
||||
"""
|
||||
super().__post_init__()
|
||||
if self.fp16 and self.bf16:
|
||||
raise ValueError(f'Only one of self.fp16: {self.fp16} and self.bf16 {self.bf16} should be True.')
|
||||
|
||||
if self.num_attention_heads % self.tensor_model_parallel_size != 0:
|
||||
raise ValueError(f"num_attention_heads ({self.num_attention_heads}) must be a multiple of "
|
||||
f"tensor_model_parallel_size ({self.tensor_model_parallel_size}).")
|
||||
|
||||
if self.ffn_hidden_size is None:
|
||||
self.ffn_hidden_size = 4 * self.hidden_size
|
||||
|
||||
if self.kv_channels is None:
|
||||
self.kv_channels = self.hidden_size // self.num_attention_heads
|
||||
|
||||
if self.num_query_groups is None:
|
||||
self.num_query_groups = self.num_attention_heads
|
||||
|
||||
if self.num_query_groups % self.tensor_model_parallel_size != 0:
|
||||
raise ValueError(f"num_query_groups ({self.num_query_groups}) must be a multiple of "
|
||||
f"tensor_model_parallel_size ({self.tensor_model_parallel_size}).")
|
||||
|
||||
if self.apply_query_key_layer_scaling:
|
||||
self.attention_softmax_in_fp32 = True
|
||||
|
||||
if self.recompute_granularity is not None:
|
||||
self.__recompute_granularity_init()
|
||||
|
||||
if self.apply_query_key_layer_scaling:
|
||||
self.attention_softmax_in_fp32 = True
|
||||
|
||||
if self.bias_gelu_fusion:
|
||||
if not self.add_bias_linear:
|
||||
raise ValueError("When bias_gelu_fusion is True, add_bias_linear must also be True.")
|
||||
|
||||
if self.activation_func != F.gelu:
|
||||
raise ValueError(f'When bias_gelu_fusion is True, activation_func must be F.gelu.')
|
||||
|
||||
if self.init_method is None:
|
||||
self.init_method = init_method_normal(self.init_method_std)
|
||||
|
||||
if self.output_layer_init_method is None:
|
||||
self.output_layer_init_method = scaled_init_method_normal(self.init_method_std, self.num_layers)
|
||||
|
||||
def __recompute_granularity_init(self):
|
||||
if self.recompute_granularity not in ['full']:
|
||||
raise ValueError(
|
||||
f'recompute_granularity: {self.recompute_granularity} must be "full".')
|
||||
|
||||
if self.recompute_method is not None:
|
||||
if self.recompute_method not in ['block', 'uniform']:
|
||||
raise ValueError(f'recompute_method: {self.recompute_method} must be "block" or "uniform".')
|
||||
elif self.recompute_granularity != 'selective':
|
||||
raise ValueError(f'recompute_method must be "block" or "uniform" '
|
||||
f'when recompute_granularity is {self.recompute_granularity}.')
|
||||
|
||||
if self.recompute_num_layers is None:
|
||||
raise ValueError(f'When using recompute_granularity: {self.recompute_granularity},'
|
||||
f' recompute_num_layers must be between '
|
||||
f'1 and num_layers_per_pipeline_rank:'
|
||||
f' {self.num_layers // self.pipeline_model_parallel_size}')
|
||||
|
||||
if self.distribute_saved_activations and self.sequence_parallel:
|
||||
raise ValueError(f'distribute_saved_activations: {self.distribute_saved_activations} '
|
||||
f'must be false when sequence parallel is enabled: {self.sequence_parallel}')
|
||||
|
||||
if self.virtual_pipeline_model_parallel_size is not None:
|
||||
if not self.num_layers % self.virtual_pipeline_model_parallel_size == 0:
|
||||
raise ValueError(f'num_layers: {self.num_layers} must be divisible by virtual_pipeline_model_parallel_size '
|
||||
f'{self.virtual_pipeline_model_parallel_size}')
|
|
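Most of the `__post_init__` above fills in derived defaults. Because `ModelParallelConfig` is not shown in this diff, the following is a stripped-down, hypothetical dataclass that only illustrates that default-filling pattern (ffn_hidden_size = 4 * hidden_size, kv_channels = hidden_size // num_attention_heads, num_query_groups = num_attention_heads); it is not the real class.

```python
from dataclasses import dataclass

@dataclass
class MiniTransformerConfig:
    # Illustrative subset of the fields handled in TransformerConfig.__post_init__.
    num_layers: int = 0
    hidden_size: int = 0
    num_attention_heads: int = 0
    ffn_hidden_size: int = None
    kv_channels: int = None
    num_query_groups: int = None

    def __post_init__(self):
        if self.ffn_hidden_size is None:
            self.ffn_hidden_size = 4 * self.hidden_size
        if self.kv_channels is None:
            self.kv_channels = self.hidden_size // self.num_attention_heads
        if self.num_query_groups is None:
            self.num_query_groups = self.num_attention_heads

cfg = MiniTransformerConfig(num_layers=32, hidden_size=4096, num_attention_heads=32)
print(cfg.ffn_hidden_size, cfg.kv_channels, cfg.num_query_groups)  # 16384 128 32
```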
@ -1,189 +0,0 @@
|
|||
"""Utility functions used throughout AscendSpeed core"""
|
||||
|
||||
import math
|
||||
from functools import reduce
|
||||
import operator
|
||||
|
||||
import torch
|
||||
|
||||
from modellink.error_utils import check_divisible, ensure_var_is_none
|
||||
|
||||
|
||||
def divide(numerator, denominator):
|
||||
"""
|
||||
Ensure that numerator is divisible by the denominator and return
|
||||
the division value.
|
||||
"""
|
||||
check_divisible(numerator, denominator)
|
||||
return numerator // denominator
|
||||
|
||||
|
||||
def get_attr_wrapped_model(model, attr, allow_none=True):
|
||||
"""Get an attribute from a wrapped model"""
|
||||
if isinstance(model, list):
|
||||
raise RuntimeError("_get_attr_wrapped_model given a list of models")
|
||||
|
||||
if allow_none:
|
||||
|
||||
def condition(model, attr):
|
||||
return not hasattr(model, attr)
|
||||
|
||||
else:
|
||||
|
||||
def condition(model, attr):
|
||||
return getattr(model, attr, None) is None
|
||||
|
||||
while condition(model, attr):
|
||||
if not hasattr(model, "module"):
|
||||
raise RuntimeError(f"_get_attr_wrapped_model couldn't find attribute {attr}")
|
||||
|
||||
model = model.module
|
||||
return getattr(model, attr)
|
||||
|
||||
|
||||
def get_model_type(model):
|
||||
return get_attr_wrapped_model(model, 'model_type')
|
||||
|
||||
|
||||
def get_model_config(model):
|
||||
from modellink import get_args
|
||||
from modellink.arguments import core_transformer_config_from_args
|
||||
args = get_args()
|
||||
if args.deepspeed:
|
||||
return core_transformer_config_from_args(get_args())
|
||||
return get_attr_wrapped_model(model, 'config', allow_none=False)
|
||||
|
||||
|
||||
class GlobalMemoryBuffer:
|
||||
"""
|
||||
Global buffer to avoid dynamic memory allocations.
|
||||
Caller should ensure that buffers of the same name
|
||||
are not used concurrently.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self.buffer = {}
|
||||
|
||||
def get_tensor(self, tensor_shape, dtype, name):
|
||||
required_len = reduce(operator.mul, tensor_shape, 1)
|
||||
if self.buffer.get((name, dtype), None) is None or \
|
||||
self.buffer[(name, dtype)].numel() < required_len:
|
||||
self.buffer[(name, dtype)] = \
|
||||
torch.empty(required_len,
|
||||
dtype=dtype,
|
||||
device=torch.cuda.current_device(),
|
||||
requires_grad=False)
|
||||
|
||||
return self.buffer[(name, dtype)][0:required_len].view(*tensor_shape)
|
||||
|
||||
|
||||
def _kernel_make_viewless_tensor(inp, requires_grad):
|
||||
'''
|
||||
Make a viewless tensor.
|
||||
|
||||
View tensors have the undesirable side-effect of retaining a reference
|
||||
to the originally-viewed tensor, even after manually setting the '.data'
|
||||
field. This method creates a new tensor that links to the old tensor's
|
||||
data, without linking the viewed tensor, referenced via the '._base'
|
||||
field.
|
||||
'''
|
||||
out = torch.empty(
|
||||
(1,),
|
||||
dtype=inp.dtype,
|
||||
device=inp.device,
|
||||
requires_grad=requires_grad,
|
||||
)
|
||||
with torch.no_grad():
|
||||
out.set_(inp.data)
|
||||
return out
|
||||
|
||||
|
||||
class MakeViewlessTensor(torch.autograd.Function):
|
||||
'''
|
||||
Autograd function to make a viewless tensor.
|
||||
|
||||
This function should be used in cases where the computation graph needs
|
||||
to be propagated, but we only want a viewless tensor (e.g.,
|
||||
ParallelTransformer's hidden_states). Call this function by passing
|
||||
'keep_graph = True' to 'make_viewless_tensor()'.
|
||||
'''
|
||||
@staticmethod
|
||||
def forward(ctx, inp, requires_grad):
|
||||
return _kernel_make_viewless_tensor(inp, requires_grad)
|
||||
|
||||
@staticmethod
|
||||
def backward(ctx, grad_output):
|
||||
return grad_output, None
|
||||
|
||||
|
||||
def make_viewless_tensor(inp, requires_grad, keep_graph):
|
||||
'''
|
||||
Entry-point for creating viewless tensors.
|
||||
|
||||
This method should be used, rather than calling 'MakeViewlessTensor'
|
||||
or '_kernel_make_viewless_tensor' directly. This method acts as a
|
||||
switch for determining if an autograd function or a regular method
|
||||
should be used to create the tensor.
|
||||
'''
|
||||
|
||||
# return tensor as-is, if not a 'view'
|
||||
if inp._base is None:
|
||||
return inp
|
||||
|
||||
# create viewless tensor
|
||||
if keep_graph:
|
||||
return MakeViewlessTensor.apply(inp, requires_grad)
|
||||
else:
|
||||
return _kernel_make_viewless_tensor(inp, requires_grad)
|
||||
|
||||
|
||||
def assert_viewless_tensor(tensor, extra_msg=None):
|
||||
"""
|
||||
Assert that a tensor is not a view (i.e., its '._base' field is
|
||||
not set).
|
||||
"""
|
||||
if isinstance(tensor, list):
|
||||
[assert_viewless_tensor(t) for t in tensor]
|
||||
return tensor
|
||||
if not isinstance(tensor, torch.Tensor):
|
||||
return tensor
|
||||
ensure_var_is_none(tensor._base, error_message=(
|
||||
"Ensure tensor._base is None before setting tensor.data or storing "
|
||||
"tensor to memory buffer. Otherwise, a memory leak will occur (and "
|
||||
"likely accumulate over iterations). %s"
|
||||
) % extra_msg)
|
||||
return tensor
|
||||
|
||||
|
||||
def safely_set_viewless_tensor_data(tensor, new_data_tensor):
|
||||
'''
|
||||
Safely set tensor's '.data' field.
|
||||
|
||||
Check first that the tensor is viewless (i.e., '._base' not set). If not,
|
||||
raise an exception.
|
||||
'''
|
||||
assert_viewless_tensor(
|
||||
tensor,
|
||||
extra_msg="FYI, tensor._base has shape %s, and new_data_tensor has shape %s."
|
||||
% ("--" if tensor._base is None else tensor._base.shape, new_data_tensor.shape),
|
||||
)
|
||||
tensor.data = new_data_tensor
|
||||
|
||||
|
||||
def init_method_normal(sigma):
|
||||
"""Init method based on N(0, sigma)."""
|
||||
|
||||
def init_(tensor):
|
||||
return torch.nn.init.normal_(tensor, mean=0.0, std=sigma)
|
||||
|
||||
return init_
|
||||
|
||||
|
||||
def scaled_init_method_normal(sigma, num_layers):
|
||||
"""Init method based on N(0, sigma/sqrt(2*num_layers)."""
|
||||
std = sigma / math.sqrt(2.0 * num_layers)
|
||||
|
||||
def init_(tensor):
|
||||
return torch.nn.init.normal_(tensor, mean=0.0, std=std)
|
||||
|
||||
return init_
|
|
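The two init-method factories at the end of this file return closures that Megatron-style layers call on their weight tensors. A short usage sketch is shown below; the closures are re-stated inline (mirroring the code above) so the snippet runs stand-alone even though this commit deletes modellink/core/utils.py, and the layer shapes are arbitrary.

```python
import math
import torch

def init_method_normal(sigma):
    # Same closure shape as the utility above: N(0, sigma).
    return lambda tensor: torch.nn.init.normal_(tensor, mean=0.0, std=sigma)

def scaled_init_method_normal(sigma, num_layers):
    # N(0, sigma / sqrt(2 * num_layers)), used for output projections.
    std = sigma / math.sqrt(2.0 * num_layers)
    return lambda tensor: torch.nn.init.normal_(tensor, mean=0.0, std=std)

init_fn = init_method_normal(0.02)
out_init_fn = scaled_init_method_normal(0.02, num_layers=24)

w_in, w_out = torch.empty(1024, 4096), torch.empty(4096, 1024)
init_fn(w_in)
out_init_fn(w_out)
print(round(w_in.std().item(), 4))   # ~0.02
print(round(w_out.std().item(), 4))  # ~0.02 / sqrt(48) ≈ 0.0029
```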
@ -1,68 +0,0 @@
|
|||
# coding=utf-8
|
||||
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""Blendable dataset."""
|
||||
|
||||
import time
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
|
||||
from modellink import print_rank_0
|
||||
from modellink.error_utils import check_equal, ensure_valid
|
||||
|
||||
|
||||
class BlendableDataset(torch.utils.data.Dataset):
|
||||
|
||||
|
||||
def __init__(self, datasets, weights):
|
||||
|
||||
self.datasets = datasets
|
||||
num_datasets = len(datasets)
|
||||
check_equal(num_datasets, len(weights))
|
||||
|
||||
self.size = 0
|
||||
for dataset in self.datasets:
|
||||
self.size += len(dataset)
|
||||
|
||||
# Normalize weights.
|
||||
weights = np.array(weights, dtype=np.float64)
|
||||
sum_weights = np.sum(weights)
|
||||
ensure_valid(sum_weights > 0.0)
|
||||
weights /= sum_weights
|
||||
|
||||
# Build indices.
|
||||
start_time = time.time()
|
||||
ensure_valid(num_datasets < 255)
|
||||
self.dataset_index = np.zeros(self.size, dtype=np.uint8)
|
||||
self.dataset_sample_index = np.zeros(self.size, dtype=np.int64)
|
||||
|
||||
from megatron.data import helpers
|
||||
helpers.build_blending_indices(self.dataset_index,
|
||||
self.dataset_sample_index,
|
||||
weights, num_datasets, self.size,
|
||||
torch.distributed.get_rank() == 0)
|
||||
print_rank_0('> elapsed time for building blendable dataset indices: '
|
||||
'{:.2f} (sec)'.format(time.time() - start_time))
|
||||
|
||||
|
||||
def __len__(self):
|
||||
return self.size
|
||||
|
||||
|
||||
def __getitem__(self, idx):
|
||||
dataset_idx = self.dataset_index[idx]
|
||||
sample_idx = self.dataset_sample_index[idx]
|
||||
return self.datasets[dataset_idx][sample_idx]
|
|
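The index-building step above is delegated to a compiled helper (megatron.data.helpers.build_blending_indices). For intuition only, here is a pure-Python sketch of the blending idea, namely drawing each next sample from the dataset that currently lags its normalized weight the most; it is an approximation, not the exact algorithm inside the compiled helper.

```python
# Greedy blending sketch: pick, at every step, the dataset whose current
# sampling fraction falls furthest below its normalized weight.
def build_blending_indices_sketch(weights, total_size):
    total = sum(weights)
    weights = [w / total for w in weights]
    counts = [0] * len(weights)
    dataset_index, dataset_sample_index = [], []
    for i in range(total_size):
        errors = [w * (i + 1) - c for w, c in zip(weights, counts)]
        d = max(range(len(weights)), key=lambda j: errors[j])
        dataset_index.append(d)
        dataset_sample_index.append(counts[d])
        counts[d] += 1
    return dataset_index, dataset_sample_index

idx, sample_idx = build_blending_indices_sketch([0.7, 0.3], total_size=10)
print(idx)         # roughly 7 zeros and 3 ones, interleaved
print(sample_idx)  # position of each pick within its own dataset
```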
@ -25,7 +25,7 @@ import torch
|
|||
import numpy as np
|
||||
from datasets import load_dataset
|
||||
|
||||
from modellink.data import indexed_dataset
|
||||
from megatron.core.datasets import indexed_dataset
|
||||
from modellink.data.prompter import Prompter, AlpacaTemplate
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
@ -79,9 +79,9 @@ class BaseDatasetHandler(object):
|
|||
output_bin_files[key] = f"{self.args.output_prefix}_{key}_{level}.bin"
|
||||
output_idx_files[key] = f"{self.args.output_prefix}_{key}_{level}.idx"
|
||||
# vocab_size=None : use int32 dtype for -100 will be used in labels
|
||||
builders[key] = indexed_dataset.make_builder(output_bin_files[key],
|
||||
impl=self.args.dataset_impl,
|
||||
vocab_size=None)
|
||||
dtype_ = indexed_dataset.DType.optimal_dtype(self.tokenizer.vocab_size)
|
||||
builders[key] = indexed_dataset.MMapIndexedDatasetBuilder(output_bin_files[key],
|
||||
dtype=dtype_)
|
||||
|
||||
startup_end = time.time()
|
||||
proc_start = time.time()
|
||||
|
|
|
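The change above replaces the old `make_builder(..., vocab_size=None)` call with an explicit dtype derived from the tokenizer vocabulary size. The helper below only sketches the usual rule of thumb (smallest integer type that can hold every token id, with a signed type when -100 label padding is needed); it does not reproduce the exact behaviour of `DType.optimal_dtype` in megatron.core, and the function name is illustrative.

```python
import numpy as np

def optimal_dtype_sketch(vocab_size):
    # A token id must fit in the chosen integer type; labels that contain -100
    # need a signed type, which is why the old code path forced int32.
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    return np.int32

print(optimal_dtype_sketch(32000))   # <class 'numpy.uint16'>
print(optimal_dtype_sketch(152064))  # <class 'numpy.int32'>
print(optimal_dtype_sketch(None))    # <class 'numpy.int32'>
```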
@ -1,236 +0,0 @@
|
|||
# coding=utf-8
|
||||
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""Dataloaders."""
|
||||
|
||||
|
||||
import torch
|
||||
import random
|
||||
|
||||
from transformers import DataCollatorForSeq2Seq
|
||||
|
||||
from modellink import get_args, get_tokenizer
|
||||
from modellink.core import parallel_state
|
||||
from modellink.error_utils import check_divisible, ensure_valid
|
||||
|
||||
|
||||
|
||||
def build_pretraining_data_loader(dataset, consumed_samples):
|
||||
"""Buld dataloader given an input dataset."""
|
||||
|
||||
if dataset is None:
|
||||
return None
|
||||
args = get_args()
|
||||
|
||||
# modellink sampler
|
||||
if args.dataloader_type == 'single':
|
||||
if args.optimized_pipeline:
|
||||
batch_sampler = DynamicMicroBatchPretrainingSampler(
|
||||
total_samples=len(dataset),
|
||||
consumed_samples=consumed_samples,
|
||||
micro_batch_size=args.micro_batch_size,
|
||||
data_parallel_rank=parallel_state.get_data_parallel_rank(),
|
||||
data_parallel_size=parallel_state.get_data_parallel_world_size())
|
||||
else:
|
||||
batch_sampler = MegatronPretrainingSampler(
|
||||
total_samples=len(dataset),
|
||||
consumed_samples=consumed_samples,
|
||||
micro_batch_size=args.micro_batch_size,
|
||||
data_parallel_rank=parallel_state.get_data_parallel_rank(),
|
||||
data_parallel_size=parallel_state.get_data_parallel_world_size())
|
||||
elif args.dataloader_type == 'cyclic':
|
||||
batch_sampler = MegatronPretrainingRandomSampler(
|
||||
total_samples=len(dataset),
|
||||
consumed_samples=consumed_samples,
|
||||
micro_batch_size=args.micro_batch_size,
|
||||
data_parallel_rank=parallel_state.get_data_parallel_rank(),
|
||||
data_parallel_size=parallel_state.get_data_parallel_world_size())
|
||||
else:
|
||||
raise Exception('{} dataloader type is not supported.'.format(
|
||||
args.dataloader_type))
|
||||
|
||||
tokenizer = get_tokenizer().tokenizer
|
||||
|
||||
if args.is_instruction_dataset:
|
||||
collator = DataCollatorForSeq2Seq(
|
||||
tokenizer,
|
||||
pad_to_multiple_of=32 if args.variable_seq_lengths else args.seq_length,
|
||||
return_tensors='pt',
|
||||
padding=True
|
||||
)
|
||||
else:
|
||||
collator = None
|
||||
|
||||
# Torch dataloader.
|
||||
return torch.utils.data.DataLoader(dataset,
|
||||
batch_sampler=batch_sampler,
|
||||
num_workers=args.num_workers,
|
||||
generator=torch.Generator().manual_seed(args.seed),
|
||||
collate_fn=collator,
|
||||
pin_memory=True)
|
||||
|
||||
|
||||
class MegatronPretrainingSampler:
|
||||
|
||||
def __init__(self, total_samples, consumed_samples, micro_batch_size,
|
||||
data_parallel_rank, data_parallel_size, drop_last=True):
|
||||
# Keep a copy of input params for later use.
|
||||
self.total_samples = total_samples
|
||||
self.consumed_samples = consumed_samples
|
||||
self.micro_batch_size = micro_batch_size
|
||||
self.data_parallel_rank = data_parallel_rank
|
||||
self.micro_batch_times_data_parallel_size = \
|
||||
self.micro_batch_size * data_parallel_size
|
||||
self.drop_last = drop_last
|
||||
|
||||
# Sanity checks.
|
||||
ensure_valid(self.total_samples > 0, error_message='no sample' \
|
||||
' to consume: {}'.format(self.total_samples))
|
||||
ensure_valid(self.consumed_samples < self.total_samples, error_message='no samples' \
|
||||
' left to consume: {}, {}'.format(self.consumed_samples, self.total_samples))
|
||||
ensure_valid(self.micro_batch_size > 0)
|
||||
ensure_valid(data_parallel_size > 0)
|
||||
ensure_valid(self.data_parallel_rank < data_parallel_size, error_message='data_parallel_rank' \
|
||||
' should be smaller than data size: {}, {}'.format(self.data_parallel_rank, data_parallel_size))
|
||||
|
||||
def __len__(self):
|
||||
return self.total_samples
|
||||
|
||||
def get_start_end_idx(self):
|
||||
start_idx = self.data_parallel_rank * self.micro_batch_size
|
||||
end_idx = start_idx + self.micro_batch_size
|
||||
return start_idx, end_idx
|
||||
|
||||
def __iter__(self):
|
||||
batch = []
|
||||
# The last partial batch will be dropped unless drop_last is set to False
|
||||
for idx in range(self.consumed_samples, self.total_samples):
|
||||
batch.append(idx)
|
||||
if len(batch) == self.micro_batch_times_data_parallel_size:
|
||||
start_idx, end_idx = self.get_start_end_idx()
|
||||
yield batch[start_idx:end_idx]
|
||||
batch = []
|
||||
|
||||
# Yield the last partial batch if drop_last is not set
|
||||
if len(batch) > 0 and not self.drop_last:
|
||||
start_idx, end_idx = self.get_start_end_idx()
|
||||
yield batch[start_idx:end_idx]
|
||||
|
||||
|
||||
class MegatronPretrainingRandomSampler:
|
||||
|
||||
def __init__(self, total_samples, consumed_samples, micro_batch_size,
|
||||
data_parallel_rank, data_parallel_size):
|
||||
# Keep a copy of input params for later use.
|
||||
self.total_samples = total_samples
|
||||
self.consumed_samples = consumed_samples
|
||||
self.micro_batch_size = micro_batch_size
|
||||
self.data_parallel_rank = data_parallel_rank
|
||||
self.data_parallel_size = data_parallel_size
|
||||
self.micro_batch_times_data_parallel_size = \
|
||||
self.micro_batch_size * data_parallel_size
|
||||
self.last_batch_size = \
|
||||
self.total_samples % self.micro_batch_times_data_parallel_size
|
||||
|
||||
# Sanity checks.
|
||||
ensure_valid(self.total_samples > 0, error_message='no sample' \
|
||||
' to consume: {}'.format(self.total_samples))
|
||||
ensure_valid(self.micro_batch_size > 0)
|
||||
ensure_valid(data_parallel_size > 0)
|
||||
ensure_valid(self.data_parallel_rank < data_parallel_size, error_message='data_parallel_rank' \
|
||||
' should be smaller than data size: {}, {}'.format(self.data_parallel_rank, data_parallel_size))
|
||||
|
||||
def __len__(self):
|
||||
return self.total_samples
|
||||
|
||||
def __iter__(self):
|
||||
active_total_samples = self.total_samples - self.last_batch_size
|
||||
self.epoch = self.consumed_samples // active_total_samples
|
||||
current_epoch_samples = self.consumed_samples % active_total_samples
|
||||
check_divisible(current_epoch_samples, self.micro_batch_times_data_parallel_size)
|
||||
|
||||
# data sharding and random sampling
|
||||
bucket_size = (self.total_samples // self.micro_batch_times_data_parallel_size) \
|
||||
* self.micro_batch_size
|
||||
bucket_offset = current_epoch_samples // self.data_parallel_size
|
||||
start_idx = self.data_parallel_rank * bucket_size
|
||||
|
||||
g = torch.Generator()
|
||||
g.manual_seed(self.epoch)
|
||||
random_idx = torch.randperm(bucket_size, generator=g).tolist()
|
||||
idx_range = [start_idx + x for x in random_idx[bucket_offset:]]
|
||||
|
||||
batch = []
|
||||
# Last batch if not complete will be dropped.
|
||||
for idx in idx_range:
|
||||
batch.append(idx)
|
||||
if len(batch) == self.micro_batch_size:
|
||||
self.consumed_samples += self.micro_batch_times_data_parallel_size
|
||||
yield batch
|
||||
batch = []
|
||||
|
||||
|
||||
class DynamicMicroBatchPretrainingSampler:
|
||||
|
||||
def __init__(self, total_samples, consumed_samples, micro_batch_size,
|
||||
data_parallel_rank, data_parallel_size, drop_last=True):
|
||||
|
||||
args = get_args()
|
||||
# Keep a copy of input params for later use.
|
||||
self.total_samples = total_samples
|
||||
self.consumed_samples = consumed_samples
|
||||
self.micro_batch_size = micro_batch_size
|
||||
self.data_parallel_rank = data_parallel_rank
|
||||
self.drop_last = drop_last
|
||||
self.dynamic_micro_batch_size = args.manual_mbs
|
||||
self.micro_batch_times_data_parallel_size = [
|
||||
self.dynamic_micro_batch_size[i] * data_parallel_size \
|
||||
for i in range(len(self.dynamic_micro_batch_size))
|
||||
]
|
||||
|
||||
# Sanity checks.
|
||||
ensure_valid(self.total_samples > 0, error_message='no sample' \
|
||||
' to consume: {}'.format(self.total_samples))
|
||||
ensure_valid(self.consumed_samples < self.total_samples, error_message='no samples' \
|
||||
' left to consume: {}, {}'.format(self.consumed_samples, self.total_samples))
|
||||
ensure_valid(self.micro_batch_size > 0)
|
||||
ensure_valid(data_parallel_size > 0)
|
||||
ensure_valid(self.data_parallel_rank < data_parallel_size, error_message='data_parallel_rank' \
|
||||
' should be smaller than data size: {}, {}'.format(self.data_parallel_rank, data_parallel_size))
|
||||
|
||||
def __len__(self):
|
||||
return self.total_samples
|
||||
|
||||
def get_start_end_idx(self, n_mbs):
|
||||
start_idx = self.data_parallel_rank * self.dynamic_micro_batch_size[n_mbs]
|
||||
end_idx = start_idx + self.dynamic_micro_batch_size[n_mbs]
|
||||
return start_idx, end_idx
|
||||
|
||||
def __iter__(self):
|
||||
batch = []
|
||||
n_mbs = 0
|
||||
# The last partial batch will be dropped unless drop_last is set to False
|
||||
for idx in range(self.consumed_samples, self.total_samples):
|
||||
batch.append(idx)
|
||||
if len(batch) == self.micro_batch_times_data_parallel_size[n_mbs]:
|
||||
start_idx, end_idx = self.get_start_end_idx(n_mbs)
|
||||
yield batch[start_idx:end_idx]
|
||||
batch = []
|
||||
n_mbs = (n_mbs + 1) % len(self.micro_batch_times_data_parallel_size)
|
||||
|
||||
# Yield the last partial batch if drop_last is not set
|
||||
if len(batch) > 0 and not self.drop_last:
|
||||
start_idx, end_idx = self.get_start_end_idx(n_mbs)
|
||||
yield batch[start_idx:end_idx]
|
|
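The sampler classes above all follow the same pattern: accumulate micro_batch_size * data_parallel_size global indices, then let each data-parallel rank keep its own contiguous slice. The tiny stand-alone sketch below mirrors `MegatronPretrainingSampler.__iter__` for drop_last=True; it has no modellink imports and the numbers are illustrative.

```python
def pretraining_batches_sketch(total_samples, micro_batch_size,
                               data_parallel_rank, data_parallel_size):
    # Collect a global batch, then slice out this rank's micro batch.
    global_batch = micro_batch_size * data_parallel_size
    start = data_parallel_rank * micro_batch_size
    end = start + micro_batch_size
    batch = []
    for idx in range(total_samples):
        batch.append(idx)
        if len(batch) == global_batch:
            yield batch[start:end]
            batch = []

# 2 data-parallel ranks, micro batch of 2: rank 0 sees [0, 1], [4, 5], ...
print(list(pretraining_batches_sketch(8, 2, data_parallel_rank=0, data_parallel_size=2)))
print(list(pretraining_batches_sketch(8, 2, data_parallel_rank=1, data_parallel_size=2)))
# [[0, 1], [4, 5]] and [[2, 3], [6, 7]]
```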
@ -1,559 +0,0 @@
|
|||
# coding=utf-8
|
||||
# Copyright 2018 The Google AI Language Team Authors, and NVIDIA.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
|
||||
import math
|
||||
import os
|
||||
import time
|
||||
import collections
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from deepspeed.accelerator import get_accelerator
|
||||
|
||||
from modellink import (
|
||||
get_args,
|
||||
print_rank_0
|
||||
)
|
||||
from modellink.core import parallel_state
|
||||
from modellink.data.indexed_dataset import make_dataset as make_indexed_dataset
|
||||
from modellink.error_utils import check_divisible, check_equal, ensure_valid
|
||||
|
||||
|
||||
def get_datasets_weights_and_num_samples(data_prefix,
|
||||
train_valid_test_num_samples):
|
||||
|
||||
# The data prefix should be in the format of:
|
||||
# weight-1, data-prefix-1, weight-2, data-prefix-2, ..
|
||||
check_divisible(len(data_prefix), 2)
|
||||
num_datasets = len(data_prefix) // 2
|
||||
weights = [0] * num_datasets
|
||||
prefixes = [0] * num_datasets
|
||||
for i in range(num_datasets):
|
||||
weights[i] = float(data_prefix[2 * i])
|
||||
prefixes[i] = (data_prefix[2 * i + 1]).strip()
|
||||
# Normalize weights
|
||||
weight_sum = 0.0
|
||||
for weight in weights:
|
||||
weight_sum += weight
|
||||
ensure_valid(weight_sum > 0.0)
|
||||
weights = [weight / weight_sum for weight in weights]
|
||||
|
||||
# Add 0.5% (the 1.005 factor) so in case the blending dataset does
|
||||
# not uniformly distribute the number of samples, we still have
|
||||
# samples left to feed to the network.
|
||||
datasets_train_valid_test_num_samples = []
|
||||
for weight in weights:
|
||||
datasets_train_valid_test_num_samples.append(
|
||||
[int(math.ceil(val * weight * 1.005))
|
||||
for val in train_valid_test_num_samples])
|
||||
|
||||
|
||||
return prefixes, weights, datasets_train_valid_test_num_samples
|
||||
|
||||
|
||||
def get_a_and_b_segments(sample, np_rng):
|
||||
"""Divide sample into a and b segments."""
|
||||
|
||||
# Number of sentences in the sample.
|
||||
n_sentences = len(sample)
|
||||
# Make sure we always have two sentences.
|
||||
ensure_valid(n_sentences > 1, error_message='make sure each sample has at least two sentences.')
|
||||
|
||||
# First part:
|
||||
# `a_end` is how many sentences go into the `A`.
|
||||
a_end = 1
|
||||
if n_sentences >= 3:
|
||||
# Note that randint in numpy is exclusive of the upper bound.
|
||||
a_end = np_rng.randint(1, n_sentences)
|
||||
tokens_a = []
|
||||
for j in range(a_end):
|
||||
tokens_a.extend(sample[j])
|
||||
|
||||
# Second part:
|
||||
tokens_b = []
|
||||
for j in range(a_end, n_sentences):
|
||||
tokens_b.extend(sample[j])
|
||||
|
||||
# Random next:
|
||||
is_next_random = False
|
||||
if np_rng.random() < 0.5:
|
||||
is_next_random = True
|
||||
tokens_a, tokens_b = tokens_b, tokens_a
|
||||
|
||||
return tokens_a, tokens_b, is_next_random
|
||||
|
||||
|
||||
def truncate_segments(tokens_a, tokens_b, len_a, len_b, max_num_tokens, np_rng):
|
||||
"""Truncates a pair of sequences to a maximum sequence length."""
|
||||
#print(len_a, len_b, max_num_tokens)
|
||||
ensure_valid(len_a > 0)
|
||||
if len_a + len_b <= max_num_tokens:
|
||||
return False
|
||||
while len_a + len_b > max_num_tokens:
|
||||
if len_a > len_b:
|
||||
len_a -= 1
|
||||
tokens = tokens_a
|
||||
else:
|
||||
len_b -= 1
|
||||
tokens = tokens_b
|
||||
if np_rng.random() < 0.5:
|
||||
del tokens[0]
|
||||
else:
|
||||
tokens.pop()
|
||||
return True
|
||||
|
||||
|
||||
def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id):
|
||||
"""Merge segments A and B, add [CLS] and [SEP] and build tokentypes."""
|
||||
|
||||
tokens = []
|
||||
tokentypes = []
|
||||
# [CLS].
|
||||
tokens.append(cls_id)
|
||||
tokentypes.append(0)
|
||||
# Segment A.
|
||||
for token in tokens_a:
|
||||
tokens.append(token)
|
||||
tokentypes.append(0)
|
||||
# [SEP].
|
||||
tokens.append(sep_id)
|
||||
tokentypes.append(0)
|
||||
# Segment B.
|
||||
for token in tokens_b:
|
||||
tokens.append(token)
|
||||
tokentypes.append(1)
|
||||
if tokens_b:
|
||||
# [SEP].
|
||||
tokens.append(sep_id)
|
||||
tokentypes.append(1)
|
||||
|
||||
return tokens, tokentypes
|
||||
|
||||
|
||||
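A concrete example of the layout produced by `create_tokens_and_tokentypes`, using made-up token ids (cls_id=101, sep_id=102) and two short segments:

```python
tokens_a, tokens_b = [2023, 2003], [7592, 2088, 999]   # illustrative ids
cls_id, sep_id = 101, 102

# [CLS] A ... [SEP] B ... [SEP], with token type 0 for the A half and 1 for B.
tokens = [cls_id] + tokens_a + [sep_id] + tokens_b + [sep_id]
tokentypes = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

print(tokens)      # [101, 2023, 2003, 102, 7592, 2088, 999, 102]
print(tokentypes)  # [0, 0, 0, 0, 1, 1, 1, 1]
```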
MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
|
||||
["index", "label"])
|
||||
|
||||
|
||||
def is_start_piece(piece):
|
||||
"""Check if the current word piece is the starting piece (BERT)."""
|
||||
# When a word has been split into
|
||||
# WordPieces, the first token does not have any marker and any subsequent
|
||||
# tokens are prefixed with ##. So whenever we see the ## token, we
|
||||
# append it to the previous set of word indexes.
|
||||
return not piece.startswith("##")
|
||||
|
||||
|
||||
def create_masked_lm_predictions(tokens,
|
||||
vocab_id_list, vocab_id_to_token_dict,
|
||||
masked_lm_prob,
|
||||
cls_id, sep_id, mask_id,
|
||||
max_predictions_per_seq,
|
||||
np_rng,
|
||||
max_ngrams=3,
|
||||
do_whole_word_mask=True,
|
||||
favor_longer_ngram=False,
|
||||
do_permutation=False,
|
||||
geometric_dist=False,
|
||||
masking_style="bert"):
|
||||
"""
|
||||
Creates the predictions for the masked LM objective.
|
||||
Note: Tokens here are vocab ids and not text tokens.
|
||||
"""
|
||||
|
||||
cand_indexes = []
|
||||
# Note(mingdachen): We create a list for recording if the piece is
|
||||
# the starting piece of current token, where 1 means true, so that
|
||||
# on-the-fly whole word masking is possible.
|
||||
token_boundary = [0] * len(tokens)
|
||||
|
||||
for (i, token) in enumerate(tokens):
|
||||
if token == cls_id or token == sep_id:
|
||||
token_boundary[i] = 1
|
||||
continue
|
||||
# Whole Word Masking means that we mask all of the wordpieces
|
||||
# corresponding to an original word.
|
||||
#
|
||||
# Note that Whole Word Masking does *not* change the training code
|
||||
# at all -- we still predict each WordPiece independently, softmaxed
|
||||
# over the entire vocabulary.
|
||||
if (do_whole_word_mask and len(cand_indexes) >= 1 and
|
||||
not is_start_piece(vocab_id_to_token_dict[token])):
|
||||
cand_indexes[-1].append(i)
|
||||
else:
|
||||
cand_indexes.append([i])
|
||||
if is_start_piece(vocab_id_to_token_dict[token]):
|
||||
token_boundary[i] = 1
|
||||
|
||||
output_tokens = list(tokens)
|
||||
|
||||
masked_lm_positions = []
|
||||
masked_lm_labels = []
|
||||
|
||||
if masked_lm_prob == 0:
|
||||
return (output_tokens, masked_lm_positions,
|
||||
masked_lm_labels, token_boundary, None)
|
||||
|
||||
num_to_predict = min(max_predictions_per_seq,
|
||||
max(1, int(round(len(tokens) * masked_lm_prob))))
|
||||
|
||||
ngrams = np.arange(1, max_ngrams + 1, dtype=np.int64)
|
||||
if not geometric_dist:
|
||||
# Note(mingdachen):
|
||||
# By default, we set the probabilities to favor shorter ngram sequences.
|
||||
pvals = 1. / np.arange(1, max_ngrams + 1)
|
||||
pvals /= pvals.sum(keepdims=True)
|
||||
if favor_longer_ngram:
|
||||
pvals = pvals[::-1]
|
||||
|
||||
ngram_indexes = []
|
||||
for idx in range(len(cand_indexes)):
|
||||
ngram_index = []
|
||||
for n in ngrams:
|
||||
ngram_index.append(cand_indexes[idx:idx + n])
|
||||
ngram_indexes.append(ngram_index)
|
||||
|
||||
np_rng.shuffle(ngram_indexes)
|
||||
|
||||
(masked_lms, masked_spans) = ([], [])
|
||||
covered_indexes = set()
|
||||
for cand_index_set in ngram_indexes:
|
||||
if len(masked_lms) >= num_to_predict:
|
||||
break
|
||||
if not cand_index_set:
|
||||
continue
|
||||
# Note(mingdachen):
|
||||
# Skip current piece if they are covered in lm masking or previous ngrams.
|
||||
for index_set in cand_index_set[0]:
|
||||
for index in index_set:
|
||||
if index in covered_indexes:
|
||||
continue
|
||||
|
||||
if not geometric_dist:
|
||||
n = np_rng.choice(ngrams[:len(cand_index_set)],
|
||||
p=pvals[:len(cand_index_set)] /
|
||||
pvals[:len(cand_index_set)].sum(keepdims=True))
|
||||
else:
|
||||
# Sampling "n" from the geometric distribution and clipping it to
|
||||
# the max_ngrams. Using p=0.2 default from the SpanBERT paper
|
||||
n = min(np_rng.geometric(0.2), max_ngrams)
|
||||
|
||||
index_set = sum(cand_index_set[n - 1], [])
|
||||
n -= 1
|
||||
# Note(mingdachen):
|
||||
# Repeatedly looking for a candidate that does not exceed the
|
||||
# maximum number of predictions by trying shorter ngrams.
|
||||
while len(masked_lms) + len(index_set) > num_to_predict:
|
||||
if n == 0:
|
||||
break
|
||||
index_set = sum(cand_index_set[n - 1], [])
|
||||
n -= 1
|
||||
# If adding a whole-word mask would exceed the maximum number of
|
||||
# predictions, then just skip this candidate.
|
||||
if len(masked_lms) + len(index_set) > num_to_predict:
|
||||
continue
|
||||
is_any_index_covered = False
|
||||
for index in index_set:
|
||||
if index in covered_indexes:
|
||||
is_any_index_covered = True
|
||||
break
|
||||
if is_any_index_covered:
|
||||
continue
|
||||
for index in index_set:
|
||||
covered_indexes.add(index)
|
||||
masked_token = None
|
||||
if masking_style == "bert":
|
||||
# 80% of the time, replace with [MASK]
|
||||
if np_rng.random() < 0.8:
|
||||
masked_token = mask_id
|
||||
else:
|
||||
# 10% of the time, keep original
|
||||
if np_rng.random() < 0.5:
|
||||
masked_token = tokens[index]
|
||||
# 10% of the time, replace with random word
|
||||
else:
|
||||
masked_token = vocab_id_list[np_rng.randint(0, len(vocab_id_list))]
|
||||
elif masking_style == "t5":
|
||||
masked_token = mask_id
|
||||
else:
|
||||
raise ValueError("invalid value of masking style")
|
||||
|
||||
output_tokens[index] = masked_token
|
||||
masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
|
||||
|
||||
masked_spans.append(MaskedLmInstance(
|
||||
index=index_set,
|
||||
label=[tokens[index] for index in index_set]))
|
||||
|
||||
ensure_valid(len(masked_lms) <= num_to_predict)
|
||||
np_rng.shuffle(ngram_indexes)
|
||||
|
||||
select_indexes = set()
|
||||
if do_permutation:
|
||||
for cand_index_set in ngram_indexes:
|
||||
if len(select_indexes) >= num_to_predict:
|
||||
break
|
||||
if not cand_index_set:
|
||||
continue
|
||||
# Note(mingdachen):
|
||||
# Skip current piece if they are covered in lm masking or previous ngrams.
|
||||
for index_set in cand_index_set[0]:
|
||||
for index in index_set:
|
||||
if index in covered_indexes or index in select_indexes:
|
||||
continue
|
||||
|
||||
n = np.random.choice(ngrams[:len(cand_index_set)],
|
||||
p=pvals[:len(cand_index_set)] /
|
||||
pvals[:len(cand_index_set)].sum(keepdims=True))
|
||||
index_set = sum(cand_index_set[n - 1], [])
|
||||
n -= 1
|
||||
|
||||
while len(select_indexes) + len(index_set) > num_to_predict:
|
||||
if n == 0:
|
||||
break
|
||||
index_set = sum(cand_index_set[n - 1], [])
|
||||
n -= 1
|
||||
# If adding a whole-word mask would exceed the maximum number of
|
||||
# predictions, then just skip this candidate.
|
||||
if len(select_indexes) + len(index_set) > num_to_predict:
|
||||
continue
|
||||
is_any_index_covered = False
|
||||
for index in index_set:
|
||||
if index in covered_indexes or index in select_indexes:
|
||||
is_any_index_covered = True
|
||||
break
|
||||
if is_any_index_covered:
|
||||
continue
|
||||
for index in index_set:
|
||||
select_indexes.add(index)
|
||||
ensure_valid(len(select_indexes) <= num_to_predict)
|
||||
|
||||
select_indexes = sorted(select_indexes)
|
||||
permute_indexes = list(select_indexes)
|
||||
np_rng.shuffle(permute_indexes)
|
||||
orig_token = list(output_tokens)
|
||||
|
||||
for src_i, tgt_i in zip(select_indexes, permute_indexes):
|
||||
output_tokens[src_i] = orig_token[tgt_i]
|
||||
masked_lms.append(MaskedLmInstance(index=src_i, label=orig_token[src_i]))
|
||||
|
||||
masked_lms = sorted(masked_lms, key=lambda x: x.index)
|
||||
# Sort the spans by the index of the first span
|
||||
masked_spans = sorted(masked_spans, key=lambda x: x.index[0])
|
||||
|
||||
for p in masked_lms:
|
||||
masked_lm_positions.append(p.index)
|
||||
masked_lm_labels.append(p.label)
|
||||
return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary, masked_spans)
|
||||
|
||||
|
||||
def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions,
|
||||
masked_labels, pad_id, max_seq_length):
|
||||
"""Pad sequences and convert them to numpy."""
|
||||
|
||||
# Some checks.
|
||||
num_tokens = len(tokens)
|
||||
padding_length = max_seq_length - num_tokens
|
||||
ensure_valid(padding_length >= 0)
|
||||
check_equal(len(tokentypes), num_tokens)
|
||||
check_equal(len(masked_positions), len(masked_labels))
|
||||
|
||||
# Tokens and token types.
|
||||
filler = [pad_id] * padding_length
|
||||
tokens_np = np.array(tokens + filler, dtype=np.int64)
|
||||
tokentypes_np = np.array(tokentypes + filler, dtype=np.int64)
|
||||
|
||||
# Padding mask.
|
||||
padding_mask_np = np.array([1] * num_tokens + [0] * padding_length,
|
||||
dtype=np.int64)
|
||||
|
||||
# Labels and loss mask.
|
||||
labels = [-1] * max_seq_length
|
||||
loss_mask = [0] * max_seq_length
|
||||
for i in range(len(masked_positions)):
|
||||
ensure_valid(masked_positions[i] < num_tokens)
|
||||
labels[masked_positions[i]] = masked_labels[i]
|
||||
loss_mask[masked_positions[i]] = 1
|
||||
labels_np = np.array(labels, dtype=np.int64)
|
||||
loss_mask_np = np.array(loss_mask, dtype=np.int64)
|
||||
|
||||
return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np
|
||||
|
||||
|
||||
def get_indexed_dataset_(data_prefix, data_impl, skip_warmup):
|
||||
|
||||
print_rank_0(' > building dataset index ...')
|
||||
|
||||
start_time = time.time()
|
||||
indexed_dataset = make_indexed_dataset(data_prefix,
|
||||
data_impl,
|
||||
skip_warmup)
|
||||
check_equal(indexed_dataset.sizes.shape[0], indexed_dataset.doc_idx[-1])
|
||||
print_rank_0(' > finished creating indexed dataset in {:4f} '
|
||||
'seconds'.format(time.time() - start_time))
|
||||
|
||||
print_rank_0(' > indexed dataset stats:')
|
||||
print_rank_0(' number of documents: {}'.format(
|
||||
indexed_dataset.doc_idx.shape[0] - 1))
|
||||
print_rank_0(' number of sentences: {}'.format(
|
||||
indexed_dataset.sizes.shape[0]))
|
||||
|
||||
return indexed_dataset
|
||||
|
||||
|
||||
def get_split_by_range_(range_string, size):
|
||||
""" Get dataset splits based on a range:
|
||||
range_string is in the form start:end as fractions of the dataset, e.g. 0.2:0.8
|
||||
outputs an array of two values [start_index, end_index]
|
||||
"""
|
||||
# some checks that range is given in the correct form
|
||||
splits = [float(i) for i in range_string.split(":")]
|
||||
check_equal(len(splits), 2, "splits should be passed as start:end")
|
||||
ensure_valid(splits[0] <= 1 and splits[1] <= 1)
|
||||
splits_sum = sum(splits)
|
||||
ensure_valid(splits_sum > 0.0)
|
||||
splits_index = [round(s * float(size)) for s in splits]
|
||||
check_equal(len(splits_index), 2)
|
||||
return splits_index
|
||||
|
||||
|
||||
def get_train_valid_test_split_(splits_string, size):
|
||||
""" Get dataset splits from comma or '/' separated string list."""
|
||||
|
||||
splits = []
|
||||
if splits_string.find(',') != -1:
|
||||
splits = [float(s) for s in splits_string.split(',')]
|
||||
elif splits_string.find('/') != -1:
|
||||
splits = [float(s) for s in splits_string.split('/')]
|
||||
else:
|
||||
splits = [float(splits_string)]
|
||||
while len(splits) < 3:
|
||||
splits.append(0.)
|
||||
splits = splits[:3]
|
||||
splits_sum = sum(splits)
|
||||
ensure_valid(splits_sum > 0.0)
|
||||
splits = [split / splits_sum for split in splits]
|
||||
splits_index = [0]
|
||||
for index, split in enumerate(splits):
|
||||
splits_index.append(splits_index[index] +
|
||||
int(round(split * float(size))))
|
||||
diff = splits_index[-1] - size
|
||||
for index in range(1, len(splits_index)):
|
||||
splits_index[index] -= diff
|
||||
check_equal(len(splits_index), 4)
|
||||
check_equal(splits_index[-1], size)
|
||||
return splits_index
|
||||
|
||||
|
||||
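A worked example of the split arithmetic above, using the common "949,50,1" split string and a hypothetical corpus of 1,000 documents:

```python
# splits_string = "949,50,1", size = 1000
splits = [949.0, 50.0, 1.0]
total = sum(splits)
splits = [s / total for s in splits]                 # [0.949, 0.05, 0.001]
splits_index = [0]
for split in splits:
    splits_index.append(splits_index[-1] + int(round(split * 1000.0)))
diff = splits_index[-1] - 1000                       # rounding drift; 0 here
splits_index = [splits_index[0]] + [i - diff for i in splits_index[1:]]
print(splits_index)  # [0, 949, 999, 1000]: train [0, 949), valid [949, 999), test [999, 1000)
```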
def get_samples_mapping(indexed_dataset,
|
||||
data_prefix,
|
||||
num_epochs,
|
||||
max_num_samples,
|
||||
max_seq_length,
|
||||
short_seq_prob,
|
||||
seed,
|
||||
name,
|
||||
binary_head):
|
||||
"""Get a list that maps a sample index to a starting sentence index, end sentence index, and length"""
|
||||
args = get_args()
|
||||
if args.train_data_exact_num_epochs is not None and name == 'train':
|
||||
num_epochs = args.train_data_exact_num_epochs
|
||||
max_num_samples = np.iinfo(np.int64).max - 1
|
||||
else:
|
||||
if not num_epochs:
|
||||
if not max_num_samples:
|
||||
raise ValueError("Need to specify either max_num_samples "
|
||||
"or num_epochs")
|
||||
num_epochs = np.iinfo(np.int32).max - 1
|
||||
if not max_num_samples:
|
||||
max_num_samples = np.iinfo(np.int64).max - 1
|
||||
|
||||
# Filename of the index mapping
|
||||
indexmap_filename = data_prefix
|
||||
indexmap_filename += '_{}_indexmap'.format(name)
|
||||
if args.train_data_exact_num_epochs is not None and name == 'train':
|
||||
indexmap_filename += '_exact{}ep'.format(num_epochs)
|
||||
else:
|
||||
if num_epochs != (np.iinfo(np.int32).max - 1):
|
||||
indexmap_filename += '_{}ep'.format(num_epochs)
|
||||
if max_num_samples != (np.iinfo(np.int64).max - 1):
|
||||
indexmap_filename += '_{}mns'.format(max_num_samples)
|
||||
indexmap_filename += '_{}msl'.format(max_seq_length)
|
||||
indexmap_filename += '_{:0.2f}ssp'.format(short_seq_prob)
|
||||
indexmap_filename += '_{}s'.format(seed)
|
||||
indexmap_filename += '.npy'
|
||||
|
||||
if name == 'train':
|
||||
# force to use certain index files
|
||||
if args.train_idx_path is not None:
|
||||
indexmap_filename = args.train_idx_path
|
||||
|
||||
# Build the indexed mapping if not exist.
|
||||
if torch.distributed.get_rank() == 0 and \
|
||||
not os.path.isfile(indexmap_filename):
|
||||
print(' > WARNING: could not find index map file {}, building '
|
||||
'the indices on rank 0 ...'.format(indexmap_filename))
|
||||
|
||||
# Make sure the types match the helpers input types.
|
||||
check_equal(indexed_dataset.doc_idx.dtype, np.int64)
|
||||
check_equal(indexed_dataset.sizes.dtype, np.int32)
|
||||
|
||||
# Build samples mapping
|
||||
verbose = torch.distributed.get_rank() == 0
|
||||
start_time = time.time()
|
||||
print_rank_0(' > building samples index mapping for {} ...'.format(
|
||||
name))
|
||||
# First compile and then import.
|
||||
from megatron.data import helpers
|
||||
samples_mapping = helpers.build_mapping(
|
||||
indexed_dataset.doc_idx,
|
||||
indexed_dataset.sizes,
|
||||
num_epochs,
|
||||
max_num_samples,
|
||||
max_seq_length,
|
||||
short_seq_prob,
|
||||
seed,
|
||||
verbose,
|
||||
2 if binary_head else 1)
|
||||
print_rank_0(' > done building samples index mapping')
|
||||
np.save(indexmap_filename, samples_mapping, allow_pickle=True)
|
||||
print_rank_0(' > saved the index mapping in {}'.format(
|
||||
indexmap_filename))
|
||||
# Make sure all the ranks have built the mapping
|
||||
print_rank_0(' > elapsed time to build and save samples mapping '
|
||||
'(seconds): {:4f}'.format(
|
||||
time.time() - start_time))
|
||||
# This should be a barrier but nccl barrier assumes
|
||||
# device_index=rank which is not the case for model
|
||||
# parallel case
|
||||
if get_accelerator().device_count() > 0: # Skip when CPU-only
|
||||
counts = get_accelerator().LongTensor([1])
|
||||
torch.distributed.all_reduce(counts, group=parallel_state.get_data_parallel_group())
|
||||
torch.distributed.all_reduce(counts, group=parallel_state.get_pipeline_model_parallel_group())
|
||||
item = (torch.distributed.get_world_size() //
|
||||
torch.distributed.get_world_size(group=parallel_state.get_tensor_model_parallel_group()))
|
||||
check_equal(counts[0].item(), item)
|
||||
# Load indexed dataset.
|
||||
print_rank_0(' > loading indexed mapping from {}'.format(
|
||||
indexmap_filename))
|
||||
start_time = time.time()
|
||||
samples_mapping = np.load(indexmap_filename, allow_pickle=True, mmap_mode='r')
|
||||
print_rank_0(' loaded indexed file in {:3.3f} seconds'.format(
|
||||
time.time() - start_time))
|
||||
print_rank_0(' total number of samples: {}'.format(
|
||||
samples_mapping.shape[0]))
|
||||
|
||||
return samples_mapping
|