Compare commits


19 Commits
master ... 1.0

Author SHA1 Message Date
WuSheYu03 65e60561fe
Test PR 2024
Signed-off-by: WuSheYu03 <2893251844@qq.com>
2024-07-04 12:13:39 +00:00
SheYuWu03 3c86007ab1 Test PR 2024-07-03 19:38:11 +08:00
fengliangjun 27485bba55 !1376 Rename the branch from 1.0.0 to 1.0
Merge pull request !1376 from fengliangjun/1.0.0
2024-06-24 13:29:15 +00:00
glhyy c9fa42a3b3 !1362 Add data-cache detection and generation on non-master nodes for pre-training without shared storage
Merge pull request !1362 from glhyy/1.0.0
2024-06-24 07:11:13 +00:00
sunjunjie 8f4f1079c9 !1272 Rename AscendSpeed to MindSpeed
Merge pull request !1272 from sunjunjie/1.0.0
2024-06-03 02:13:09 +00:00
fengliangjun 5e4267a1e6 !1301 Update the Mixtral-MoE model to 32K
Merge pull request !1301 from fengliangjun/1.0.0
2024-05-23 01:09:13 +00:00
guoxinjie 9dba65cfcb !1277 Correct performance numbers
Merge pull request !1277 from guoxinjie/1.0.0
2024-05-13 03:43:01 +00:00
wucong 331ce89ed1 !1266 Tag already-verified models with the 【Model contributed by Ascend】 label
Merge pull request !1266 from wucong/dev1v1
2024-05-07 02:21:14 +00:00
黄宇豪 0b4fb8b645 !1245 fix: correct the evaluation script path in the 1.0.0 llama2-7B README
Merge pull request !1245 from 黄宇豪/1.0.0
2024-05-07 01:44:35 +00:00
liuyanghan 4ad901b67d !1250 Update repository compatibility info, update the README, and adjust the llama2-70B learning rate
Merge pull request !1250 from liuyanghan/1.0.0
2024-04-29 01:34:46 +00:00
glhyy f33da62b90 !1234 Update known issues in the README (1.0.0)
Merge pull request !1234 from glhyy/1.0.0
2024-04-16 02:22:42 +00:00
guoxinjie 4338fa3467 !1226 Remove megatron from the Q1 branch
Merge pull request !1226 from guoxinjie/24Q1
2024-04-15 02:07:25 +00:00
guoxinjie 2cdee76040 !1221 Pin the acceleration library commit ID on the Q1 branch
Merge pull request !1221 from guoxinjie/24Q1
2024-04-07 06:47:28 +00:00
黄宇豪 3a544f320b !1216 Unify weight paths and README style Merge pull request !1186 from 黄宇豪/master
Merge pull request !1216 from 黄宇豪/24Q1
2024-04-02 07:33:15 +00:00
glhyy 8fe29aba96 !1208 Fix the baichuan2 links in the README
Merge pull request !1208 from glhyy/24Q1
2024-04-01 08:42:58 +00:00
guoxinjie da5a204b70 !1203 Fix a bug in saving weights with the distributed optimizer
Merge pull request !1203 from guoxinjie/master
2024-04-01 02:26:10 +00:00
xiongliangcheng 56a707dd3c !1156 Fix errors in the README
Merge pull request !1156 from xiongliangcheng/master
2024-03-28 01:15:04 +00:00
liuyanghan a68fac8224 !1179 Document the data-loading issue in multi-node training
Merge pull request !1179 from liuyanghan/24Q1
2024-03-28 01:06:53 +00:00
shengjy ebe2f95e27 !1165 Update the llama2 README and correct the tokenizer notes
* llama2 readme update
2024-03-27 08:25:18 +00:00
420 changed files with 8685 additions and 30050 deletions

.gitignore

@ -142,12 +142,3 @@ cython_debug/
# pycharm stuff
.idea
# megatron core
/megatron/
# User stuff
/kernel*/
/logs/
/model_from_hf/
/model_weights/

OWNERS

@ -1,23 +1,44 @@
approvers:
- fengliangjun66
- wujianping996
- gp513
- dilililiwhy
reviewers:
- wujianping996
- fengliangjun66
- guhangsong
- leizhenzhen
- liuyanghan
- jzh
- guo-xinjie-1
- wucong0509
- dhrhank
- glhyy
- RyanAlexander
- sunjunjie1587
- dingzicha
- zhangjianxiang4
- guozhihua9814
- ckirchhoff
- block0219
- wqh17101
- timeash
reviewers:
- gp513
- guhangsong
- sanjaychan
- fengliangjun66
- mhh001
- huangyiming123
- zhangshengdong
- kingsleyandher
- guo-xinjie-1
- matrixssy
- ckirchhoff
- block0219
- wqh17101
- ZeroFadeAway
- Ronnie_zheng
- sibylk
- dl9864
- demo-v
- xulianhao
- yangyifei626
- z2fan
- wenjiang2357
- leizhenzhen
- liuyanghan
- Ares_Lzk
- flying-artillery
- xiong-liangcheng_admin
- gitee-code-template
- yaojia2021
- chantcalf
- kongfuziyue
- yuhui69
- guoyiwei111

README.md

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -1,70 +0,0 @@
# Security Statement
## System security hardening
1. Users are advised to enable ASLR level 2, also known as **full address space layout randomization**, on the system. It can be configured as follows:
```
echo 2 > /proc/sys/kernel/randomize_va_space
```
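To confirm that the setting took effect and to keep it across reboots, a minimal sketch (writing to /etc/sysctl.conf is the conventional approach and may differ on your distribution):
```
# check the current ASLR level (2 means full randomization)
cat /proc/sys/kernel/randomize_va_space
# persist the setting across reboots
echo "kernel.randomize_va_space = 2" >> /etc/sysctl.conf
sysctl -p
```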
## Recommended run user
For security and least-privilege reasons, running ModelLink under root or other administrator accounts is not recommended.
## File permission control
1. Set the system umask to 0027 or stricter on the host (both the physical host and containers), so that newly created directories default to at most 750 and newly created files to at most 640.
2. Apply permission control to sensitive content such as personal data, business assets, source files, and files saved during training. Typical scenarios include the ModelLink installation directory and datasets shared by multiple users; recommended permissions are listed in Table 1.
3. ModelLink generates training data during preprocessing and weight files during training. These files default to permission 640; users can apply stricter controls to generated files as needed.
**Table 1 Recommended maximum permissions for files and directories by scenario**
| Type | Recommended maximum Linux permission |
| --------------- | --------------------|
| User home directory | 750 (rwxr-x---) |
| Program files (including scripts, libraries, etc.) | 550 (r-xr-x---) |
| Program file directories | 550 (r-xr-x---) |
| Configuration files | 640 (rw-r-----) |
| Configuration file directories | 750 (rwxr-x---) |
| Log files (finished or archived) | 440 (r--r-----) |
| Log files (actively written) | 640 (rw-r-----) |
| Log file directories | 750 (rwxr-x---) |
| Debug files | 640 (rw-r-----) |
| Debug file directories | 750 (rwxr-x---) |
| Temporary file directories | 750 (rwxr-x---) |
| Maintenance and upgrade file directories | 770 (rwxrwx---) |
| Service data files | 640 (rw-r-----) |
| Service data file directories | 750 (rwxr-x---) |
| Directories for key components, private keys, certificates, and ciphertext files | 700 (rwx------) |
| Key components, private keys, certificates, and encrypted ciphertext | 600 (rw-------) |
| Encryption/decryption interfaces and scripts | 500 (r-x------) |
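As a minimal illustration of the umask recommendation in item 1 above (values taken from that item, not from any ModelLink-specific requirement):
```
# a umask of 0027 caps new directories at 750 and new files at 640
umask 0027
mkdir demo_dir && touch demo_dir/demo_file
ls -ld demo_dir           # drwxr-x---
ls -l  demo_dir/demo_file # -rw-r-----
```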
## Data security statement
1. ModelLink saves model files through the checkpointing module in megatron; some of these files use the risky pickle module and may therefore pose a data risk.
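The pickle dependency itself cannot be removed without changing the checkpoint format. As a hedged mitigation sketch (the paths and file pattern are illustrative, not part of ModelLink), checkpoint permissions can be tightened and checksums recorded so that unexpected modification of the pickled files is detected before they are loaded again:
```
# adjust the path to your actual checkpoint save directory
chmod -R o-rwx ./ckpt/
find ./ckpt/ -type f -name "*.pt" -exec sha256sum {} \; > ckpt.sha256
# later, before loading the checkpoints again
sha256sum -c ckpt.sha256
```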
## Runtime security statement
1. Write training scripts to match the available resources. If a script does not match the resources (for example, the dataset to be loaded exceeds the available memory, or data generated locally during training exceeds the available disk space), errors may occur and the process may exit unexpectedly. A quick resource check is sketched after this list.
2. ModelLink uses PyTorch internally; a version mismatch can cause runtime errors. For details, see the PyTorch [security statement](https://gitee.com/ascend/pytorch#%E5%AE%89%E5%85%A8%E5%A3%B0%E6%98%8E).
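A simple pre-launch check of the resources mentioned in item 1, using generic Linux commands rather than any ModelLink interface (the npu-smi tool is assumed to ship with the Ascend driver):
```
free -g        # host memory available for dataset loading
df -h .        # free disk space where training output is written
npu-smi info   # device status, if the Ascend driver tools are installed
```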
## Public network address statement
| Type | Open-source code address | File name | Public IP address / URL / domain / email address | Purpose |
|--------|--------------------------|--------------------------------------|--------------------------------------------------------------------------|-----------|
| Open-source code import | Not involved | modellink/model/language_model.py:85 | https://github.com/kingoflolz/mesh-transformer-jax/ | Details link |
| Open-source code import | Involved | tests/pipeline/common.py:6 | https://github.com/microsoft/DeepSpeed/blob/master/tests/unit/common.py | Source code link |
| Open-source code import | Involved | tests/pipeline/conftest.py:6 | https://github.com/microsoft/DeepSpeed/blob/master/tests/conftest.py | Source code link |
## Public interface statement
ModelLink has not yet released a wheel package and has no formal public API; all functionality is invoked through shell scripts. The five entry scripts are [pretrain_gpt.py](https://gitee.com/ascend/ModelLink/blob/master/pretrain_gpt.py), [inference.py](https://gitee.com/ascend/ModelLink/blob/master/inference.py), [evaluation.py](https://gitee.com/ascend/ModelLink/blob/master/evaluation.py), [preprocess_data.py](https://gitee.com/ascend/ModelLink/blob/master/tools/preprocess_data.py), and [convert_ckpt.py](https://gitee.com/ascend/ModelLink/blob/master/tools/checkpoint/convert_ckpt.py).
## Communication security hardening
[Communication security hardening notes](https://gitee.com/ascend/pytorch/blob/master/SECURITYNOTE.md#%E9%80%9A%E4%BF%A1%E5%AE%89%E5%85%A8%E5%8A%A0%E5%9B%BA)
## Communication matrix
[Communication matrix description](https://gitee.com/ascend/pytorch/blob/master/SECURITYNOTE.md#%E9%80%9A%E4%BF%A1%E7%9F%A9%E9%98%B5%E4%BF%A1%E6%81%AF)


@ -1,76 +1,55 @@
import os
import stat
import sys
import unittest
from pathlib import Path
import xmlrunner
# =============================
# ST test, run with shell
# =============================
def success_check(res):
if res != 0:
sys.exit(1)
class UT_Test:
def __init__(self):
base_dir = Path(__file__).absolute().parent.parent
test_dir = os.path.join(base_dir, 'tests')
self.ut_file = os.path.join(test_dir, "ut")
def run_ut(self):
command = f"python3.8 -m pytest -k 'not allocator' {self.ut_file}"
ut_exitcode = os.system(command)
if ut_exitcode == 0:
print("UT test success")
else:
print("UT failed")
exit(1)
def success_check_ut(res):
if len(res.failures) + len(res.errors) != 0:
sys.exit(1)
class ST_Test:
def __init__(self):
base_dir = Path(__file__).absolute().parent.parent
test_dir = os.path.join(base_dir, 'tests')
BASE_DIR = Path(__file__).absolute().parent.parent
TEST_DIR = os.path.join(BASE_DIR, 'tests')
st_dir = "st"
llama_pretrain_shell_file = os.path.join(
test_dir, st_dir, "test_llama_pretrain_ptd.sh")
llama_inference_shell_file = os.path.join(
test_dir, st_dir, "test_llama_inference_ptd.sh")
gemma_pretrain_shell_file = os.path.join(
test_dir, st_dir, "test_gemma_pretrain_ptd.sh")
gemma_inference_shell_file = os.path.join(
test_dir, st_dir, "test_gemma_inference_ptd.sh")
llama_vpp_pretrain_shell_file = os.path.join(
test_dir, st_dir, "test_llama_vpp_pretrain_ptd.sh")
llama_instruction_shell_file = os.path.join(
test_dir, st_dir, "test_llama_instruction_ptd.sh")
llama_dir = "test_llama"
bloom_dir = "test_bloom"
self.st_file_list = [
llama_pretrain_shell_file,
llama_inference_shell_file,
gemma_pretrain_shell_file,
gemma_inference_shell_file,
llama_vpp_pretrain_shell_file,
llama_instruction_shell_file
bloom_shell_file = os.path.join(
TEST_DIR, st_dir, bloom_dir, "test_bloom_ptd.sh")
llama_shell_file = os.path.join(
TEST_DIR, st_dir, llama_dir, "test_llama_ptd.sh")
lora_shell_file = os.path.join(
TEST_DIR, st_dir, llama_dir, "test_lora_llama_ptd.sh")
llama_inference_shell_file = os.path.join(
TEST_DIR, st_dir, llama_dir, "test_llama_inference_ptd.sh")
# restore these entries once the new ST tests are in place
self.shell_file_list = [
# llama_inference_shell_file,
# llama_shell_file,
# bloom_shell_file,
# lora_shell_file,
]
def run_st(self):
all_success = True
for shell_file in self.st_file_list:
command = f"sh {shell_file}"
st_exitcode = os.system(command)
if st_exitcode != 0:
all_success = False
print(f"ST run {shell_file} failed")
break
if all_success:
print("ST test success")
else:
print("ST failed")
exit(1)
def run_shell(self):
for shell_file in self.shell_file_list:
success_check(os.system("sh {}".format(shell_file)))
# ===============================================
@ -79,7 +58,9 @@ class ST_Test:
if __name__ == "__main__":
ut = UT_Test()
ut.run_ut()
st = ST_Test()
st.run_st()
st_test = ST_Test()
st_test.run_shell()
test_loader = unittest.TestLoader()
discover = test_loader.discover(start_dir="../tests/ut", pattern="test*.py")
runner = unittest.TextTestRunner()
success_check_ut(runner.run(discover))


@ -1,4 +1,4 @@
# Aquila-7B
# Aquila-7B $\color{black}{\bf\tiny{【Model contributed by Ascend】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="https://gitee.com/ascend/ModelLink/blob/modellink/examples/bloom/README_en.md">English</a> </b>
@ -24,146 +24,148 @@ The hardware configuration for training Aquila-7B is as follows:
### Script
1. Clone the repository to your local server and switch to the modellink branch:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Set up the conda environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 通过互联网上提供的pip源安装 torch可能需要尝试合适的包含这个torch==2.1.0版本的pip源
pip install torch==2.1.0
# 通过PTA上提供的安装包以whl文件方式安装aarch64架构上的2.1.0版本的torch_npu
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
# 通过PTA上提供的安装包以whl文件方式安装apex
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 拉取MindSpeed源代码进入MindSpeed目录然后源码方式安装mindspeed加速包
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖包
pip install -r requirements.txt
```
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 通过互联网上提供的pip源安装 torch可能需要尝试合适的包含这个torch==2.1.0版本的pip源
pip install torch==2.1.0
# 通过PTA上提供的安装包以whl文件方式安装aarch64架构上的2.1.0版本的torch_npu
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
# 通过PTA上提供的安装包以whl文件方式安装apex
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 拉取MindSpeed源代码进入MindSpeed目录然后源码方式安装mindspeed加速包
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖包
pip install -r requirements.txt
```
3. Download the Aquila-7B [config, tokenizer, and pre-trained weights](https://huggingface.co/BAAI/Aquila-7B/tree/main) with a browser
and save them to the ModelLink/model_from_hf/Aquila-7B/ directory.
and save them to the ModelLink/model_from_hf/Aquila-7B/ directory.
4. Data preprocessing
Step 1: [download the dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet) with a browser and save it to the ModelLink/dataset/ directory
Step 1: [download the dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet) with a browser and save it to the ModelLink/dataset/ directory
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
Step 2: preprocess the dataset with the tokenizer specified for Aquila-7B
Step 2: preprocess the dataset with the tokenizer specified for Aquila-7B
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/Aquila-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
--output-prefix ./dataset/Aquila-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/Aquila-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
--output-prefix ./dataset/Aquila-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5. Weight conversion
Convert the model weights from HuggingFace format to Megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
Convert the model weights from HuggingFace format to Megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila-7B/ \
--save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila-7B/ \
--save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
```
Convert Megatron weights with any parallel slicing strategy to HuggingFace format
***(This scenario is generally used to convert a trained Megatron model back to HuggingFace format)***
Convert Megatron weights with any parallel slicing strategy to HuggingFace format
***(This scenario is generally used to convert a trained Megatron model back to HuggingFace format)***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Aquila-7B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Aquila-7B/mg2hg/
```
6. Configure the Aquila-7B pre-training script
Configure the relevant parameters in the pre-training script
Configure the relevant parameters in the pre-training script
```shell
# 根据实际情况配置词表、数据集、模型参数保存路径
TOKENIZER_PATH="./model_from_hf/Aquila-7B/" #tokenizer 路径
DATA_PATH="./dataset/Aquila-7B/alpaca_text_document" #数据集 路径
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
CKPT_SAVE_DIR="./ckpt/Aquila-7B/"
# 如果不需要保存权重就不需要设置CKPT_SAVE_DIR, 并且启动脚本里应不使用 `--save` 参数
# 如果需要保存权重则需要设置CKPT_SAVE_DIR, 并且启动脚本里应使用 `--save $CKPT_SAVE_DIR` 进行类似配置。
# 如果不需要加载权重就不需要设置CKPT_LOAD_DIR, 并且启动脚本里应不使用 `--load` 参数
# 如果需要加载权重则需要设置CKPT_LOAD_DIR, 并且启动脚本里应使用 `--load $CKPT_LOAD_DIR` 进行类似配置。
# 进行断点续训时应先按以上save的场景配置待完成ckpt保存后再修改相应参数按以上load的场景加载已保存的ckpt。
```
```shell
# 根据实际情况配置词表、数据集、模型参数保存路径
TOKENIZER_PATH="./model_from_hf/Aquila-7B/" #tokenizer 路径
DATA_PATH="./dataset/Aquila-7B/alpaca_text_document" #数据集 路径
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
CKPT_SAVE_DIR="./ckpt/Aquila-7B/"
# 如果不需要保存权重就不需要设置CKPT_SAVE_DIR, 并且启动脚本里应不使用 `--save` 参数
# 如果需要保存权重则需要设置CKPT_SAVE_DIR, 并且启动脚本里应使用 `--save $CKPT_SAVE_DIR` 进行类似配置。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
# 如果不需要加载权重就不需要设置CKPT_LOAD_DIR, 并且启动脚本里应不使用 `--load` 参数
# 如果需要加载权重则需要设置CKPT_LOAD_DIR, 并且启动脚本里应使用 `--load $CKPT_LOAD_DIR` 进行类似配置。
# 进行断点续训时应先按以上save的场景配置待完成ckpt保存后再修改相应参数按以上load的场景加载已保存的ckpt。
```
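For orientation, the variables above feed the launch command inside `examples/aquila/pretrain_aquila_7b_ptd.sh` roughly as sketched below; `GPT_ARGS`, `DATA_ARGS`, and `OUTPUT_ARGS` are assumed placeholder names for the argument groups defined in that script, not guaranteed to match it verbatim.
```shell
# sketch of the expected launch pattern, not the verbatim script
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --load $CKPT_LOAD_DIR \
    --save $CKPT_SAVE_DIR \
    | tee logs/train_aquila_7b_ptd.log
```
Remove the `--load` or `--save` line, as described in the comments above, when the corresponding directory is not set.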
7. Launch the Aquila-7B pre-training script
Before running the pre-training script, execute the set_env.sh script to set the environment variables; alternatively, this can be done inside the pre-training script.
Before running the pre-training script, execute the set_env.sh script to set the environment variables; alternatively, this can be done inside the pre-training script.
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
Launch Aquila-7B pre-training as follows:
Launch Aquila-7B pre-training as follows:
```shell
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
```shell
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
**Note**: For multi-machine training without shared storage, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, non-master nodes decide from the distributed arguments whether they need to load data and check or generate the corresponding cache.
### Performance
@ -171,10 +173,10 @@ The hardware configuration for training Aquila-7B is as follows:
Performance comparison of Aquila-7B on **Ascend chips** and the **reference chips**:
| Device | Model | Iterations | Token throughput (tokens/p/s) | Single-step iteration time (s/step) |
|------|------------|------|------------------------|----------------------|
| NPU | Aquila-7B | 1000 | 2849 | 5.75 |
| Reference | Aquila-7B | 1000 | 2874 | 5.70 |
| Device | Hardware | Model | Iterations | Token throughput (tokens/p/s) | Single-step iteration time (s/step) |
|------|---------------|------------|------|------------------------|----------------------|
| NPU | 910b 1node*8p | Aquila-7B | 1000 | 2849 | 5.75 |
| Reference | | Aquila-7B | 1000 | 2874 | 5.70 |
@ -184,7 +186,7 @@ Performance comparison of Aquila-7B on **Ascend chips** and the **reference chips**:
Inference differs from pre-training in that the pre-trained weights must be loaded. Note that the model structure parameters used when converting the weights must match those used when running the task.
After the weight conversion is complete, configure the Aquila-7B inference script `examples/aquila/generate_aquila_7b_ptd.sh`, specifying the correct weight path, vocabulary path, and so on (the sample below is for reference only)
After the weight conversion is complete, configure the Aquila-7B inference script `tasks/inference/generate_aquila_7b_ptd.sh`, specifying the correct weight path, vocabulary path, and so on (the sample below is for reference only)
```shell
# 请按实际情况修改模型权重路径和分词器路径
@ -195,14 +197,14 @@ TOKENIZER_PATH="./model_from_hf/Aquila-7B/"
Launch Aquila-7B inference:
```shell
bash examples/aquila/generate_aquila_7b_ptd.sh
bash ./tasks/inference/generate_aquila_7b_ptd.sh
```
Some inference samples are shown below:
Some inference samples are shown below:
Aquila-7B:
![aquila-7B_generate.png](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila/aquila_7B_generate_ptd_0205.png)
![aquila-7B_generate.png](../../sources/images/aquila/aquila_7B_generate_ptd_0205.png)
## Evaluation
@ -210,7 +212,7 @@ Aquila-7B:
Evaluation, like inference, also requires loading the converted weights. Note that the model structure parameters used when converting the weights must match those used when running the evaluation task.
After the weight conversion is complete, configure the Aquila-7B evaluation script `examples/aquila/evaluate_aquila_7b_ptd.sh`, specifying the correct weight path, vocabulary path, evaluation data path, and evaluation task name (the sample below is for reference only)
After the weight conversion is complete, configure the Aquila-7B evaluation script `tasks/evaluation/evaluate_aquila_7b_ptd.sh`, specifying the correct weight path, vocabulary path, evaluation data path, and evaluation task name (the sample below is for reference only)
```shell
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
@ -222,7 +224,7 @@ TASK="boolq"
Launch Aquila-7B evaluation:
```shell
bash examples/aquila/evaluate_aquila_7b_ptd.sh
bash tasks/evaluation/evaluate_aquila_7b_ptd.sh
```
Aquila-7B evaluation results on **Ascend NPU**:


@ -1,4 +1,4 @@
# Aquila-7B
# Aquila-7B $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
@ -26,140 +26,141 @@ Here's a hardware summary of pre-training Aquila-7B:
1. Clone the repository to your local server and switch to modellink branch:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build conda environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch, torch_npu and apex
pip install torch==2.1.0
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch, torch_npu and apex
pip install torch==2.1.0
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# source the set_env.sh file based on your host settings(you may need to change the path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# use git to clone the MindSpeed source code, enter the directory, then install mindspeed package by source code
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# source the set_env.sh file based on your host settings(you may need to change the path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# git clone the MindSpeed source code, enter the directory, then install mindspeed package by source code
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Download the Aquila-7B model, config, and tokenizer from [here](https://huggingface.co/BAAI/Aquila-7B/tree/main)
save to ModelLink/model_from_hf/Aquila7B/ directory.
save to ModelLink/HF_Aquila7B_downloaded/ directory.
4. Prepare dataset.
Prepare dataset.
step1: Download the datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet), save to ModelLink/dataset/ directory.
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
step2: use Aquila-7B specified tokenizer to pre-process data:
```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/Aquila-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
--output-prefix ./dataset/Aquila-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
step1: Download the datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet), save to ModelLink/dataset/ directory.
4. Weights convert
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
step2: use Aquila-7B specified tokenizer to pre-process data:
```shell
# please modify the path to set_env.sh based on your environment.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila-7B/ \
--save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
```
```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/Aquila-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
--output-prefix ./dataset/Aquila-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
5. Weights convert
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Aquila-7B/mg2hg/
```
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
5. Config Aquila-7B pre-training script.
```shell
# please modify the path to set_env.sh based on your environment.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
Config the environment variables in aquila pretrain script
python tools/checkpoint/util.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila-7B/ \
--save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
```
```shell
# set dataset path, CKPT load path for loading weights, and the tokenizer path
TOKENIZER_PATH="./model_from_hf/Aquila-7B/" #tokenizer path
DATA_PATH="./dataset/Aquila-7B/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/" # pointing to the converted model weights
CKPT_SAVE_DIR="./ckpt/Aquila-7B/" # pointing to the path to save checkpoints
```
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
*Note that if you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR, and remove the `--load` parameter from the training script, and vice versa*
*If you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR, and remove the `--save $CKPT_SAVE_DIR` parameter from the training script, and vice versa*
*When you want to save checkpoint and load it in future pre-training, just follow the above "save" and "load" suggestions.*
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Aquila-7B/mg2hg/
```
6. Launch Aquila-7B pre-training script.
6. Config Aquila-7B pre-training script.
Before running the pre-training script, please execute the set_env.sh script first to setup environment variables. Alternatively, you can do this inside aquila pre-training script.
Config the environment variables in aquila pretrain script
```shell
# you may need to change the path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
```shell
# set dataset path, CKPT load path for loading weights, and the tokenizer path
TOKENIZER_PATH="./model_from_hf/Aquila-7B/" #tokenizer path
DATA_PATH="./dataset/Aquila-7B/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/" # pointing to the converted model weights
CKPT_SAVE_DIR="./ckpt/Aquila-7B/" # pointing to the path to save checkpoints
```
Start pre-training Aquila-7B model:
*Note that if you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR, and remove the `--load` parameter from the training script, and vice versa*
*If you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR, and remove the `--save $CKPT_SAVE_DIR` parameter from the training script, and vice versa*
*When you want to save checkpoint and load it in future pre-training, just follow the above "save" and "load" suggestions.*
```shell
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
7. Launch Aquila-7B pre-training script.
**Note**: If you use multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With this parameter set, non-master nodes decide from the distributed arguments whether they need to load data and check or generate the corresponding cache.
Before running the pre-training script, please execute the set_env.sh script first to setup environment variables. Alternatively, you can do this inside aquila pre-training script.
```shell
# you may need to change the path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
Start pre-training Aquila-7B model:
```shell
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
**Note**: If you use multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With this parameter set, non-master nodes decide from the distributed arguments whether they need to load data and check or generate the corresponding cache.
### Performance
@ -167,16 +168,16 @@ Here's a hardware summary of pre-training Aquila-7B:
The performance of Aquila-7B in Ascend NPU and reference device:
| Device | Model | Iterations | throughput rate (tokens/p/s) | single iteration step time (s/step) |
| --------- | --------- | ---------- | ---------------------------- | ----------------------------------- |
| NPU | Aquila-7B | 1000 | 2849 | 5.75 |
| Reference | Aquila-7B | 1000 | 2874 | 5.70 |
| Device | Hardware | Model | Iterations | throughput rate (tokens/p/s) | single iteration step time (s/step) |
| --------- | ------------- | --------- | ---------- | ---------------------------- | ----------------------------------- |
| NPU | 910b 1node*8p | Aquila-7B | 1000 | 2849 | 5.75 |
| Reference | | Aquila-7B | 1000 | 2874 | 5.70 |
## Inference
We support MindSpeed Inference for text generation with Aquila 7B model.
Inference is different from pre-training because it requires loading the pre-trained model weights. Therefore, we need to complete the aforementioned model weight conversion task first, then configure the Aquila-7B Inference shell script `examples/aquila/generate_aquila_7b_ptd.sh`. "CKPT_LOAD_DIR" must point to the converted weights directory, and "TOKENIZER_PATH" must point to the directory which contains Aquila vocabulary files -- in our example, it is "./model_from_hf/Aquila-7B/". In your operation, please fill in correct value based on your actual scenario.
Inference is different from pre-training because it requires loading the pre-trained model weights. Therefore, we need to complete the aforementioned model weight conversion task first, then configure the Aquila-7B Inference shell script `tasks/inference/generate_aquila_7b_ptd.sh`. "CKPT_LOAD_DIR" must point to the converted weights directory, and "TOKENIZER_PATH" must point to the directory which contains Aquila vocabulary files -- in our example, it is "./HF_Aquila7B_downloaded". In your operation, please fill in correct value based on your actual scenario.
```shell
# please change to actual values
@ -187,12 +188,12 @@ TOKENIZER_PATH="./model_from_hf/Aquila-7B/"
Start Aquila-7B Inference:
```shell
bash ./examples/aquila/generate_aquila_7b_ptd.sh
bash ./tasks/inference/generate_aquila_7b_ptd.sh
```
Sample results of Aquila-7B Inference:
![aquila-7B_generate.png](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila/aquila_7B_generate.png)
![aquila-7B_generate.png](../../sources/images/aquila/aquila_7B_generate.png)
## Evaluation with Benchmark
@ -200,7 +201,7 @@ We use BoolQ benchmark to evaluate our model. You can [go to the BoolQ Benchmark
The evaluation task is similar to the inference task; it also requires loading the pre-trained model weights. Please note that the model structure parameters used in converting weights should be consistent with those used in running the evaluation task.
After weight conversion is complete, we configure the Aquila-7B evaluation script `examples/aquila/evaluate_aquila_7b_ptd.sh`. We need to correctly specify the path to load weights, the path to tokenizer and vocab, and so on (the following example is for reference only)
After weight conversion is complete, we configure the Aquila-7B evaluation script `tasks/evaluation/evaluate_aquila_7b_ptd.sh`. We need to correctly specify the path to load weights, the path to tokenizer and vocab, and so on (the following example is for reference only)
```shell
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
@ -212,7 +213,7 @@ TASK="boolq"
Start evaluation task
```shell
bash ./examples/aquila/evaluate_aquila_7b_ptd.sh
bash ./tasks/evaluation/evaluate_aquila_7b_ptd.sh
```
Sample Aquila-7B performance running in **Ascend NPU**:


@ -3,6 +3,7 @@
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
CKPT_SAVE_DIR="your checkpoint save dir"
DATA_PATH="your training data dir"
@ -86,6 +87,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--load $CKPT_LOAD_DIR \
| tee logs/train_aquila_7b_ptd.log


@ -1,524 +0,0 @@
# Aquila2 $\color{black}{\bf\tiny{【Model contributed by Community】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
</p>
- [Aquila2-7B](#7b)
- [训练](#7b-training)
- [脚本](#7b-script)
- [性能](#7b-performance)
- [吞吐](#7b-throughput)
- [推理](#7b-inference)
- [评估](#7b-evaluation)
- [Aquila2-34B](#34b)
- [训练](#34b-training)
- [脚本](#34b-script)
- [性能](#34b-performance)
- [吞吐](#34b-throughput)
- [推理](#34b-inference)
- [评估](#34b-evaluation)
<h1 id="7b">Aquila2-7B</h1>
<h2 id="7b-training">训练</h2>
Aquila2-7B 训练的硬件配置如下:
| 硬件 | 配置 |
|:---:|:---------------:|
| NPU | 8 x Ascend NPUs |
<h3 id="7b-script">脚本</h3>
1. 克隆仓库到本地服务器并切换到modellink分支
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建conda环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 通过 PTA 包提供的 whl 安装 torch、torch_npu 和 apex例如
pip install torch-2.2.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖包
pip install -r requirements.txt
```
3. 使用浏览器下载 [Aquila2-7B模型的配置tokenizer和预训练权重](https://huggingface.co/BAAI/Aquila2-7B/tree/main)
保存在 ModelLink/model_from_hf/Aquila2-7B/ 目录。
4. 权重转换
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila2-7B/ \
--save-dir ./model_weights/Aquila2-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--tokenizer-model ./model_from_hf/Aquila2-7B/tokenizer.json
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila2-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Aquila2-7B/mg2hg/
```
权重转换适用于预训练、微调、推理和评估,根据任务不同调整参数 `target-tensor-parallel-size``target-pipeline-parallel-size`
5. 预训练
5.1 准备数据集
下载 Aquila2-7B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Aquila2-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-7B/ \
--output-prefix ./dataset/Aquila2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 预训练
配置 Aquila2-7B 训练脚本: examples/aquila2/pretrain_aquila2_7b_ptd.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
TOKENIZER_PATH="./model_from_hf/Aquila2-7B/" #tokenizer 路径
DATA_PATH="./dataset/Aquila2-7B/alpaca_text_document" #数据集 路径
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
CKPT_SAVE_DIR="./ckpt/Aquila2-7B/"
```
- 如果不需要加载权重就不需要设置CKPT_LOAD_DIR, 并且启动脚本里应不使用 `--load` 参数
- 如果不需要保存权重就不需要设置CKPT_SAVE_DIR, 并且启动脚本里应不使用 `--save` 参数
- 进行断点续训时应先按以上save的场景配置待完成ckpt保存后再修改相应参数按以上load的场景加载已保存的ckpt。
启动 Aquila2-7B 预训练脚本: examples/aquila2/pretrain_aquila2_7b_ptd.sh
```shell
bash examples/aquila2/pretrain_aquila2_7b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/Aquila2-7B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-7B/ \
--output-prefix ./finetune_dataset/Aquila2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本 `pretrain_aquila2_7b_ptd.sh` 一致. *区别是数据集,以及增加训练参数`--is-instruction-dataset`*
增加微调参数`--finetune`,使微调从第一步开始。
```bash
DATA_PATH="./finetune_dataset/Aquila2-7B/alpaca"
CKPT_LOAD_DIR="./ckpt/Aquila2-7B/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
<h3 id="7b-performance">性能</h3>
<h4 id="7b-throughput">吞吐</h4>
Aquila2-7B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数| token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) |
|------|------------|------|------------------------|----------------------|
| NPU | Aquila2-7B | 5000 | 3323 | 4.93 |
| 参考 | Aquila2-7B | 5000 | 2673 | 6.13 |
<h2 id="7b-inference">推理</h2>
我们支持使用 Aquila2-7B进行文本生成的推理。
推理与预训练不同,我们必须加载预训练权重,请注意:在转换权重时使用的模型结构参数,和运行评估任务时使用的模型结构参数,应保持一致。
权重转换完成后我们配置Aquila2-7B推理脚本`examples/aquila2/generate_aquila2_7b_ptd.sh`,需要正确指定加载权重的路径,词表路径等(下面样例仅供参考)
```shell
# 请按实际情况修改模型权重路径和分词器路径
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-7B/"
```
启动Aquila2-7B推理:
```shell
bash examples/aquila2/generate_aquila2_7b_ptd.sh
```
部分推理样例如下:
Aquila2-7B:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila2/aquila2-7b-generate.png)
<h2 id="7b-evaluation">评估</h2>
我们使用 BoolQ benchmark 来评估我们的模型。在[Benchmark下载页面](https://github.com/google-research-datasets/boolean-questions)找到[数据集](https://storage.cloud.google.com/boolq/dev.jsonl)下载后保存。例如保存在ModelLink/boolq/test目录下。
评估与推理类似,也必须加载转换后的权重,请注意:在转换权重时使用的模型结构参数,和运行评估任务时使用的模型结构参数,应保持一致。
权重转换完成后我们配置Aquila2-7B评估脚本 `examples/aquila2/evaluate_aquila2_7b_ptd.sh`,需要正确指定加载权重的路径,词表路径,评估数据的路径,以及评估任务的名字等(下面样例仅供参考)
```shell
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-7B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"
```
启动Aquila2-7B评估
```shell
bash examples/aquila2/evaluate_aquila2_7b_ptd.sh
```
Aquila2-7B在**Ascend NPU**中的评测表现:
| 任务 | 模型 | 昇腾值|社区值|
|------------------------------------------------------------------------|------------|--------|------|
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Aquila2-7B | 77.8% | 77.6% |
<h1 id="34b">Aquila2-34B</h1>
<h2 id="34b-training">训练</h2>
Aquila2-34B 训练的硬件配置如下:
| 硬件 | 配置 |
|:---:|:---------------:|
| NPU | 16 x Ascend NPUs |
<h3 id="34b-script">脚本</h3>
1. 克隆仓库到本地服务器并切换到modellink分支
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建conda环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 通过 PTA 包提供的 whl 安装 torch、torch_npu 和 apex例如
pip install torch-2.2.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖包
pip install -r requirements.txt
```
3. 使用浏览器下载 [Aquila2-34B模型的配置tokenizer和预训练权重](https://huggingface.co/BAAI/Aquila2-34B/tree/main)
保存在 ModelLink/model_from_hf/Aquila2-34B/ 目录。
4. 权重转换
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila2-34B/ \
--save-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp2/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 2 \
--tokenizer-model ./model_from_hf/Aquila2-34B/tokenizer.json \
--params-dtype bf16
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila2-34B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Aquila2-34B/mg2hg/
```
权重转换适用于预训练、微调、推理和评估,根据任务不同调整参数 `target-tensor-parallel-size``target-pipeline-parallel-size`
5. 预训练
5.1 准备数据集
下载 Aquila2-34B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Aquila2-34B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-34B/ \
--output-prefix ./dataset/Aquila2-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 预训练
配置 Aquila2-34B 训练脚本: examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
TOKENIZER_PATH="./model_from_hf/Aquila2-34B/" #tokenizer 路径
DATA_PATH="./dataset/Aquila2-34B/alpaca_text_document" #数据集 路径
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp2/"
CKPT_SAVE_DIR="./ckpt/Aquila2-34B/"
```
- 如果不需要加载权重就不需要设置CKPT_LOAD_DIR, 并且启动脚本里应不使用 `--load` 参数
- 如果不需要保存权重就不需要设置CKPT_SAVE_DIR, 并且启动脚本里应不使用 `--save` 参数
- 进行断点续训时应先按以上save的场景配置待完成ckpt保存后再修改相应参数按以上load的场景加载已保存的ckpt。
启动 Aquila2-34B 预训练脚本: examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```shell
bash examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/Aquila2-34B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-34B/ \
--output-prefix ./finetune_dataset/Aquila2-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本 `pretrain_aquila2_34b_ptd_16p.sh` 一致. *区别是数据集,以及增加训练参数`--is-instruction-dataset`*
增加微调参数`--finetune`,使微调从第一步开始。
```bash
DATA_PATH="./finetune_dataset/Aquila2-34B/alpaca"
CKPT_LOAD_DIR="./ckpt/Aquila2-34B/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
<h3 id="34b-performance">性能</h3>
<h4 id="34b-throughput">吞吐</h4>
Aquila2-34B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数| token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) |
|------|------------|------|------------------------|----------------------|
| NPU | Aquila2-34B | 5000 | 854 | 307 |
| 参考 | Aquila2-34B | 5000 | 732 | 358 |
<h2 id="34b-inference">推理</h2>
我们支持使用 Aquila2-34B进行文本生成的推理。
推理与预训练不同,我们必须加载预训练权重,请注意:在转换权重时使用的模型结构参数,和运行评估任务时使用的模型结构参数,应保持一致。
权重转换完成后我们配置Aquila2-34B推理脚本`examples/aquila2/generate_aquila2_34b_ptd.sh`,需要正确指定加载权重的路径,词表路径等(下面样例仅供参考)
```shell
# 请按实际情况修改模型权重路径和分词器路径
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-34B/"
```
启动Aquila2-34B推理:
```shell
bash examples/aquila2/generate_aquila2_34b_ptd.sh
```
部分推理样例如下:
Aquila2-34B:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila2/aquila2-34b-generate.png)
<h2 id="34b-evaluation">评估</h2>
我们使用 BoolQ benchmark 来评估我们的模型。在[Benchmark下载页面](https://github.com/google-research-datasets/boolean-questions)找到[数据集](https://storage.cloud.google.com/boolq/dev.jsonl)下载后保存。例如保存在ModelLink/boolq/test目录下。
评估与推理类似,也必须加载转换后的权重,请注意:在转换权重时使用的模型结构参数,和运行评估任务时使用的模型结构参数,应保持一致。
权重转换完成后我们配置Aquila2-34B评估脚本 `examples/aquila2/evaluate_aquila2_34b_ptd.sh`,需要正确指定加载权重的路径,词表路径,评估数据的路径,以及评估任务的名字等(下面样例仅供参考)
```shell
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-34B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"
```
启动Aquila2-34B评估
```shell
bash examples/aquila2/evaluate_aquila2_34b_ptd.sh
```
Aquila2-34B在**Ascend NPU**中的评测表现:
| 任务 | 模型 | 昇腾值|社区值|
|------------------------------------------------------------------------|------------|--------|------|
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Aquila2-34B | 88.0% | 87.0% |


@ -1,514 +0,0 @@
# Aquila2 $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Community】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
</p>
- [Aquila2-7B](#7b)
- [Training](#7b-training)
- [Script](#7b-script)
- [Performance](#7b-performance)
- [Machine performance](#7b-throughput)
- [Inference](#7b-inference)
- [Evaluation](#7b-evaluation)
- [Aquila2-34B](#34b)
- [Training](#34b-training)
- [Script](#34b-script)
- [Performance](#34b-performance)
- [Machine performance](#34b-throughput)
- [Inference](#34b-inference)
- [Evaluation](#34b-evaluation)
<h1 id="7b">Aquila2-7B</h1>
<h2 id="7b-training">Training</h2>
Here's a hardware summary of pre-training Aquila2-7B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
<h3 id="7b-script">Script</h3>
1. Clone the repository to your local server and switch to modellink branch:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build conda environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch, torch_npu and apex
pip install torch-2.2.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# source the set_env.sh file based on your host settings(you may need to change the path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# use git to clone the MindSpeed source code, enter the directory, then install mindspeed package by source code
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Download the Aquila2-7B model, config, and tokenizer from [here](https://huggingface.co/BAAI/Aquila2-7B/tree/main)
save to ModelLink/model_from_hf/Aquila2-7B/ directory.
4. Weights convert
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# please modify the path to set_env.sh based on your environment.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila2-7B/ \
--save-dir ./model_weights/Aquila2-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--tokenizer-model ./model_from_hf/Aquila2-7B/tokenizer.json
```
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila2-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Aquila2-7B/mg2hg/
```
Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters `target-tensor-parallel-size` and `target-pipeline-parallel-size` according to different tasks.
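For example, assuming a later task needs a different slicing layout, only the two target arguments (and the save-directory name, which is purely illustrative here) change relative to the conversion command above:
```shell
# illustrative: re-slice to 4-way tensor parallel, 2-way pipeline parallel
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --load-dir ./model_from_hf/Aquila2-7B/ \
    --save-dir ./model_weights/Aquila2-7B-v0.1-tp4-pp2/ \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 4 \
    --target-pipeline-parallel-size 2 \
    --tokenizer-model ./model_from_hf/Aquila2-7B/tokenizer.json
```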
5. Pre-training
5.1 Prepare dataset
Download the Aquila2-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Aquila2-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-7B/ \
--output-prefix ./dataset/Aquila2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 Pre-training
Config Aquila2-7B pre-training script: examples/aquila2/pretrain_aquila2_7b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Aquila2-7B/"
DATA_PATH="./dataset/Aquila2-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Aquila2-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
```
- *If you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR, and remove the `--load` parameter from the training script, and vice versa*
- *If you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR, and remove the `--save` parameter from the training script, and vice versa*
- *When you want to save checkpoint and load it in future pre-training, just follow the above "save" and "load" suggestions.*
Launch Aquila2-7B pre-training script: examples/aquila2/pretrain_aquila2_7b_ptd.sh
```shell
bash examples/aquila2/pretrain_aquila2_7b_ptd.sh
```
**Note**: If you use multi-machine training, set up data sharing across the machines so that non-primary nodes can read the primary node's data; alternatively, copy the data generated by the master node directly to the non-master nodes.
6. Fine-tuning
6.1 Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/Aquila2-7B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-7B/ \
--output-prefix ./finetune_dataset/Aquila2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full parameters fine-tuning is basically the same as that for `pretrain_aquila2_7b_ptd.sh`.*The difference is that the dataset and the training parameter `--is-instruction-dataset` are added.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
DATA_PATH="./finetune_dataset/Aquila2-7B/alpaca"
CKPT_LOAD_DIR="./ckpt/Aquila2-7B/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
<h3 id="7b-performance">Performance</h3>
<h4 id="7b-throughput">Machine performance</h4>
The performance of Aquila2-7B in Ascend NPU and reference device:
| Device | Model | Iterations | throughput rate (tokens/p/s) | single iteration step time (s/step) |
| --------- | ---------- | ---------- | ---------------------------- | ----------------------------------- |
| NPU | Aquila2-7B | 5000 | 3323 | 4.93 |
| Reference | Aquila2-7B | 5000 | 2673 | 6.13 |
<h2 id="7b-inference">Inference</h2>
We support MindSpeed Inference for text generation with Aquila 7B model.
Inference is different from pre-training because it requires loading the pre-trained model weights. Therefore, we need to complete the aforementioned model weight conversion task first, then configure the Aquila2-7B Inference shell script `examples/aquila2/generate_aquila2_7b_ptd.sh`. "CKPT_LOAD_DIR" must point to the converted weights directory, and "TOKENIZER_PATH" must point to the directory which contains Aquila vocabulary files -- in our example, it is "./model_from_hf/Aquila2-7B/". In your operation, please fill in correct value based on your actual scenario.
```shell
# please change to actual values
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-7B/"
```
Start Aquila2-7B Inference:
```shell
bash ./examples/aquila2/generate_aquila2_7b_ptd.sh
```
Sample results of Aquila2-7B Inference:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila2/aquila2-7b-generate.png)
<h2 id="7b-evaluation">Evaluation</h2>
We use BoolQ benchmark to evaluate our model. You can [go to the BoolQ Benchmark page](https://github.com/google-research-datasets/boolean-questions) and find the [dataset](https://storage.cloud.google.com/boolq/dev.jsonl), download it and save it. For example, save to "ModelLink/boolq/test" directory
The evaluation task is similar to the inference task; it also requires loading the pre-trained model weights. Please note that the model structure parameters used in converting weights should be consistent with those used in running the evaluation task.
After weight conversion is complete, we configure the Aquila2-7B evaluation script `examples/aquila2/evaluate_aquila2_7b_ptd.sh`. We need to correctly specify the path to load weights, the path to tokenizer and vocab, and so on (the following example is for reference only)
```shell
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-7B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"
```
Start evaluation task
```shell
bash ./examples/aquila2/evaluate_aquila2_7b_ptd.sh
```
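The evaluation script writes its output through `tee`, so the progress and the final score can be followed from the log file while the task runs. The file name below assumes the `tee` target used by the evaluation scripts in this repository and TASK set to "boolq".
```shell
# follow the evaluation log (file name follows the script's tee target)
tail -f logs/eval_aquila2_7b_boolq_ptd.log
```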
Sample Aquila2-7B results running on **Ascend NPU**:
| Task | Model | NPU | Benchmark |
| ---------------------------------------------------------------------- | --------- | ----- | --------- |
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Aquila2-7B | 77.8% | 77.6% |
<h1 id="34b">Aquila2-34B</h1>
<h2 id="34b-training">Training</h2>
Here's a hardware summary of pre-training Aquila2-34B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 16 x Ascend NPUs |
<h3 id="34b-script">Script</h3>
1. Clone the repository to your local server and switch to the ModelLink branch:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build conda environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch, torch_npu and apex
pip install torch-2.2.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# source the set_env.sh file based on your host settings (you may need to change the path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# clone the MindSpeed source code, enter the directory, then install the mindspeed package from source
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
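Before converting weights it is worth confirming that the NPU stack is actually visible from Python; a mismatch between torch and torch_npu versions is a common source of later failures. A minimal check, assuming the packages above installed cleanly:
```shell
# verify that torch_npu is importable and NPU devices are visible
python -c "
import torch
import torch_npu
print(torch.__version__)
print(torch.npu.is_available(), torch.npu.device_count())
"
```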
3. Download the Aquila2-34B model, config, and tokenizer from [here](https://huggingface.co/BAAI/Aquila2-34B/tree/main)
and save them to the ModelLink/model_from_hf/Aquila2-34B/ directory.
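As an alternative to downloading the files manually, the whole repository can be fetched in a single command. This is a sketch that assumes a recent `huggingface_hub` is installed and the host has network access to Hugging Face:
```shell
# optional: fetch the full Aquila2-34B repository in one step (assumes huggingface_hub is installed)
pip install -U huggingface_hub
huggingface-cli download BAAI/Aquila2-34B --local-dir ./model_from_hf/Aquila2-34B/
```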
4. Weights convert
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# please modify the path to set_env.sh based on your environment.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila2-34B/ \
--save-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp2/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 2 \
--tokenizer-model ./model_from_hf/Aquila2-34B/tokenizer.json \
--params-dtype bf16
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila2-34B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Aquila2-34B/mg2hg/
```
Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters `target-tensor-parallel-size` and `target-pipeline-parallel-size` according to different tasks.
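When only the parallel layout needs to change (for example, going from the tp8-pp2 slicing used for training to the tp8-pp1 slicing used below for inference and evaluation), the same converter can be run Megatron-to-Megatron. The following is only a sketch, under the assumption that `--loader megatron --saver megatron` without `--save-model-type` re-partitions the checkpoint in Megatron format; verify against your tools version.
```shell
# sketch: re-slice an existing Megatron checkpoint from tp8/pp2 to tp8/pp1 (assumption, not a verified recipe)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader megatron \
    --saver megatron \
    --load-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp2/ \
    --save-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp1/ \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 1
```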
5. Pre-training
5.1 Prepare dataset
Download the Aquila2-34B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Aquila2-34B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-34B/ \
--output-prefix ./dataset/Aquila2-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
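The preprocessing run above should leave a pair of Megatron-style indexed dataset files under the output prefix; the `DATA_PATH` used in the next step points at them without the extension. A quick check (the file names assume the `--output-prefix` shown above):
```shell
# expect alpaca_text_document.bin and alpaca_text_document.idx
ls -lh ./dataset/Aquila2-34B/
```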
5.2 Pre-training
Config Aquila2-34B pre-training script: examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Aquila2-34B/"
DATA_PATH="./dataset/Aquila2-34B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Aquila2-34B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp2/"
```
- *If you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR and remove the `--load` parameter from the training script, and vice versa.*
- *If you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR and remove the `--save` parameter from the training script, and vice versa.*
- *When you want to save a checkpoint and load it in a future pre-training run, follow the "save" and "load" suggestions above.*
Launch Aquila2-34B pre-training script: examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```shell
bash examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```
**Note**: For multi-machine training, it is necessary to set up shared storage for the data so that non-primary nodes can read the primary node's data. Alternatively, copy the data generated by the primary node to the non-primary nodes directly, as shown below.
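A straightforward way to do that copy is to mirror the processed dataset directory from the primary node to each worker before launching; the host name and destination path below are placeholders:
```shell
# copy the preprocessed dataset from the primary node to a worker node (host/path are placeholders)
rsync -av ./dataset/Aquila2-34B/ worker-node:/path/to/ModelLink/dataset/Aquila2-34B/
```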
6. Fine-tuning
6.1 Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/Aquila2-34B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-34B/ \
--output-prefix ./finetune_dataset/Aquila2-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full parameters fine-tuning is basically the same as that for `pretrain_aquila2_34b_ptd_16p.sh`. *The difference is that the dataset and the training parameter `--is-instruction-dataset` are added.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
DATA_PATH="./finetune_dataset/Aquila2-34B/alpaca"
CKPT_LOAD_DIR="./ckpt/Aquila2-34B/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
<h3 id="34b-performance">Performance</h3>
<h4 id="34b-throughput">Machine performance</h4>
The performance of Aquila2-34B on Ascend NPUs and the reference device:
| Device | Model | Iterations | throughput rate (tokens/p/s) | single iteration step time (s/step) |
| --------- | ----------- | ---------- | ---------------------------- | ----------------------------------- |
| NPU | Aquila2-34B | 5000 | 854 | 307 |
| Reference | Aquila2-34B | 5000 | 732 | 358 |
<h2 id="34b-inference">Inference</h2>
We support MindSpeed Inference for text generation with the Aquila2-34B model.
Inference is different from pre-training in that it requires loading the pre-trained model weights. Therefore, complete the model weight conversion task described above first, then configure the Aquila2-34B inference shell script `examples/aquila2/generate_aquila2_34b_ptd.sh`. "CKPT_LOAD_DIR" must point to the converted weights directory, and "TOKENIZER_PATH" must point to the directory containing the Aquila vocabulary files -- in our example, "./model_from_hf/Aquila2-34B/". Please fill in the correct values for your actual scenario.
```shell
# please change to actual values
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-34B/"
```
Start Aquila2-34B Inference:
```shell
bash ./examples/aquila2/generate_aquila2_34b_ptd.sh
```
Sample results of Aquila2-34B Inference:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila2/aquila2-34b-generate.png)
<h2 id="34b-evaluation">Evaluation</h2>
We use the BoolQ benchmark to evaluate the model. Go to the [BoolQ Benchmark page](https://github.com/google-research-datasets/boolean-questions), find the [dataset](https://storage.cloud.google.com/boolq/dev.jsonl), download it and save it, for example to the "ModelLink/boolq/test" directory.
The evaluation task is similar to the inference task in that it also requires loading the pre-trained model weights. Please note that the model structure parameters used when converting the weights should be consistent with those used when running the evaluation task.
After weight conversion is complete, configure the Aquila2-34B evaluation script `examples/aquila2/evaluate_aquila2_34b_ptd.sh`. We need to correctly specify the path to the weights, the path to the tokenizer and vocab, and so on (the following example is for reference only):
```shell
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-34B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"
```
Start the evaluation task:
```shell
bash ./examples/aquila2/evaluate_aquila2_34b_ptd.sh
```
Sample Aquila2-34B results running on **Ascend NPU**:
| Task | Model | NPU | Benchmark |
| ---------------------------------------------------------------------- | --------- | ----- | --------- |
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Aquila2-34B | 88.0% | 87.0% |

View File

@ -1,66 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
EVAL_DATA_PATH="your eval data dir"
TASK="your task name"
# Change for multinode config
TP=8
PP=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--attention-softmax-in-fp32 \
--bf16 \
--disable-bias-linear \
--exit-on-missing-checkpoint \
--ffn-hidden-size 24576 \
--group-query-attention \
--hidden-size 6144 \
--load $CKPT_LOAD_DIR \
--make-vocab-size-divisible-by 1 \
--max-new-tokens 1 \
--max-position-embeddings 4096 \
--micro-batch-size 1 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 48 \
--num-layers 60 \
--num-query-groups 8 \
--pipeline-model-parallel-size $PP \
--position-embedding-type rope \
--seq-length 4096 \
--swiglu \
--task $TASK \
--task-data-path $EVAL_DATA_PATH \
--tensor-model-parallel-size $TP \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--tokenizer-type PretrainedFromHF \
--untie-embeddings-and-output-weights \
--use-fused-rmsnorm \
| tee logs/eval_aquila2_34b_${TASK}_ptd.log

View File

@ -1,64 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
EVAL_DATA_PATH="your eval data dir"
TASK="your task name"
# Change for multinode config
TP=8
PP=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--attention-softmax-in-fp32 \
--disable-bias-linear \
--exit-on-missing-checkpoint \
--ffn-hidden-size 11008 \
--fp16 \
--hidden-size 4096 \
--load $CKPT_LOAD_DIR \
--make-vocab-size-divisible-by 1 \
--max-new-tokens 1 \
--max-position-embeddings 2048 \
--micro-batch-size 1 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 32 \
--num-layers 32 \
--pipeline-model-parallel-size ${PP} \
--position-embedding-type rope \
--seq-length 2048 \
--swiglu \
--task $TASK \
--task-data-path $EVAL_DATA_PATH \
--tensor-model-parallel-size ${TP} \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--tokenizer-type PretrainedFromHF \
--untie-embeddings-and-output-weights \
--use-fused-rmsnorm \
| tee logs/eval_aquila2_7b_${TASK}_ptd.log

View File

@ -1,61 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
# Change for multinode config
TP=8
PP=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS inference.py \
--attention-softmax-in-fp32 \
--bf16 \
--disable-bias-linear \
--exit-on-missing-checkpoint \
--ffn-hidden-size 24576 \
--group-query-attention \
--hidden-size 6144 \
--load $CKPT_LOAD_DIR \
--make-vocab-size-divisible-by 1 \
--max-new-tokens 512 \
--max-position-embeddings 4096 \
--micro-batch-size 1 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 48 \
--num-layers 60 \
--num-query-groups 8 \
--pipeline-model-parallel-size $PP \
--position-embedding-type rope \
--seq-length 4096 \
--swiglu \
--tensor-model-parallel-size $TP \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--tokenizer-type PretrainedFromHF \
--untie-embeddings-and-output-weights \
--use-fused-rmsnorm \
| tee logs/generate_aquila2_34b_ptd.log

View File

@ -1,58 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
# Change for multinode config
TP=8
PP=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS inference.py \
--attention-softmax-in-fp32 \
--disable-bias-linear \
--exit-on-missing-checkpoint \
--ffn-hidden-size 11008 \
--hidden-size 4096 \
--load $CKPT_LOAD_DIR \
--make-vocab-size-divisible-by 1 \
--max-new-tokens 512 \
--max-position-embeddings 2048 \
--micro-batch-size 1 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 32 \
--num-layers 32 \
--pipeline-model-parallel-size ${PP} \
--position-embedding-type rope \
--seq-length 2048 \
--swiglu \
--tensor-model-parallel-size ${TP} \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--tokenizer-type PretrainedFromHF \
--untie-embeddings-and-output-weights \
--use-fused-rmsnorm \
| tee logs/generate_aquila2_7b_ptd.log

View File

@ -1,96 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_SAVE_DIR="your checkpoint save dir"
DATA_PATH="your training data dir"
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
# Change for multinode config
TP=8
PP=2
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=2
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--attention-dropout 0.0 \
--attention-softmax-in-fp32 \
--bf16 \
--clip-grad 1.0 \
--disable-bias-linear \
--ffn-hidden-size 24576 \
--global-batch-size 1024 \
--group-query-attention \
--hidden-dropout 0.0 \
--hidden-size 6144 \
--init-method-std 0.01 \
--initial-loss-scale 524288.0 \
--lr 8.0e-8 \
--lr-decay-style cosine \
--lr-warmup-fraction 0.01 \
--make-vocab-size-divisible-by 1 \
--max-position-embeddings 4096 \
--micro-batch-size 2 \
--min-lr 1.0e-8 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 48 \
--num-layers 60 \
--num-query-groups 8 \
--pipeline-model-parallel-size ${PP} \
--position-embedding-type rope \
--seq-length 4096 \
--sequence-parallel \
--swiglu \
--tensor-model-parallel-size ${TP} \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-type PretrainedFromHF \
--train-iters 2000 \
--untie-embeddings-and-output-weights \
--use-flash-attn \
--use-fused-rmsnorm \
--use-mc2 \
--weight-decay 1e-2 \
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 1000 \
--eval-interval 1000 \
--eval-iters 0
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load $CKPT_LOAD_DIR \
| tee logs/train_aquila2_34b_ptd.log

View File

@ -1,94 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_SAVE_DIR="your checkpoint save dir"
DATA_PATH="your training data dir"
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
# Change for multinode config
TP=8
PP=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--attention-dropout 0.0 \
--attention-softmax-in-fp32 \
--clip-grad 1.0 \
--disable-bias-linear \
--ffn-hidden-size 11008 \
--fp16 \
--global-batch-size 64 \
--hidden-dropout 0.0 \
--hidden-size 4096 \
--init-method-std 0.01 \
--initial-loss-scale 65536 \
--lr 1.0e-7 \
--lr-decay-style cosine \
--lr-warmup-fraction 0.01 \
--make-vocab-size-divisible-by 1 \
--max-position-embeddings 2048 \
--micro-batch-size 8 \
--min-lr 1.0e-8 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 32 \
--num-layers 32 \
--pipeline-model-parallel-size ${PP} \
--position-embedding-type rope \
--seq-length 2048 \
--sequence-parallel \
--swiglu \
--tensor-model-parallel-size ${TP} \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-type PretrainedFromHF \
--train-iters 2000 \
--untie-embeddings-and-output-weights \
--use-flash-attn \
--use-fused-rmsnorm \
--use-mc2 \
--weight-decay 1e-1 \
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 1000 \
--eval-interval 1000 \
--eval-iters 0
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load $CKPT_LOAD_DIR \
| tee logs/train_aquila2_7b_ptd.log

View File

@ -1,4 +1,4 @@
# BaiChuan
# BaiChuan $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
@ -22,6 +22,7 @@
- [Script](#脚本)
- [Performance](#性能)
- [Throughput](#吞吐)
- [LoRA fine-tuning](#Lora微调)
- [Inference](#推理)
- [Evaluation](#评估)
@ -37,146 +38,148 @@ Baichuan-7B 训练的硬件配置如下:
### Script
1. Clone the repository to your local server
1. Clone the repository to your local server
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build the environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
# 安装其余依赖库
pip install -r requirements.txt
```
3. (Optional) Prepare the pretrained weights
Download the pretrained weights from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main):
Download the pretrained weights from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main):
```shell
mkdir ./model_from_hf/Baichuan-7B/
cd ./model_from_hf/Baichuan-7B/
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/Baichuan-7B/
cd ./model_from_hf/Baichuan-7B/
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ../../
```
4. Weight conversion
Convert the model weights from HuggingFace format to Megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
Convert the model weights from HuggingFace format to Megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Baichuan-7B/ \
--save-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-7B/tokenizer.model \
--w-pack True
```
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Baichuan-7B/ \
--save-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-7B/tokenizer.model \
--w-pack True
```
Convert Megatron weights with any parallel slicing strategy to HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Convert Megatron weights with any parallel slicing strategy to HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan-7B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan-7B/mg2hg/
```
5. Prepare the dataset
Download the BaiChuan-7B dataset from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):
Download the BaiChuan-7B dataset from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):
```shell
# 下载数据集
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# 下载数据集
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Baichuan-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-7B/ \
--output-prefix ./dataset/Baichuan-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# 处理数据
mkdir ./dataset/Baichuan-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-7B/ \
--output-prefix ./dataset/Baichuan-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Configure the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Baichuan-7B/"
DATA_PATH="./dataset/Baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
```
CKPT_SAVE_DIR="./ckpt/Baichuan-7B/"
DATA_PATH="./dataset/Baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
```
7. Launch the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh
```shell
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: If you use multi-machine training without shared data storage, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed parameters are used to decide whether non-primary nodes need to load data, and the corresponding cache is checked and data generated.
```shell
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: If you use multi-machine training without shared data storage, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed parameters are used to decide whether non-primary nodes need to load data, and the corresponding cache is checked and data generated.
### Performance
@ -193,7 +196,7 @@ Baichuan-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
## Inference
First, configure the Baichuan-7B inference script: examples/baichuan/generate_baichuan_7b_ptd.sh
First, configure the Baichuan-7B inference script: tasks/inference/generate_baichuan_7b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
@ -207,12 +210,12 @@ TOKENIZER_PATH="./model_from_hf/Baichuan-7B/"
Then launch generate_baichuan_7b_ptd.sh directly
```bash
bash examples/baichuan/generate_baichuan_7b_ptd.sh
bash tasks/inference/generate_baichuan_7b_ptd.sh
```
An inference example is shown below:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan/baichuan_7B_inference.png)
![Inference](../../sources/images/baichuan/baichuan_7B_inference.png)
## Evaluation
@ -228,7 +231,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan/evaluate_baichuan_7B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan_7B_ptd.sh
```
<table>
@ -266,152 +269,154 @@ Baichuan-13B 训练的硬件配置如下:
1. Clone the repository to your local server
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build the environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
# 安装其余依赖库
pip install -r requirements.txt
```
```
**Note:** If the error `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'` appears while running the later tasks, run the following command to fix it:
**Note:** If the error `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'` appears while running the later tasks, run the following command to fix it:
```shell
pip install transformers==4.32.0 --force
```
```shell
pip install transformers==4.32.0 --force
```
3. (Optional) Prepare the pretrained weights
Download the pretrained weights from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main)
Download the pretrained weights from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main)
```shell
mkdir ./model_from_hf/Baichuan-13B/
cd ./model_from_hf/Baichuan-13B/
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ../../
```
```shell
mkdir ./model_from_hf/Baichuan-13B/
cd ./model_from_hf/Baichuan-13B/
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ../../
```
4. Weight conversion
Convert the BaiChuan-13B model weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
Convert the BaiChuan-13B model weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan-13B/ \
--save-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan-13B/ \
--save-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
Convert Megatron weights with any parallel slicing strategy to HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Convert Megatron weights with any parallel slicing strategy to HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-13B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan-13B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-13B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan-13B/mg2hg/
```
5. Prepare the dataset
Download the Baichuan-13B [dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
Download the Baichuan-13B [dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
mkdir ./dataset/Baichuan-13B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-13B/ \
--output-prefix ./dataset/Baichuan-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
mkdir ./dataset/Baichuan-13B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-13B/ \
--output-prefix ./dataset/Baichuan-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Configure the Baichuan-13B training script (Baichuan-13B does not yet support Flash Attention): examples/baichuan/pretrain_baichuan_ptd_13B.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Baichuan-13B/"
DATA_PATH="./dataset/Baichuan-13B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-13B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
```
CKPT_SAVE_DIR="./ckpt/Baichuan-13B/"
DATA_PATH="./dataset/Baichuan-13B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-13B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
```
7. Launch the Baichuan-13B training script: examples/baichuan/pretrain_baichuan_ptd_13B.sh
```bash
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
**Note**: If you use multi-machine training without shared data storage, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed parameters are used to decide whether non-primary nodes need to load data, and the corresponding cache is checked and data generated.
```bash
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
**Note**: If you use multi-machine training without shared data storage, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed parameters are used to decide whether non-primary nodes need to load data, and the corresponding cache is checked and data generated.
### Performance
@ -429,7 +434,7 @@ Baichuan-13B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
## Inference
Configure the baichuan-13B inference script: examples/baichuan/generate_baichuan_13b_ptd.sh
Configure the baichuan-13B inference script: tasks/inference/generate_baichuan_13b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
@ -443,11 +448,11 @@ TOKENIZER_PATH="./model_from_hf/Baichuan-13B/"
Then launch generate_baichuan_13b_ptd.sh directly
```bash
bash examples/baichuan/generate_baichuan_13b_ptd.sh
bash tasks/inference/generate_baichuan_13b_ptd.sh
```
An inference example is shown below:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan/baichuan_13B_inference.png)
![Inference](../../sources/images/baichuan/baichuan_13B_inference.png)
## Evaluation
@ -463,7 +468,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan/evaluate_baichuan_13B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan_13B_ptd.sh
```
<table>

View File

@ -1,4 +1,4 @@
# BaiChuan
# BaiChuan $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
@ -40,143 +40,146 @@ Here's a hardware summary of pre-training Baichuan-7B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the Baichuan-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main)
Download the Baichuan-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main)
```shell
mkdir ./model_from_hf/Baichuan-7B/
cd ./model_from_hf/Baichuan-7B/
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/Baichuan-7B/
cd ./model_from_hf/Baichuan-7B/
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ../../
```
4. Weights convert
In order to adapt to the Baichuan-7B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
In order to adapt to the Baichuan-7B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Baichuan-7B/ \
--save-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-7B/tokenizer.model \
--w-pack True
```
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Baichuan-7B/ \
--save-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-7B/tokenizer.model \
--w-pack True
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan-7B/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan-7B/mg2hg/
```
5. Prepare dataset
Download the Baichuan-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
Download the Baichuan-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Baichuan-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-7B/ \
--output-prefix ./dataset/Baichuan-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# process datasets
mkdir ./dataset/Baichuan-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-7B/ \
--output-prefix ./dataset/Baichuan-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Config Baichuan-7B pre-training script : examples/baichuan/pretrain_baichuan_ptd_7B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Baichuan-7B/"
DATA_PATH="./dataset/Baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
```
CKPT_SAVE_DIR="./ckpt/Baichuan-7B/"
DATA_PATH="./dataset/Baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
```
7. Launch Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh
7. Launch Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh
```shell
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: If using multi-machine training with no data sharing configuration on the machines, it is necessary to add the parameter `--no-shared-storage`. This parameter determines, based on the distributed parameters, whether non-master nodes need to load data, and checks the corresponding cache and generated data.
```shell
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: If using multi-machine training with no data sharing configuration on the machines, it is necessary to add the parameter `--no-shared-storage`. This parameter determines, based on the distributed parameters, whether non-master nodes need to load data, and checks the corresponding cache and generated data.
### Performance
@ -193,7 +196,7 @@ The performance of Baichuan-7B in **Ascend NPU** and **Reference**:
## Inference
Config Baichuan-7B inference script: examples/baichuan/generate_baichuan_7b_ptd.sh
Config Baichuan-7B inference script: tasks/inference/generate_baichuan_7b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
@ -204,15 +207,15 @@ CHECKPOINT="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan-7B/"
```
Launch Baichuan-7B inference script: examples/baichuan/generate_baichuan_7b_ptd.sh
Launch Baichuan-7B inference script: tasks/inference/generate_baichuan_7b_ptd.sh
```bash
bash examples/baichuan/generate_baichuan_7b_ptd.sh
bash tasks/inference/generate_baichuan_7b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan/baichuan_7B_inference.png)
![Inference](../../sources/images/baichuan/baichuan_7B_inference.png)
## Evaluation
@ -228,7 +231,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan/evaluate_baichuan_7B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan_7B_ptd.sh
```
<table>
@ -268,154 +271,156 @@ Here's a hardware summary of pre-training Baichuan-13B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
#install Mindspeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
#install Mindspeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
**Note:** If the error message `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'` is displayed during script execution, run the following command to rectify the error:
**Note:** If the error message `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'` is displayed during script execution, run the following command to rectify the error:
```shell
pip install transformers==4.32.0 --force
```
```shell
pip install transformers==4.32.0 --force
```
3. Prepare pretrained weights
Download the Baichuan-13B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main)
```shell
mkdir ./model_from_hf/Baichuan-13B/
cd ./model_from_hf/Baichuan-13B/
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ../../
```
Download the Baichuan-13B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main)
```shell
mkdir ./model_from_hf/Baichuan-13B/
cd ./model_from_hf/Baichuan-13B/
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ../../
```
4. Weights convert
In order to adapt to the baichuan-13B model, the following script is used to convert the model pre-training weights.
In order to adapt to the baichuan-13B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
mkdir baichuan-13B-mt

# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./model_from_hf/Baichuan-13B/ \
    --save-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/Baichuan-13B/tokenizer.model \
    --params-dtype bf16 \
    --w-pack True
```
```shell
mkdir baichuan-13B-mt

# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./model_from_hf/Baichuan-13B/ \
    --save-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/Baichuan-13B/tokenizer.model \
    --params-dtype bf16 \
    --w-pack True
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-13B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan-13B/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-13B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan-13B/mg2hg/
```
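To confirm that the converted checkpoint is readable as a HuggingFace model, a lightweight config load such as the sketch below can be used; the output path is an assumption based on the comment in the conversion command above.
```shell
# minimal sanity check of the converted HuggingFace-format weights
# (assumes the conversion above wrote to ./model_from_hf/Baichuan-13B/mg2hg/)
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model_from_hf/Baichuan-13B/mg2hg/', trust_remote_code=True))"
```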
5. Prepare dataset
Download the Baichuan-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd ./dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
mkdir ./dataset/Baichuan-13B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-13B/ \
--output-prefix ./dataset/Baichuan-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
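If preprocessing succeeds, the prefix passed to `--output-prefix` should yield the `.bin`/`.idx` pair that the later `DATA_PATH` setting points to; the listing below is a sketch of the expected files, assuming the dataset's default `text` field is used.
```shell
# expected outputs, matching DATA_PATH="./dataset/Baichuan-13B/alpaca_text_document" below
ls ./dataset/Baichuan-13B/
# alpaca_text_document.bin  alpaca_text_document.idx
```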
6. Config Baichuan-13B pre-training script (Baichuan-13B does not support Flash Attention): examples/baichuan/pretrain_baichuan_ptd_13B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Baichuan-13B/"
DATA_PATH="./dataset/Baichuan-13B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-13B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
```
7. Launch Baichuan-13B pre-training script: examples/baichuan/pretrain_baichuan_ptd_13B.sh
```bash
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
**Note**: If using multi-machine training without shared storage configured across the machines, add the parameter `--no-shared-storage`. With this parameter set, non-master nodes decide from the distributed parameters whether they need to load data, and check the corresponding cache and generate the data.
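For example, the flag can be appended to the training arguments in the pre-training script; the `GPT_ARGS` variable name below is taken from the shipped scripts, and the exact insertion point is only a sketch.
```shell
# sketch: extend the argument list in examples/baichuan/pretrain_baichuan_ptd_13B.sh
GPT_ARGS="$GPT_ARGS --no-shared-storage"
```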
### Performance
@ -431,9 +436,7 @@ The performance of the Baichuan-13B in **Ascend NPU** and **Reference**:
## Inference
Config baichuan-13B inference script: examples/baichuan/generate_baichuan_13b_ptd.sh
Config baichuan-13B inference script: tasks/inference/generate_baichuan_13b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -442,15 +445,13 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
CHECKPOINT="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan-13B/"
```
Launch baichuan-13B inference script: examples/baichuan/generate_baichuan_13b_ptd.sh
Launch baichuan-13B inference script: tasks/inference/generate_baichuan_13b_ptd.sh
```bash
bash examples/baichuan/generate_baichuan_13b_ptd.sh
bash tasks/inference/generate_baichuan_13b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan/baichuan_13B_inference.png)
![Inference](../../sources/images/baichuan/baichuan_13B_inference.png)
## Evaluation
@ -466,7 +467,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan/evaluate_baichuan_13B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan_13B_ptd.sh
```
<table>
View File
@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -83,6 +84,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_baichuan_13b.log
View File
@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -86,6 +87,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_baichuan_7b.log
View File
@ -1,4 +1,4 @@
# BaiChuan2
# BaiChuan2 $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
@ -35,149 +35,151 @@ Baichuan2-7B 训练的硬件配置如下:
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. (可选)准备预训练权重
从 [huggingface](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/tree/main) 下载预训练权重:
```shell
mkdir ./model_from_hf/Baichuan2-7B/
cd ./model_from_hf/Baichuan2-7B/
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/generation_utils.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00001-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00002-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer_config.json
cd ../../
```
4. 权重转换
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-7B/ \
--save-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-7B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-7B/ \
--save-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-7B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan2-7B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan2-7B/mg2hg/
```
5. 准备数据集
从 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet) 下载 Baichuan2-7B-Base 的数据集:
```shell
# 下载数据集
cd ./dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 准备数据集
mkdir ./dataset/Baichuan2-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan2-7B/ \
--output-prefix ./dataset/Baichuan2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. 配置 Baichuan2-7B 预训练脚本: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改数据集,权重,词表等路径
CKPT_SAVE_DIR="./ckpt/Baichuan2-7B/"
DATA_PATH="./dataset/Baichuan2-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan2-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
```
7. 启动 Baichuan2-7B 预训练脚本: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```shell
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
### 性能
@ -194,7 +196,7 @@ Baichuan2-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
## 推理
首先需要配置baichuan2-7B的推理脚本: examples/baichuan2/generate_baichuan2_7b_ptd.sh
首先需要配置baichuan2-7B的推理脚本: tasks/inference/generate_baichuan2_7b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
@ -208,12 +210,11 @@ TOKENIZER_PATH="./model_from_hf/Baichuan2-7B/"
然后可直接启动generate_baichuan2_7b_ptd.sh
```bash
bash examples/baichuan2/generate_baichuan2_7b_ptd.sh
bash tasks/inference/generate_baichuan2_7b_ptd.sh
```
推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan2/baichuan2_7B_inference.png)
![Inference](../../sources/images/baichuan2/baichuan2_7B_inference.png)
## 评估
@ -229,7 +230,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan2/evaluate_baichuan2_7B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan2_7B_ptd.sh
```
<table>
@ -265,148 +266,150 @@ Baichuan2-13B 训练的硬件配置如下:
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. (可选的)准备预训练权重
从 [huggingface](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/tree/main) 下载预训练权重
```shell
mkdir ./model_from_hf/Baichuan2-13B/
cd ./model_from_hf/Baichuan2-13B/
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/generation_utils.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer.model
cd ../../
```
4. 权重转换
将 BaiChuan2-13B 模型权重从 huggingface 格式转换为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-13B/ \
--save-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-13B/ \
--save-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-13B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan2-13B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-13B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan2-13B/mg2hg/
```
5. 准备数据集
下载 Baichuan2-13B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
mkdir ./dataset/Baichuan2-13B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan2-13B/ \
--output-prefix ./dataset/Baichuan2-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. 配置 Baichuan2-13B 训练脚本: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改词表,数据集, 权重等路径
CKPT_SAVE_DIR="./ckpt/Baichuan2-13B/"
DATA_PATH="./dataset/Baichuan2-13B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan2-13B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
```
7. 启动 Baichuan2-13B 训练脚本: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```bash
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
### 性能
@ -415,15 +418,15 @@ Baichuan2-13B 训练的硬件配置如下:
Baichuan2-13B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) |
|:----:|:-------------------------:|:----:|:----------------:|:--------------------:|:---------------:|
| NPUs | Baichuan2-13B | 1000 | - | 1668 | - |
| 参考 | Baichuan2-13B | - | - | 2062 | - |
|:----:|:-------------------------:|:----:|:------------------:|:--------------------:|:---------------:|
| NPUs | Baichuan2-13B | 1000 |1.83| 1310 | 4.35 |
| 参考 | Baichuan2-13B | - | - | 872 |- |
## 推理
首先需要配置baichuan2-13B的推理脚本: examples/baichuan2/generate_baichuan2_13b_ptd.sh
首先需要配置baichuan2-13B的推理脚本: tasks/inference/generate_baichuan2_13b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
@ -437,12 +440,11 @@ TOKENIZER_PATH="./model_from_hf/Baichuan2-13B/"
然后可直接启动generate_baichuan2_13b_ptd.sh
```bash
bash examples/baichuan2/generate_baichuan2_13b_ptd.sh
bash tasks/inference/generate_baichuan2_13b_ptd.sh
```
推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan2/baichuan2_13B_inference.png)
![Inference](../../sources/images/baichuan2/baichuan2_13B_inference.png)
## 评估
@ -458,7 +460,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan2/evaluate_baichuan2_13B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan2_13B_ptd.sh
```
<table>
View File
@ -1,4 +1,4 @@
# BaiChuan2
# BaiChuan2 $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
@ -37,146 +37,148 @@ Here's a hardware summary of pre-training Baichuan2-7B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the Baichuan2-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/tree/main)
```shell
mkdir ./model_from_hf/Baichuan2-7B/
cd ./model_from_hf/Baichuan2-7B/
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/generation_utils.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00001-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00002-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer_config.json
cd ../../
```
4. Weight conversion
To adapt the weights to the Baichuan2-7B model, use the following script to convert the pretrained model weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./model_from_hf/Baichuan2-7B/ \
    --save-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/Baichuan2-7B/tokenizer.model \
    --params-dtype bf16 \
    --w-pack True
```
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./model_from_hf/Baichuan2-7B/ \
    --save-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/Baichuan2-7B/tokenizer.model \
    --params-dtype bf16 \
    --w-pack True
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan2-7B/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan2-7B/mg2hg/
```
5. Prepare dataset
Download the Baichuan2-7B-Base datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Baichuan2-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan2-7B/ \
--output-prefix ./dataset/Baichuan2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Config Baichuan2-7B pre-training script : examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the script's dataset path according to your own dataset path
CKPT_SAVE_DIR="./ckpt/Baichuan2-7B/"
DATA_PATH="./dataset/Baichuan2-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan2-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
```
7. Launch Baichuan2-7B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```shell
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```
**Note**: If using multi-machine training without shared storage configured across the machines, add the parameter `--no-shared-storage`. With this parameter set, non-master nodes decide from the distributed parameters whether they need to load data, and check the corresponding cache and generate the data.
### Performance
@ -192,9 +194,7 @@ The performance of Baichuan2-7B in **Ascend NPU** and **Reference**:
## Inference
Config baichuan2-7B inference script: examples/baichuan2/generate_baichuan2_7b_ptd.sh
Config baichuan2-7B inference script: tasks/inference/generate_baichuan2_7b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -203,15 +203,13 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
CHECKPOINT="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan2-7B/"
```
Launch baichuan2-7B inference script: examples/baichuan2/generate_baichuan2_7b_ptd.sh
Launch baichuan2-7B inference script: tasks/inference/generate_baichuan2_7b_ptd.sh
```bash
bash examples/baichuan2/generate_baichuan2_7b_ptd.sh
bash tasks/inference/generate_baichuan2_7b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan2/baichuan2_7B_inference.png)
![Inference](../../sources/images/baichuan2/baichuan2_7B_inference.png)
## Evaluation
@ -227,7 +225,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan2/evaluate_baichuan2_13B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan2_13B_ptd.sh
```
<table>
@ -267,145 +265,147 @@ Here's a hardware summary of pre-training Baichuan2-13B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the Baichuan2-13B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/tree/main)
```shell
mkdir ./model_from_hf/Baichuan2-13B/
cd ./model_from_hf/Baichuan2-13B/
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/generation_utils.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer.model
cd ../../
```
4. Weight conversion
To adapt the weights to the Baichuan2-13B model, use the following script to convert the pretrained model weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-13B/ \
--save-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-13B/ \
--save-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-13B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan2-13B/mg2hg/
```
```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-13B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan2-13B/mg2hg/
```
5. Prepare dataset
Download the Baichuan2-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
mkdir ./dataset/Baichuan2-13B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan2-13B/ \
--output-prefix ./dataset/Baichuan2-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Config Baichuan2-13B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the script's dataset path according to your own dataset path
CKPT_SAVE_DIR="./ckpt/Baichuan2-13B/"
DATA_PATH="./dataset/Baichuan2-13B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan2-13B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
```
7. Launch Baichuan2-13B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```bash
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```
**Note**: If using multi-machine training without shared storage configured across the machines, add the parameter `--no-shared-storage`. With this parameter set, non-master nodes decide from the distributed parameters whether they need to load data, and check the corresponding cache and generate the data.
### Performance
@ -414,16 +414,14 @@ Here's a hardware summary of pre-training Baichuan2-13B:
The performance of the Baichuan2-13B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
|:----:|:-------------------------:|:----:|:-----------------------------:|:----------------------------:|:-------------------------:|
| NPUs | Baichuan2-13B |1000 | - | 1668 | - |
| Reference | Baichuan2-13B |-| - | 2062 | - |
|:----:|:-------------------------:|:----:|:------------------:|:----------------------------:|:---------------:|
| NPUs | Baichuan2-13B |1000 |1.83| 1310 | 4.35 |
| Reference | Baichuan2-13B |-|-| 872 |- |
## Inference
Config baichuan2-13B inference script: examples/baichuan2/generate_baichuan2_13b_ptd.sh
Config baichuan2-13B inference script: tasks/inference/generate_baichuan2_13b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -432,15 +430,13 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
CHECKPOINT="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan2-13B/"
```
Launch baichuan2-13B inference script: examples/baichuan2/generate_baichuan2_13b_ptd.sh
Launch baichuan2-13B inference script: tasks/inference/generate_baichuan2_13b_ptd.sh
```bash
bash examples/baichuan2/generate_baichuan2_13b_ptd.sh
bash tasks/inference/generate_baichuan2_13b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan2/baichuan2_13B_inference.png)
![Inference](../../sources/images/baichuan2/baichuan2_13B_inference.png)
## Evaluation
@ -456,7 +452,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan2/evaluate_baichuan2_13B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan2_13B_ptd.sh
```
<table>
View File
@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -39,8 +40,8 @@ GPT_ARGS="
--seq-length 4096 \
--disable-bias-linear \
--max-position-embeddings 4096 \
--micro-batch-size 2 \
--global-batch-size 128 \
--micro-batch-size 1 \
--global-batch-size 8 \
--untie-embeddings-and-output-weights \
--no-gradient-accumulation-fusion \
--make-vocab-size-divisible-by 32 \
@ -56,8 +57,6 @@ GPT_ARGS="
--normalization RMSNorm \
--use-fused-rmsnorm \
--use-flash-attn \
--use-fused-swiglu \
--use-mc2 \
--swiglu \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
@ -73,7 +72,7 @@ GPT_ARGS="
--adam-eps 1.0e-8 \
--no-load-optim \
--no-load-rng \
--bf16
--fp16
"
DATA_ARGS="
@ -93,6 +92,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_baichuan2_13b.log
View File
@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -85,6 +86,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_baichuan2_7b.log
View File
@ -1,4 +1,4 @@
# Bloom
# Bloom $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
@ -6,6 +6,8 @@
</p>
</p>
[toc]
# Bloom-7B
## 训练
@ -18,138 +20,140 @@ Bloom-7B 训练的硬件配置如下:
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp37-cp37m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 准备预训练权重
首先下载 Bloom-7B 的 [权重](https://huggingface.co/bigscience/bloom-7b1/tree/main)
```shell
mkdir ./model_from_hf/Bloom-7B/
cd ./model_from_hf/Bloom-7B/
cd tokenizer
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
4. 权重转换
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Bloom-7B/ \
--save-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--tokenizer-model None
```
```shell
python tools/checkpoint/util.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Bloom-7B/ \
--save-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--tokenizer-model None
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--save-dir ./model_from_hf/Bloom-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Bloom-7B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--save-dir ./model_from_hf/Bloom-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Bloom-7B/mg2hg/
```
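转换完成后,可以用下面的示意脚本快速验证 `./model_from_hf/Bloom-7B/mg2hg/` 中导出的 HuggingFace 权重能否正常加载假设环境中已安装 transformers分词器仍从原始 HF 目录读取;权重较大,加载可能耗时较长):

```bash
# 示意:验证 mg2hg 目录下导出的 HuggingFace 权重可以被 transformers 加载
python3 - <<'EOF'
from transformers import AutoModelForCausalLM, AutoTokenizer

# 分词器使用原始 HF 目录,模型权重来自转换生成的 mg2hg 子目录
tokenizer = AutoTokenizer.from_pretrained("./model_from_hf/Bloom-7B/")
model = AutoModelForCausalLM.from_pretrained("./model_from_hf/Bloom-7B/mg2hg/", torch_dtype="auto")
print("loaded:", model.config.model_type, "| vocab_size =", model.config.vocab_size)
EOF
```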
5. 准备数据集
下载 Bloom 7B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
下载 Bloom 7B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# 下载数据
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Bloom-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-7B/ \
--output-prefix ./dataset/Bloom-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# 处理数据
mkdir ./dataset/Bloom-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-7B/ \
--output-prefix ./dataset/Bloom-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
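预处理完成后,`--output-prefix` 对应的前缀下会生成二进制数据文件与索引文件(通常为 `*_text_document.bin` 和 `*_text_document.idx`,具体名称以实际输出为准),训练脚本中的 `DATA_PATH` 填写的是去掉扩展名的前缀,示意如下:

```bash
# 示意:检查预处理产物,并与训练脚本中的 DATA_PATH 对应起来
ls -lh ./dataset/Bloom-7B/
# 预期输出(假设使用默认的 text 字段)类似:
#   alpaca_text_document.bin
#   alpaca_text_document.idx

# 训练脚本中引用时不带 .bin/.idx 扩展名
DATA_PATH="./dataset/Bloom-7B/alpaca_text_document"
```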
6. 配置 Bloom-7B 预训练脚本(Bloom-7B暂不支持Flash Attention): examples/bloom/pretrain_bloom_ptd_7B.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Bloom-7B/"
DATA_PATH="./dataset/Bloom-7B/alpaca_text_document"
TOKENIZER_PATH="./model_from_hf/Bloom-7B/"
CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
```
CKPT_SAVE_DIR="./ckpt/Bloom-7B/"
DATA_PATH="./dataset/Bloom-7B/alpaca_text_document"
TOKENIZER_PATH="./model_from_hf/Bloom-7B/"
CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
```
7. 启动 Bloom-7B 预训练脚本: examples/bloom/pretrain_bloom_ptd_7B.sh
```shell
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
```shell
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
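以 7B 训练脚本为例,追加该参数后的启动命令大致如下(示意,其余参数保持脚本原样即可):

```bash
# 示意:多机且无共享存储时,在启动命令中追加 --no-shared-storage
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --no-shared-storage \
    --save $CKPT_SAVE_DIR \
    | tee logs/train_bloom_7b.log
```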
### 性能
@ -165,9 +169,7 @@ Bloom-7B
## Bloom-7B推理
首先配置Bloom-7B 推理脚本: examples/bloom/generate_bloom_ptd_7B.sh
首先配置Bloom-7B 推理脚本: tasks/inference/generate_bloom_ptd_7B.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -180,16 +182,16 @@ TOKENIZER_PATH="./model_from_hf/Bloom-7B-Base/"
然后可直接启动generate_bloom_7b_ptd.sh
```bash
bash examples/bloom/generate_bloom_7b_ptd.sh
bash tasks/inference/generate_bloom_7b_ptd.sh
```
推理示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/bloom/bloom7b-generate.png)
![Inference](../../sources/images/bloom/bloom7b-generate.png)
## Bloom-7B评测
配置Bloom-7B 评估脚本: examples/bloom/evaluate_bloom_7b_ptd.sh
配置Bloom-7B 评估脚本: tasks/evaluation/evaluate_bloom_7b_ptd.sh
```bash
# ascend-toolkit 路径
@ -206,7 +208,7 @@ TASK="your task"
启动评估
```bash
bash examples/bloom/evaluate_bloom_7B_ptd.sh
bash tasks/evaluation/evaluate_bloom_7B_ptd.sh
```
MMLU评测得分
@ -233,141 +235,142 @@ Bloom-176B 训练的硬件配置:
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp37-cp37m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp37-cp37m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
# 安装其余依赖库
pip install -r requirements.txt
```
3. 准备预训练权重
下载 Bloom-176B [权重](https://huggingface.co/bigscience/bloom/tree/main)
下载 Bloom-176B [权重](https://huggingface.co/bigscience/bloom/tree/main)
```shell
mkdir ./model_from_hf/Bloom-176B/
cd ./model_from_hf/Bloom-176B/
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
```shell
mkdir ./model_from_hf/Bloom-176B/
cd ./model_from_hf/Bloom-176B/
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
4. 权重转换
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 5 \
--load-dir ./model_from_hf/Bloom-176B/ \
--save-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--tokenizer-model None \
--params-dtype bf16
# config.json中同字段对应的key值与其他模型不一致将文件中的n_embed改为hidden_size 将num_attention_heads修改为n_head。
```
```shell
python tools/checkpoint/util.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 5 \
--load-dir ./model_from_hf/Bloom-176B/ \
--save-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--tokenizer-model None \
--params-dtype bf16
# config.json中同字段对应的key值与其他模型不一致将文件中的n_embed改为hidden_size 将num_attention_heads修改为n_head。
```
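如需按上面注释调整 config.json 中的键名,可参考下面的示意脚本(假设性示例:运行前请先备份原文件,且仅在对应键存在时才改写):

```bash
# 示意:备份并改写 Bloom-176B config.json 中与转换脚本不一致的键名
cd ./model_from_hf/Bloom-176B/
cp config.json config.json.bak
python3 - <<'EOF'
import json

with open("config.json") as f:
    cfg = json.load(f)

# n_embed -> hidden_sizenum_attention_heads -> n_head
if "n_embed" in cfg:
    cfg["hidden_size"] = cfg.pop("n_embed")
if "num_attention_heads" in cfg:
    cfg["n_head"] = cfg.pop("num_attention_heads")

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)
EOF
cd ../../
```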
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--params-dtype bf16 \
--save-dir ./model_from_hf/Bloom-176B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Bloom-176B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--params-dtype bf16 \
--save-dir ./model_from_hf/Bloom-176B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Bloom-176B/mg2hg/
```
5. 准备数据集
下载 Bloom 176B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
下载 Bloom 176B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# 下载数据
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Bloom-176B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-176B/ \
--output-prefix ./dataset/Bloom-176B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# 处理数据
mkdir ./dataset/Bloom-176B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-176B/ \
--output-prefix ./dataset/Bloom-176B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. 配置 Bloom-176B 预训练脚本(Bloom-176B暂不支持Flash Attention): examples/bloom/pretrain_bloom_176b.sh
```shell
# 修改 MASTER_ADDR 为主节点
MASTER_ADDR=localhost
```shell
# 修改 MASTER_ADDR 为主节点
MASTER_ADDR=localhost
# 修改每个节点的节点序号,主节点序号为 0, 其余节点的序号依次增长到集群节点数量-1
NODE_RANK=0
# 修改每个节点的节点序号,主节点序号为 0, 其余节点的序号依次增长到集群节点数量-1
NODE_RANK=0
# 修改数据集路径和词表路径
TOKENIZER_NAME_OR_PATH=./model_from_hf/Bloom-176B/
DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
```
# 修改数据集路径和词表路径
TOKENIZER_NAME_OR_PATH=./model_from_hf/Bloom-176B/
DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
```
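各节点上只需差异化配置节点序号,其余保持一致。下面是一个示意(节点数与 IP 均为假设值,按 tp=8、pp=5 估算通常需要 5 台 8 卡节点,实际以集群规划为准):

```bash
# 主节点node 0上的示意配置IP 为占位值,请替换为主节点实际 IP
MASTER_ADDR=192.0.2.1
NNODES=5
NODE_RANK=0

# 其余节点node 1 ~ node 4MASTER_ADDR、NNODES 保持一致,仅 NODE_RANK 不同
MASTER_ADDR=192.0.2.1
NNODES=5
NODE_RANK=1   # 依次为 1、2、3、4
```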
7. 启动 Bloom-176B 预训练脚本: examples/bloom/pretrain_bloom_176b.sh
在集群中的每个节点上启动 examples/bloom/pretrain_bloom_176b.sh 脚本
在集群中的每个节点上启动 examples/bloom/pretrain_bloom_176b.sh 脚本
```shell
bash examples/bloom/pretrain_bloom_176b.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
```shell
bash examples/bloom/pretrain_bloom_176b.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
## 性能
@ -382,8 +385,7 @@ Bloom-176B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
## 推理
首先配置Bloom-176B 推理脚本: examples/bloom/generate_bloom_ptd_176B.sh
首先配置Bloom-176B 推理脚本: tasks/inference/generate_bloom_ptd_176B.sh
bloom 176b的推理需要5机因此要用上面的 权重转换脚本重新切分tp=8pp=5
```bash
@ -398,16 +400,16 @@ TOKENIZER_PATH="./model_from_hf/Bloom-176B/"
然后可直接启动generate_bloom_176b_ptd.sh
```bash
bash examples/bloom/generate_bloom_176b_ptd.sh
bash tasks/inference/generate_bloom_176b_ptd.sh
```
推理示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/bloom/bloom176b-generate.png)
![Inference](../../sources/images/bloom/bloom176b-generate.png)
## 评估
配置Bloom-176B 评估脚本: examples/bloom/evaluate_bloom_176b_ptd.sh
配置Bloom-176B 评估脚本: tasks/evaluation/evaluate_bloom_176b_ptd.sh
```bash
# ascend-toolkit 路径
@ -424,7 +426,7 @@ TASK="your task"
启动评估
```bash
bash examples/bloom/evaluate_bloom_176B_ptd.sh
bash tasks/evaluation/evaluate_bloom_176B_ptd.sh
```
评测得分

View File

@ -1,4 +1,4 @@
# Bloom
# Bloom $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
@ -6,6 +6,7 @@
</p>
</p>
[toc]
# Bloom-7B
@ -21,135 +22,137 @@ Here's a hardware summary of pre-training Bloom-7B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the Bloom-7B checkpoint from [here](https://huggingface.co/bigscience/bloom-7b1/tree/main)
Download the Bloom-7B checkpoint from [here](https://huggingface.co/bigscience/bloom-7b1/tree/main)
```shell
mkdir ./model_from_hf/Bloom-7B/
cd ./model_from_hf/Bloom-7B/
cd tokenizer
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
```shell
mkdir ./model_from_hf/Bloom-7B/
cd ./model_from_hf/Bloom-7B/
cd tokenizer
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
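If you prefer to fetch the complete set of files in one step instead of wget-ing them one by one, cloning the repository with Git LFS is an alternative (a sketch: it assumes git-lfs is installed, the target directory does not exist yet, and the full download is large):

```bash
# Optional alternative: clone the whole bigscience/bloom-7b1 repo with Git LFS
git lfs install
git clone https://huggingface.co/bigscience/bloom-7b1 ./model_from_hf/Bloom-7B
ls ./model_from_hf/Bloom-7B   # expect config.json, tokenizer files and the weight shards
```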
4. Weights convert
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Bloom-7B/ \
--save-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--tokenizer-model None
```
```shell
python tools/checkpoint/util.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Bloom-7B/ \
--save-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--tokenizer-model None
```
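After the conversion finishes, a quick sanity check of the save directory can catch problems before training. The sketch below assumes the usual Megatron checkpoint layout (a `latest_checkpointed_iteration.txt` file plus one `mp_rank_*` folder per tensor-parallel rank); exact names may differ with the converter version:

```bash
# Sanity-check the converted Megatron checkpoint (layout may vary by version)
CKPT_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1"
cat "$CKPT_DIR/latest_checkpointed_iteration.txt"   # typically "release" for converted weights
ls "$CKPT_DIR"/release/                             # expect mp_rank_00 ... mp_rank_07 for tp8/pp1
```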
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--save-dir ./model_from_hf/Bloom-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Bloom-7B/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--save-dir ./model_from_hf/Bloom-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Bloom-7B/mg2hg/
```
5. Prepare dataset
Download the Bloom-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
Download the Bloom-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# download datasets
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# prepare datasets
mkdir ./dataset/Bloom-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-7B/ \
--output-prefix ./dataset/Bloom-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# prepare datasets
mkdir ./dataset/Bloom-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-7B/ \
--output-prefix ./dataset/Bloom-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Config Bloom-7B pre-training script(Bloom-7B does not support Flash Attention) : examples/bloom/pretrain_bloom_ptd_7B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Bloom-7B/"
DATA_PATH="./dataset/Bloom-7B/alpaca_text_document"
TOKENIZER_PATH="./model_from_hf/Bloom-7B/"
CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
```
CKPT_SAVE_DIR="./ckpt/Bloom-7B/"
DATA_PATH="./dataset/Bloom-7B/alpaca_text_document"
TOKENIZER_PATH="./model_from_hf/Bloom-7B/"
CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
```
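As a hypothetical follow-up, once a run has written checkpoints into `CKPT_SAVE_DIR`, you can resume from them instead of the converted HuggingFace weights by pointing the load path at the previous save path (and, depending on the flags in your script, removing options such as `--no-load-optim` / `--no-load-rng` if you also want the optimizer state restored):

```bash
# Hypothetical resume setup: load from the directory a previous run saved into
CKPT_SAVE_DIR="./ckpt/Bloom-7B/"
CKPT_LOAD_DIR="./ckpt/Bloom-7B/"   # first run used ./model_weights/Bloom-7B-v0.1-tp8-pp1/
```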
7. Launch Bloom-7B pre-training script: examples/bloom/pretrain_bloom_ptd_7B.sh
```shell
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes decide from the distributed parameters whether they need to load data, and check and generate the corresponding cached data.
```shell
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes decide from the distributed parameters whether they need to load data, and check and generate the corresponding cached data.
### Performance
@ -166,9 +169,7 @@ The performance of Bloom-7B in **Ascend NPU** and **Reference**:
## Inference Bloom-7B
Config Bloom-7B inference script: examples/bloom/generate_bloom_7b_ptd.sh
Config Bloom-7B inference script: tasks/inference/generate_bloom_7b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -177,20 +178,17 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
CHECKPOINT="./model_weights/Bloom-7B-Base-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Bloom-7B-Base/"
```
Launch Bloom-7B inference script: examples/bloom/generate_bloom_7b_ptd.sh
Launch Bloom-7B inference script: tasks/inference/generate_bloom_7b_ptd.sh
```bash
bash examples/bloom/generate_bloom_7b_ptd.sh
bash tasks/inference/generate_bloom_7b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/bloom/bloom7b-generate.png)
![Inference](../../sources/images/bloom/bloom7b-generate.png)
## Evaluation Bloom-7B
Config Bloom-7B evaluation script: examples/bloom/evaluate_bloom_7B_ptd.sh
Config Bloom-7B evaluation script: tasks/evaluation/evaluate_bloom_7B_ptd.sh
```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -206,7 +204,7 @@ TASK="your task"
Launch Bloom-7B evaluation script:
```bash
bash examples/bloom/evaluate_bloom_7B_ptd.sh
bash tasks/evaluation/evaluate_bloom_7B_ptd.sh
```
Evaluation results
@ -236,142 +234,144 @@ Here's a hardware summary of pre-training Bloom-176B:
1. Clone the repository to your local server
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the Bloom-176B tokenizer from [here](https://huggingface.co/bigscience/bloom/tree/main).
Download the Bloom-176B tokenizer from [here](https://huggingface.co/bigscience/bloom/tree/main).
```shell
mkdir ./model_from_hf/Bloom-176B/
cd ./model_from_hf/Bloom-176B/
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
```shell
mkdir ./model_from_hf/Bloom-176B/
cd ./model_from_hf/Bloom-176B/
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
4. Weights convert
5. Weights convert
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 5 \
--load-dir ./model_from_hf/Bloom-176B/ \
--save-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--tokenizer-model None \
--params-dtype bf16
```
```shell
python tools/checkpoint/util.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 5 \
--load-dir ./model_from_hf/Bloom-176B/ \
--save-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--tokenizer-model None \
--params-dtype bf16
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--params-dtype bf16 \
--save-dir ./model_from_hf/Bloom-176B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Bloom-176B/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--params-dtype bf16 \
--save-dir ./model_from_hf/Bloom-176B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Bloom-176B/mg2hg/
```
5. Prepare dataset
Download the bloom-176b datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
Download the bloom-176b datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# download datasets
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Bloom-176B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-176B/ \
--output-prefix ./dataset/Bloom-176B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# process datasets
mkdir ./dataset/Bloom-176B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-176B/ \
--output-prefix ./dataset/Bloom-176B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Config Bloom-176B pre-training script(Bloom-176B does not support Flash Attention): examples/bloom/pretrain_bloom_176b.sh
```shell
# modify MASTER_ADDR to the IP address of the master node in the cluster.
# the master node is localhost, and the other nodes are the IP address of the master node
MASTER_ADDR=localhost
```shell
# modify MASTER_ADDR to the IP address of the master node in the cluster.
# the master node is localhost, and the other nodes are the IP address of the master node
MASTER_ADDR=localhost
# modify the rank number of a node. The rank number of the master node is 0, and the rank number of other nodes increases in ascending order.
NODE_RANK=0
# modify the rank number of a node. The rank number of the master node is 0, and the rank number of other nodes increases in ascending order.
NODE_RANK=0
# modify the datasets path and tokenizer path
TOKENIZER_NAME_OR_PATH=./model_from_hf/Bloom-176B/
DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
```
# modify the datasets path and tokenizer path
TOKENIZER_NAME_OR_PATH=./model_from_hf/Bloom-176B/
DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
```
7. Launch Bloom-176B pre-training script: examples/bloom/pretrain_bloom_176b.sh
Run the examples/bloom/pretrain_bloom_176b.sh on all nodes in the cluster.
Run the examples/bloom/pretrain_bloom_176b.sh on all nodes in the cluster.
```shell
bash examples/bloom/pretrain_bloom_176b.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes decide from the distributed parameters whether they need to load data, and check and generate the corresponding cached data.
```shell
bash examples/bloom/pretrain_bloom_176b.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes decide from the distributed parameters whether they need to load data, and check and generate the corresponding cached data.
## Performance
@ -386,9 +386,7 @@ The performance of Bloom-176B in **Ascend NPU** and **Reference**:
## Inference Bloom 176B
Config Bloom-176B inference script: examples/bloom/generate_bloom_176b_ptd.sh
Config Bloom-176B inference script: tasks/inference/generate_bloom_176b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -397,23 +395,20 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
CHECKPOINT="./model_weights/Bloom-176B-v0.1-tp8-pp5/"
TOKENIZER_PATH="./model_from_hf/Bloom-176B/"
```
Launch Bloom-176B inference script: examples/bloom/generate_bloom_176b_ptd.sh
Launch Bloom-176B inference script: tasks/inference/generate_bloom_176b_ptd.sh
Bloom-176B needs 5 machines for inference, so you need to convert the weights again with the conversion script above, setting tp=8 and pp=5.
```bash
bash examples/bloom/generate_bloom_176b_ptd.sh
bash tasks/inference/generate_bloom_176b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/bloom/bloom176b-generate.png)
![Inference](../../sources/images/bloom/bloom176b-generate.png)
## Evaluation Bloom 176B
Config Bloom-176B evaluation script: examples/bloom/evaluate_bloom_176B_ptd.sh
Config Bloom-176B evaluation script: tasks/evaluation/evaluate_bloom_176B_ptd.sh
```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -429,7 +424,7 @@ TASK="your task"
Launch Bloom-176B evaluation script:
```bash
bash examples/bloom/evaluate_bloom_176B_ptd.sh
bash tasks/evaluation/evaluate_bloom_176B_ptd.sh
```
Evaluation results

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export HCCL_CONNECT_TIMEOUT=1200
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -49,6 +50,7 @@ GPT_ARGS="
--lr 1.2e-4 \
--train-iters 5000 \
--init-method-std 0.0048 \
--optimize-recomp-communication-level 2 \
--hidden-dropout 0.0 \
--position-embedding-type alibi \
--normalization LayerNorm \
@ -87,6 +89,5 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_bloom_176b.log

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export HCCL_CONNECT_TIMEOUT=1200
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -44,16 +45,14 @@ GPT_ARGS="
--make-vocab-size-divisible-by 1 \
--attention-softmax-in-fp32 \
--apply-query-key-layer-scaling \
--lr 1.2e-6 \
--train-iters 2000 \
--lr 1.2e-4 \
--train-iters 200 \
--init-method-std 0.0048 \
--hidden-dropout 0.0 \
--attention-dropout 0.0 \
--position-embedding-type alibi \
--normalization LayerNorm \
--min-lr 1e-8 \
--lr-decay-iters 430000 \
--lr-decay-style cosine \
--min-lr 6e-6 \
--lr-decay-iters 200 \
--weight-decay 1e-1 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
@ -83,6 +82,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_bloom_7b.log

View File

@ -1,296 +0,0 @@
# ChatGLM3
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
# 目录
- [ChatGLM3](#ChatGLM3)
- [目录](#目录)
- [ChatGLM3-6B](#ChatGLM3-6B)
- [训练-6B](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理-6B](#推理-6B)
- [评估-6B](#评估-6B)
# ChatGLM3-6B
## 训练
ChatGLM3-6B 训练的硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 8 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 ChatGLM3-6B 的 [预训练权重和词表](https://huggingface.co/THUDM/chatglm3-6b/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/chatglm3_6b_hf/
cd ./model_from_hf/chatglm3_6b_hf/
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/config.json
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/configuration_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/modeling_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00001-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00002-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00003-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00004-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00005-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00006-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00007-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/quantization.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenization_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenizer.model
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenizer_config.json
cd ../../
```
4. 权重转换
4.1 将权重从 huggingface 格式转化为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```bash
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 权重格式转换
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader chatglm3_hf \
--saver megatron \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/chatglm3_6b_hf/ \
--save-dir ./model_weights/chatglm3_6b_tp1pp2/ \
--tokenizer-model ./model_from_hf/chatglm3_6b_hf/tokenizer.model \
--add-qkv-bias
```
注意chatglm3的--target-tensor-parallel-size跟config.json中的multi_query_attention配置有关这里multi_query_attention设置的是2。
4.2 任意并行切分策略的 Megatron 权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_chatglm3 \
--load-dir ./model_weights/chatglm3_6b_tp1pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--save-dir ./model_from_hf/chatglm3_6b_hf/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/chatglm3_6b_hf/mg2hg/
```
5. 预训练
5.1 准备数据集
下载 ChatGLM3-6B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/chatglm3_6b_hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/chatglm3_6b_hf/ \
--output-prefix ./dataset/chatglm3_6b_hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 用ptd模式预训练
配置ChatGLM3-6B PTD 预训练脚本: examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数加载和保存路径
LOAD_CHECKPOINT_PATH="./model_weights/chatglm3_6b_tp1pp2/"
SAVE_CHECKPOINT_PATH="./ckpt/chatglm3_6b_hf/"
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/" #词表路径
DATA_PATH="./dataset/chatglm3_6b_hf/alpaca_text_document" #数据集路径
```
多机运行增加参数--overlap-grad-reduce
启动 ChatGLM3-6B PTD预训练脚本: examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```shell
bash examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/chatglm3-6b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/chatglm3_6b_hf/ \
--output-prefix ./finetune_dataset/chatglm3-6b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数--is-instruction-dataset*
增加微调参数--finetune增加权重加载参数--load使微调从第一步开始。使用--tokenizer-padding-side left。修改tokenizer参数更改为以下参数
```bash
DATA_PATH="./finetune_dataset/chatglm3-6b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/chatglm3-6b-hf/"
CKPT_LOAD_DIR="./model_weights/chatglm3_6b_tp1pp2/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-padding-side left \
--tokenizer-type PretrainedFromHF \
--tokenizer-not-use-fast \
```
启动 ChatGLM3-6B 全参微调脚本: examples/chatglm3/tune_chatglm3_6B_8K.sh
```shell
bash examples/chatglm3/tune_chatglm3_6B_8K.sh
```
### 性能
#### 吞吐
ChatGLM3-6B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 序列长度 | tokens吞吐 (tokens/s/p) |
| :--: | :---------: | :------: | :---------------------: |
| NPUs | ChatGLM3-6B | 8192 | 4297 |
| 参考 | ChatGLM3-6B | 8192 | 4269 |
## 推理
我们在ChatGLM3_6B中支持推理来生成文本。
推理不同于预训练,比如我们需要加载预训练检查点和输出样本的长度:
配置 ChatGLM3-6B 推理脚本: examples/chatglm3/generate_chatglm3_6B.sh
```shell
# 修改模型权重路径以及词表路径
CHECKPOINT="./model_weights/chatglm3_6b_tp1pp2/"
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/"
```
启动推理脚本
```shell
bash ./examples/chatglm3/generate_chatglm3_6B.sh
```
推理结果示例如下:
![ChatGLM3-6B-generate.png](https://gitee.com/ascend/ModelLink/raw/master/sources/images/chatglm3/ChatGLM3-6B-generate.png)
## 评估
使用mmlu基准来评估模型。mmlu基准[下载](https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU/data/test).
因评估代码限制,参考 4.1权重转换 设置--target-tensor-parallel-size 2 --target-pipeline-parallel-size 4做权重转换保存新权重到chatglm3_6b_tp2pp4目录。
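参考 4.1 的命令改为 tp=2、pp=4 并保存到新目录的转换命令大致如下(示意,除并行度与保存路径外,其余参数与 4.1 保持一致):

```bash
# 示意:按 4.1 的权重转换命令改为 tp=2、pp=4保存到 chatglm3_6b_tp2pp4
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader chatglm3_hf \
    --saver megatron \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 4 \
    --load-dir ./model_from_hf/chatglm3_6b_hf/ \
    --save-dir ./model_weights/chatglm3_6b_tp2pp4/ \
    --tokenizer-model ./model_from_hf/chatglm3_6b_hf/tokenizer.model \
    --add-qkv-bias
```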
配置chatglm3-6b评估脚本: examples/chatglm3/evaluate_chatglm3_6B.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/" #词表路径
CHECKPOINT="./model_weights/chatglm3_6b_tp2pp4/" #模型路径
# 配置任务和数据集路径
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
启动评估
```bash
bash examples/chatglm3/evaluate_chatglm3_6B.sh
```
| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
|:---:|:---:|:---:|:-----------------------------------------:|:------:|
| MMLU | 57 | 14042 | [61.4](https://github.com/THUDM/ChatGLM3) | 61.5 |

View File

@ -1,291 +0,0 @@
# ChatGLM
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
# Contents
- [ChatGLM3](#ChatGLM3)
- [Contents](#contents)
- [ChatGLM3-6B](#ChatGLM3-6b)
- [Training-6B](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference-6B](#inference-6b)
- [Evaluation-6B](#evaluation-6b)
# ChatGLM3-6B
## Training
Here's a hardware summary of pre-training ChatGLM3-6B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights and tokenizer
Download the ChatGLM3-6B checkpoint from [here](https://huggingface.co/THUDM/chatglm3-6b/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/chatglm3_6b_hf/
cd ./model_from_hf/chatglm3_6b_hf/
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/config.json
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/configuration_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/modeling_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00001-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00002-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00003-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00004-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00005-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00006-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00007-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/quantization.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenization_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenizer.model
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenizer_config.json
cd ../../
```
4. weight conversion in ptd mode
4.1 Convert weights from HuggingFace format to Megatron format
***This scenario is generally used to enable the open-source HuggingFace model to be trained on Megatron***
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert to ptd weights
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader chatglm3_hf \
--saver megatron \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/chatglm3_6b_hf/ \
--save-dir ./model_weights/chatglm3_6b_tp1pp2/ \
--tokenizer-model ./model_from_hf/chatglm3_6b_hf/tokenizer.model \
--add-qkv-bias
```
Note: The --target-tensor-parallel-size for chatglm3 is constrained by the multi-query attention configuration in config.json; the number of query groups configured here is 2.
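A quick way to check the relevant fields before choosing the conversion parallelism is to print the multi-query attention settings from the downloaded config (a sketch; the key names below are taken from the chatglm3-6b HuggingFace config and should be treated as an assumption if your config differs):

```bash
# Print the multi-query attention settings from the downloaded config.json
python3 - <<'EOF'
import json

with open("./model_from_hf/chatglm3_6b_hf/config.json") as f:
    cfg = json.load(f)

# Key names assumed from the chatglm3-6b HuggingFace config
print("multi_query_attention :", cfg.get("multi_query_attention"))
print("multi_query_group_num :", cfg.get("multi_query_group_num"))
EOF
```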
4.2 Any Megatron weights with parallel slicing strategy --> HuggingFace weights
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_chatglm3 \
--load-dir ./model_weights/chatglm3_6b_tp1pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--save-dir ./model_from_hf/chatglm3_6b_hf/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/chatglm3_6b_hf/mg2hg/
```
5. pre-training
5.1 Prepare dataset
Download the ChatGLM3-6B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/chatglm3_6b_hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/chatglm3_6b_hf/ \
--output-prefix ./dataset/chatglm3_6b_hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 pre-training using ptd mode
Config ChatGLM3-6B pre-training script: examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
LOAD_CHECKPOINT_PATH="./model_weights/chatglm3_6b_tp1pp2/"
SAVE_CHECKPOINT_PATH="./ckpt/chatglm3_6b_hf/"
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/" #tokenizer path
DATA_PATH="./dataset/chatglm3_6b_hf/alpaca_text_document" #processed dataset
```
Multi-machine training requires adding the parameter --overlap-grad-reduce, as sketched below.
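The launch command in the script would gain the flag roughly as follows (a sketch; keep the remaining arguments exactly as they are in examples/chatglm3/pretrain_chatglm3_6B_8K.sh):

```bash
# Sketch: add --overlap-grad-reduce for multi-machine runs
python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --overlap-grad-reduce \
    --save $CKPT_SAVE_DIR \
    | tee logs/train_chatglm3_6B_8K.log
```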
Launch ChatGLM3-6B pre-training script: examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```shell
bash examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes decide from the distributed parameters whether they need to load data, and check and generate the corresponding cached data.
6. fine-tuning
6.1 Prepare fine-tuning dataset
Download the alpaca datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/chatglm3-6b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/chatglm3_6b_hf/ \
--output-prefix ./finetune_dataset/chatglm3-6b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_chatglm3_6B_8K.sh. *The difference is that the fine-tuning dataset is used and the training parameter `--is-instruction-dataset` is added.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step, and use `--tokenizer-padding-side left`.
```bash
DATA_PATH="./finetune_dataset/chatglm3-6b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/chatglm3-6b-hf/"
CKPT_LOAD_DIR="./model_weights/chatglm3_6b_tp1pp2/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-padding-side left \
--tokenizer-type PretrainedFromHF \
--tokenizer-not-use-fast \
```
Launch ChatGLM3-6B finetune script: examples/chatglm3/tune_chatglm3_6B_8K.sh
```shell
bash examples/chatglm3/tune_chatglm3_6B_8K.sh
```
### Performance
#### Machine performance
The performance of ChatGLM3-6B in **Ascend NPU** and **Reference**:
| Device | Model | sequence length | throughput rate (tokens/s/p) |
| :--: | :--------: | :--------:|:---------------------:|
| NPUs | ChatGLM3-6B | 8192 | 4297 |
| Reference | ChatGLM3-6B | 8192 | 4269 |
## Inference
We support inference for text generation with ChatGLM3-6B.
Inference differs from pre-training; for example, it requires loading the pre-trained checkpoint and setting the length of the output samples:
Config ChatGLM3-6B inference script: examples/chatglm3/generate_chatglm3_6B.sh
```shell
# modify the model weight path and tokenizer path
CHECKPOINT="./model_weights/chatglm3_6b_tp1pp2/"
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/"
```
Launch ChatGLM3-6B inference script.
```shell
bash ./examples/chatglm3/generate_chatglm3_6B.sh
```
Some inference samples are as follows:
![ChatGLM3-6B-generate.png](https://gitee.com/ascend/ModelLink/raw/master/sources/images/chatglm3/ChatGLM3-6B-generate.png)
## Evaluation
Use mmlu benchmark to evaluate our model. MMLU benchmark Download [here](https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU/data/test).
Config chatglm3-6b evaluation script: examples/chatglm3/evaluate_chatglm3_6B.sh
```bash
# ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model parameter path and vocabulary path
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/" # vocabulary path
CHECKPOINT="./model_weights/chatglm3_6b_tp2pp4/" # parameter path
# Configure the task type and dataset path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
Launch chatglm3-6b evaluation
```bash
bash examples/chatglm3/evaluate_chatglm3_6B.sh
```
| Task | Subset | Question | OpenSource | NPU |
|:---:|:---:|:---:|:-----------------------------------------:|:------:|
| MMLU | 57 | 14042 | [61.4](https://github.com/THUDM/ChatGLM3) | 61.5 |

View File

@ -1,59 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
DATA_PATH="./mmlu/data/test"
TASK="mmlu"
# Different tasks need different max_new_tokens values; please follow the instructions in the README.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--seq-length 8192 \
--max-new-tokens 1 \
--max-position-embeddings 32768 \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 4 \
--num-layers 28 \
--hidden-size 4096 \
--ffn-hidden-size 13696 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 2 \
--disable-bias-linear \
--add-qkv-bias \
--swiglu \
--padded-vocab-size 65024 \
--make-vocab-size-divisible-by 1 \
--position-embedding-type rope \
--use-partial-rope \
--load $CHECKPOINT \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--untie-embeddings-and-output-weights \
--seed 42 \
| tee logs/eval_chatglm3_6B_${TASK}.log

View File

@ -1,62 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
TOKENIZER_MODEL="your tokenizer.model file path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=2
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--num-layers 28 \
--hidden-size 4096 \
--ffn-hidden-size 13696 \
--seq-length 8192 \
--group-query-attention \
--num-query-groups 2 \
--num-attention-heads 32 \
--padded-vocab-size 65024 \
--make-vocab-size-divisible-by 1 \
--max-position-embeddings 32768 \
--position-embedding-type rope \
--use-partial-rope \
--disable-bias-linear \
--add-qkv-bias \
--swiglu \
--normalization RMSNorm \
--max-new-tokens 256 \
--micro-batch-size 1 \
--global-batch-size 16 \
--load "${CHECKPOINT}" \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path "${TOKENIZER_PATH}" \
--tokenizer-model "${TOKENIZER_MODEL}" \
--tokenizer-not-use-fast \
--untie-embeddings-and-output-weights \
--attention-softmax-in-fp32 \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--no-gradient-accumulation-fusion \
--exit-on-missing-checkpoint \
--seed 42 \
--fp16 \
| tee logs/generate_chatglm3_6B.log

View File

@ -1,96 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$((NPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_PATH="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=1
PP=2
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--num-layers 28 \
--hidden-size 4096 \
--ffn-hidden-size 13696 \
--num-attention-heads 32 \
--seq-length 8192 \
--micro-batch-size 1 \
--global-batch-size 128 \
--max-position-embeddings 32768 \
--padded-vocab-size 65024 \
--make-vocab-size-divisible-by 1 \
--group-query-attention \
--num-query-groups 2 \
--disable-bias-linear \
--add-qkv-bias \
--position-embedding-type rope \
--use-partial-rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--swiglu \
--use-fused-swiglu \
--use-flash-attn \
--use-distributed-optimizer \
--use-mc2 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--lr 1e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--untie-embeddings-and-output-weights \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1e-8 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--initial-loss-scale 4096 \
--adam-beta2 0.95 \
--no-gradient-accumulation-fusion \
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
--no-load-rng \
--fp16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 1000 \
--eval-iters 10 \
"
python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--save $CKPT_SAVE_DIR \
| tee logs/train_chatglm3_6B_8K.log

View File

@ -1,99 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$((NPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_PATH="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=1
PP=2
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--num-layers 28 \
--hidden-size 4096 \
--ffn-hidden-size 13696 \
--num-attention-heads 32 \
--seq-length 8192 \
--micro-batch-size 1 \
--global-batch-size 128 \
--max-position-embeddings 32768 \
--padded-vocab-size 65024 \
--make-vocab-size-divisible-by 1 \
--group-query-attention \
--num-query-groups 2 \
--disable-bias-linear \
--add-qkv-bias \
--position-embedding-type rope \
--use-partial-rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--swiglu \
--use-fused-swiglu \
--use-distributed-optimizer \
--use-mc2 \
--finetune \
--is-instruction-dataset \
--tokenizer-padding-side left \
--tokenizer-type PretrainedFromHF \
--tokenizer-not-use-fast \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--lr 1e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--untie-embeddings-and-output-weights \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1e-8 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--use-flash-attn \
--adam-beta1 0.9 \
--initial-loss-scale 4096 \
--adam-beta2 0.95 \
--no-gradient-accumulation-fusion \
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
--no-load-rng \
--fp16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 1000 \
--eval-iters 10 \
"
python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--save $CKPT_SAVE_DIR \
| tee logs/tune_chatglm3_6B_8K.log

View File

@ -1,295 +0,0 @@
# CodeLlama $\color{black}{\bf\tiny{【社区贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
# 目录
- [CodeLlama-34B](#codellama-34b)
- [训练](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理](#推理)
- [评估](#评估)
# CodeLlama-34B
## 训练
CodeLlama-34B 训练的硬件配置如下:
| 硬件 | 配置 |
|:---:|:---------------:|
| NPU | 16 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.2.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.2.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. (可选的)准备预训练权重
从 [huggingface](https://huggingface.co/codellama/CodeLlama-34b-hf/tree/main) 下载预训练权重
```shell
mkdir ./model_from_hf/CodeLlama-34B/
cd ./model_from_hf/CodeLlama-34B/
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/config.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/generation_config.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00001-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00002-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00003-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00004-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00005-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00006-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00007-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/special_tokens_map.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer.model
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer_config.json
cd ../../
```
4. 权重转换
4.1 将 CodeLlama-34B 模型权重从 huggingface 格式转换为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/CodeLlama-34B/ \
--save-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/ \
--tokenizer-model ./model_from_hf/CodeLlama-34B/tokenizer.model \
--params-dtype bf16
```
如果为单机8卡推理或者评估任务将`--target-pipeline-parallel-size`值设为`1`,将`--save-dir`值中的`pp2`改为`pp1`.
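For reference, a conversion command adjusted for single-node (8-NPU) inference or evaluation might look like the following sketch; only the two values described above change:
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/CodeLlama-34B/ \
--save-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/CodeLlama-34B/tokenizer.model \
--params-dtype bf16
```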
4.2 任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/CodeLlama-34B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/CodeLlama-34B/mg2hg/
```
5. 预训练
5.1 准备数据集
下载 CodeLlama-34B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
mkdir ./dataset/CodeLlama-34B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/CodeLlama-34B/ \
--output-prefix ./dataset/CodeLlama-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 预训练
配置 CodeLlama-34B 训练脚本: examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/CodeLlama-34B/"
DATA_PATH="./dataset/CodeLlama-34B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/CodeLlama-34B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/"
```
启动 CodeLlama-34B 训练脚本: examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```bash
bash examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
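A minimal sketch of where the flag goes: append it to GPT_ARGS inside the training script (variable name as used in the script above), for example:
```shell
# append after GPT_ARGS is defined in examples/codellama/pretrain_codellama_34b_ptd_16p.sh
GPT_ARGS="${GPT_ARGS} --no-shared-storage"
```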
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/CodeLlama-34B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/CodeLlama-34B/ \
--output-prefix ./finetune_dataset/CodeLlama-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数`--is-instruction-dataset`和`--padded-vocab-size 32000`*
增加微调参数`--finetune`使微调从第一步开始。修改tokenizer参数去掉`--tokenizer-type Llama2Tokenizer` 和`--tokenizer-model ${TOKENIZER_MODEL}`,更改为以下参数:
```bash
DATA_PATH="./finetune_dataset/CodeLlama-34B/alpaca"
TOKENIZER_PATH="./model_from_hf/CodeLlama-34B/"
CKPT_SAVE_DIR="./ckpt/CodeLlama-34B/"
CKPT_LOAD_DIR="./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/"
--finetune \
--is-instruction-dataset \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--padded-vocab-size 32000 \
```
### 性能
#### 吞吐
CodeLlama-34B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) |
|:----:|:------------:|:----:|:------------------:|:--------------------:|:---------------:|
| NPUs | CodeLlama-34B | - | 3.27 | 837 | 313 |
| 参考 | CodeLlama-34B | - | 2.97 | 762 | 344 |
## 推理
配置CodeLlama-34B的推理脚本: examples/codellama/generate_codellama_34b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/CodeLlama-34B/"
```
然后可直接启动generate_codellama_34b_ptd.sh
```bash
bash examples/codellama/generate_codellama_34b_ptd.sh
```
推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/codellama/codellama-34b-generate.png)
## 评估
我们使用human_eval基准来评估我们的模型。基准[下载](https://github.com/openai/human-eval/blob/master/data/HumanEval.jsonl.gz).
```shell
# 配置原始权重与词表的路径
CHECKPOINT=<origin-ckpt-path>
TOKENIZER_PATH=<tokenizer-path>
# 配置任务以及数据路径
DATA_PATH="./human_eval/"
TASK="human_eval"
```
```shell
bash ./examples/codellama/evaluate_codellama_34b_ptd.sh
```
<table>
<thead>
<tr>
<th>任务</th>
<th>模型</th>
<th>昇腾值</th>
<th>社区值</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://huggingface.co/datasets/openai_humaneval">human_eval</a></td>
<th>CodeLlama 34B</th>
<td>0.4878</td>
<td><a href="https://paperswithcode.com/sota/code-generation-on-humaneval">0.488</a></td>
</tr>
</tbody>
</table>

View File

@ -1,297 +0,0 @@
# CodeLlama $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Community】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
# Contents
- [CodeLlama-34B](#codellama-34b)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference](#inference)
- [Evaluation](#evaluation)
# CodeLlama-34B
## Training
Here's a hardware summary of pre-training CodeLlama-34B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 16 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.2.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.2.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the CodeLlama-34B checkpoint from [here](https://huggingface.co/codellama/CodeLlama-34b-hf/tree/main)
```shell
mkdir ./model_from_hf/CodeLlama-34B/
cd ./model_from_hf/CodeLlama-34B/
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/config.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/generation_config.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00001-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00002-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00003-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00004-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00005-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00006-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00007-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/special_tokens_map.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer.model
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer_config.json
cd ../../
```
4. Weights convert
4.1 In order to adapt to the CodeLlama-34B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/CodeLlama-34B/ \
--save-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/ \
--tokenizer-model ./model_from_hf/CodeLlama-34B/tokenizer.model \
--params-dtype bf16
```
For inference or evaluation tasks, set the `--target-pipeline-parallel-size` value to `1` and change the `pp2` value to `pp1` in the `--save-dir` value.
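For reference, a conversion command adjusted in this way might look like the following sketch; only the two values mentioned above change:
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/CodeLlama-34B/ \
--save-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/CodeLlama-34B/tokenizer.model \
--params-dtype bf16
```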
4.2 Convert Megatron weights of any parallel slicing strategy into HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/CodeLlama-34B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/CodeLlama-34B/mg2hg/
```
5. Pre-training
5.1 Prepare dataset
Download the CodeLlama-34B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/CodeLlama-34B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/CodeLlama-34B/ \
--output-prefix ./dataset/CodeLlama-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 Pre-training
Config CodeLlama-34B pre-training script : examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/CodeLlama-34B/"
DATA_PATH="./dataset/CodeLlama-34B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/CodeLlama-34B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/CodeLlama-34B-v0.1-tp8-pp2/"
```
Launch CodeLlama-34B pre-training script: examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```shell
bash examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```
**Note**: For multi-machine training, either set up shared storage so that non-master nodes can read the data prepared on the master node, or copy the processed data from the master node to every non-master node, as sketched below.
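For example, copying the processed dataset from the master node to a worker node could look like the following sketch, in which the hostname and destination path are placeholders:
```shell
# run on the master node; worker-node-1 and the destination path are placeholders
rsync -av ./dataset/CodeLlama-34B/ worker-node-1:/path/to/ModelLink/dataset/CodeLlama-34B/
```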
6. Fine-tuning
6.1 Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/CodeLlama-34B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/CodeLlama-34B/ \
--output-prefix ./finetune_dataset/CodeLlama-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_codellama_34b_ptd_16p.sh. *The differences are the dataset and the additional training parameters `--is-instruction-dataset` and `--padded-vocab-size 32000`.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step. Remove the tokenizer parameters `--tokenizer-type Llama2Tokenizer` and `--tokenizer-model ${TOKENIZER_MODEL}`, and change them to the following parameters (a usage sketch follows the block below):
```bash
DATA_PATH="./finetune_dataset/CodeLlama-34B/alpaca"
TOKENIZER_PATH="./model_from_hf/CodeLlama-34B/"
CKPT_SAVE_DIR="./ckpt/CodeLlama-34B/"
CKPT_LOAD_DIR="./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/"
--finetune \
--is-instruction-dataset \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--padded-vocab-size 32000 \
```
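A usage sketch follows; the tune script name below is an assumption, since this section only describes modifying the pre-training script:
```bash
# copy the pre-training script and apply the path and parameter changes listed above
cp examples/codellama/pretrain_codellama_34b_ptd_16p.sh examples/codellama/tune_codellama_34b_ptd_16p.sh
# after editing the copy, launch it
bash examples/codellama/tune_codellama_34b_ptd_16p.sh
```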
### Performance
#### Machine performance
The performance of CodeLlama-34B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (samples/s) | throughput rate (tokens/s/p) | single-step time (s/step) |
|:----:|:---------:|:----:|:---------------------:|:---------------:|:----------------:|
| NPUs | CodeLlama-34B | - | 3.27 | 837 | 313 |
| Reference | CodeLlama-34B | - | 2.97 | 762 | 344 |
## Inference
Config CodeLlama-34B inference script: examples/codellama/generate_codellama_34b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/CodeLlama-34B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/CodeLlama-34B/"
```
Launch CodeLlama-34B inference script: examples/codellama/generate_codellama_34b_ptd.sh
```bash
bash examples/codellama/generate_codellama_34b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/codellama/codellama-34b-generate.png)
## Evaluation
We use the HumanEval benchmark to evaluate our model. Benchmark [Download](https://github.com/openai/human-eval/blob/master/data/HumanEval.jsonl.gz). A filled-in configuration example is shown after the block below.
```shell
# config origin weight and vocab file path
CHECKPOINT=<origin-ckpt-path>
TOKENIZER_PATH=<tokenizer-path>
# config tasks and dataset path
DATA_PATH="./human_eval/"
TASK="human_eval"
```
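For reference, a filled-in configuration might look like the following; the paths assume the directories created in the download and conversion steps above, with `pp1` weights for single-node evaluation:
```shell
# example values only, adjust to your environment
CHECKPOINT="./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/CodeLlama-34B/"
DATA_PATH="./human_eval/"
TASK="human_eval"
```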
```shell
bash ./examples/codellama/evaluate_codellama_34b_ptd.sh
```
<table>
<thead>
<tr>
<th>Task</th>
<th>Model</th>
<th>NPU</th>
<th>OpenSource</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://huggingface.co/datasets/openai_humaneval">human_eval</a></td>
<th>CodeLlama 34B</th>
<td>0.4878</td>
<td><a href="https://paperswithcode.com/sota/code-generation-on-humaneval">0.488</a></td>
</tr>
</tbody>
</table>

View File

@ -1,59 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT="Your ckpt file path"
TOKENIZER_PATH="Your tokenizer path"
DATA_PATH="./human_eval/"
TASK="human_eval"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task $TASK\
--seq-length 4096 \
--max-new-tokens 1024 \
--max-position-embeddings 16384 \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 48 \
--hidden-size 8192 \
--ffn-hidden-size 22016 \
--num-attention-heads 64 \
--disable-bias-linear \
--swiglu \
--position-embedding-type rope \
--load ${CHECKPOINT} \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--use-fused-rmsnorm \
--exit-on-missing-checkpoint \
--padded-vocab-size 32000 \
--no-load-rng \
--no-load-optim \
--untie-embeddings-and-output-weights \
--no-masked-softmax-fusion \
--make-vocab-size-divisible-by 1 \
--group-query-attention \
--num-query-groups 8 \
--rotary-base 1000000 \
--instruction-template "{prompt}" \
--seed 42 | tee logs/evaluation_codellama_34b_${TASK}.log

View File

@ -1,59 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 48 \
--hidden-size 8192 \
--ffn-hidden-size 22016 \
--position-embedding-type rope \
--seq-length 4096 \
--max-new-tokens 256 \
--micro-batch-size 1 \
--global-batch-size 8 \
--num-attention-heads 64 \
--max-position-embeddings 16384 \
--swiglu \
--load "${CHECKPOINT}" \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path "${TOKENIZER_PATH}" \
--tokenizer-not-use-fast \
--fp16 \
--normalization RMSNorm \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-softmax-in-fp32 \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--no-gradient-accumulation-fusion \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 32 \
--vocab-size 32000 \
--padded-vocab-size 32000 \
--rotary-base 1000000 \
--group-query-attention \
--num-query-groups 8 \
| tee logs/generate_codellama_34b.log

View File

@ -1,98 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=2
NODE_RANK=0
WORLD_SIZE=$((NPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=2
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--num-layers 48 \
--hidden-size 8192 \
--ffn-hidden-size 22016 \
--num-attention-heads 64 \
--tokenizer-type Llama2Tokenizer \
--tokenizer-model ${TOKENIZER_MODEL} \
--seq-length 4096 \
--max-position-embeddings 16384 \
--micro-batch-size 2 \
--global-batch-size 1024 \
--make-vocab-size-divisible-by 1 \
--lr 1.0e-7 \
--train-iters 2000 \
--lr-decay-style cosine \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--use-fused-rotary-pos-emb \
--use-rotary-position-embeddings \
--use-fused-swiglu \
--use-mc2 \
--swiglu \
--use-flash-attn \
--group-query-attention \
--num-query-groups 8 \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.0e-8 \
--weight-decay 1e-2 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.999 \
--initial-loss-scale 8188.0 \
--no-gradient-accumulation-fusion \
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
--no-load-rng \
--rotary-base 1000000 \
--vocab-size 32000 \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 2000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--save $CKPT_SAVE_DIR \
| tee logs/train_codellama_34b.log

View File

@ -1,564 +0,0 @@
# Gemma $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
# 目录
- [Gemma-2B](#Gemma-2B)
- [训练](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理](#推理)
- [评估](#评估)
- [Gemma-7B](#Gemma-7B)
- [训练](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理](#推理)
- [评估](#评估)
# Gemma-2B
## 训练
Gemma-2B 训练的硬件配置:
| 硬件 | 配置 |
|:---:| :-------------: |
| NPU | 8 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.2.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 Gemma-2B 的 [预训练权重和词表](https://huggingface.co/google/gemma-2b/tree/main)
```bash
mkdir ./model_from_hf/Gemma-2B/
cd ./model_from_hf/Gemma-2B/
wget https://huggingface.co/google/gemma-2b/resolve/main/config.json
wget https://huggingface.co/google/gemma-2b/resolve/main/generation_config.json
wget https://huggingface.co/google/gemma-2b/resolve/main/model-00001-of-00002.safetensors
wget https://huggingface.co/google/gemma-2b/resolve/main/model-00002-of-00002.safetensors
wget https://huggingface.co/google/gemma-2b/resolve/main/model.safetensors.index.json
wget https://huggingface.co/google/gemma-2b/resolve/main/special_tokens_map.json
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer.json
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer.model
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer_config.json
cd ../../
```
4. 权重转换
将权重从 huggingface 格式转化为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader gemma_hf \
--saver megatron \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/Gemma-2B/ \
--save-dir ./model_weights/Gemma-2B-v0.1-tp1-pp2/ \
--tokenizer-model ./model_from_hf/Gemma-2B/tokenizer.model
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```bash
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_gemma \
--load-dir ./model_weights/Gemma-2B-v0.1-tp1-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Gemma-2B/ # 需要填入原始HF模型路径新权重会存于./model_from_hf/Gemma-2B/mg2hg/
```
5. 准备数据集
下载 Gemma-2B [数据集](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json
cd ..
# 处理数据
mkdir ./dataset/Gemma-2B/
python ./tools/preprocess_data.py \
--input ./dataset/wikipedia-cn-20230720-filtered.json \
--output-prefix ./dataset/Gemma-2B/wikipedia_cn \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-2B/ \
--json-key completion \
--workers 16 \
--log-interval 1000
```
6. 预训练
配置Gemma-2B 预训练脚本: examples/gemma/pretrain_gemma_2b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
CKPT_SAVE_DIR="./ckpt/Gemma-2B/"
TOKENIZER_MODEL="./model_from_hf/Gemma-2B/" #词表路径
DATA_PATH="./dataset/Gemma-2B/wikipedia_cn_completion_document" #数据集路径
CKPT_LOAD_DIR="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
```
启动 Gemma-2B 预训练脚本: examples/gemma/pretrain_gemma_2b_ptd.sh
```shell
bash examples/gemma/pretrain_gemma_2b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
7. 微调
7.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/fnlp/moss-003-sft-data/tree/main)
```bash
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/resolve/main/moss-003-sft-no-tools.jsonl.zip --no-check-certificate
unzip moss-003-sft-no-tools.jsonl.zip
cd ..
# 处理数据集
mkdir ./finetune_dataset/Gemma-2B/
python tools/preprocess_data.py \
--input ./finetune_dataset/moss-003-sft-no-tools.jsonl \
--output-prefix ./finetune_dataset/Gemma-2B/moss \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-2B/ \
--tokenizer-not-use-fast \
--handler-name MOSSInstructionHandler
```
7.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数--is-instruction-dataset*
增加微调参数--finetune使微调从第一步开始。
```bash
CKPT_SAVE_DIR="./ckpt/Gemma-2B/"
DATA_PATH="./finetune_dataset/Gemma-2B/moss"
TOKENIZER_PATH="./model_from_hf/Gemma-2B/"
CKPT_LOAD_DIR="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
### 性能
#### 吞吐
Gemma-2B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | tokens吞吐 (tokens/s/p) |
|:----:|:--------:|:---------------------:|
| NPUs | Gemma-2B | 6821 |
| 参考 | Gemma-2B | 7602 |
## 推理
配置 Gemma-2B 推理脚本examples/gemma/generate_gemma_2b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
TOKENIZER_PATH="./model_from_hf/Gemma-2B/"
```
启动Gemma-2B推理脚本
```bash
bash examples/gemma/generate_gemma_2b_ptd.sh
```
## 评估
使用[MMLU数据集](https://huggingface.co/datasets/cais/mmlu)评估模型.
配置Gemma-2b评估脚本: examples/gemma/evaluate_gemma_2b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
TOKENIZER_PATH="./model_from_hf/Gemma-2B/" #词表路径
CHECKPOINT="./model_weights/Gemma-2B-v0.1-tp1-pp2/" #模型路径
# 配置任务和数据集路径
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
启动评估
```bash
bash examples/gemma/evaluate_gemma_2b_ptd.sh
```
| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
|:---:|:---:|:---:|:-----:|:------:|
| MMLU | 57 | 14042 | 39.7 | 39.4 |
# Gemma-7B
## 训练
Gemma-7B 训练的硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 8 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.2.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 Gemma-7B 的 [预训练权重和词表](https://huggingface.co/google/gemma-7b/tree/main)
```bash
mkdir ./model_from_hf/Gemma-7B/
cd ./model_from_hf/Gemma-7B/
wget https://huggingface.co/google/gemma-7b/resolve/main/config.json
wget https://huggingface.co/google/gemma-7b/resolve/main/generation_config.json
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00001-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00002-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00003-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00004-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model.safetensors.index.json
wget https://huggingface.co/google/gemma-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer.json
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer.model
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. 权重转换
将权重从 huggingface 格式转化为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader gemma_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Gemma-7B/ \
--save-dir ./model_weights/Gemma-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Gemma-7B/tokenizer.model
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```bash
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_gemma \
--load-dir ./model_weights/Gemma-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Gemma-7B/ # 需要填入原始HF模型路径新权重会存于./model_from_hf/Gemma-7B/mg2hg/
```
5. 准备数据集
下载 Gemma-7B [数据集](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json
cd ..
# 处理数据
mkdir ./dataset/Gemma-7B/
python ./tools/preprocess_data.py \
--input ./dataset/wikipedia-cn-20230720-filtered.json \
--output-prefix ./dataset/Gemma-7B/wikipedia_cn \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-7B/ \
--json-key completion \
--workers 16 \
--log-interval 1000
```
6. 预训练
配置Gemma-7B 预训练脚本: examples/gemma/pretrain_gemma_7b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
CKPT_SAVE_DIR="./ckpt/Gemma-7B/"
TOKENIZER_MODEL="./model_from_hf/Gemma-7B/" #词表路径
DATA_PATH="./dataset/Gemma-7B/wikipedia_cn_completion_document" #数据集路径
CKPT_LOAD_DIR="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
```
启动 Gemma-7B 预训练脚本: examples/gemma/pretrain_gemma_7b_ptd.sh
```shell
bash examples/gemma/pretrain_gemma_7b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
7. 微调
7.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/fnlp/moss-003-sft-data/tree/main)
```bash
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/resolve/main/moss-003-sft-no-tools.jsonl.zip --no-check-certificate
unzip moss-003-sft-no-tools.jsonl.zip
cd ..
# 处理数据集
mkdir ./finetune_dataset/Gemma-7B/
python tools/preprocess_data.py \
--input ./finetune_dataset/moss-003-sft-no-tools.jsonl \
--output-prefix ./finetune_dataset/Gemma-7B/moss \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-7B/ \
--tokenizer-not-use-fast \
--handler-name MOSSInstructionHandler
```
7.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数--is-instruction-dataset*
增加微调参数--finetune使微调从第一步开始。
```bash
CKPT_SAVE_DIR="./ckpt/Gemma-7B/"
DATA_PATH="./finetune_dataset/Gemma-7B/moss"
TOKENIZER_PATH="./model_from_hf/Gemma-7B/"
CKPT_LOAD_DIR="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
7.3 Lora微调
Lora微调的脚本配置是在全参微调脚本基础上加上lora参数如下所示:
```bash
--lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \
--lora-r 16 \
--lora-alpha 32 \
```
如果模型的词表变化了,可以加上以下参数(词表不变不建议添加)
```bash
--lora-modules-to-save word_embeddings output_layer \
```
添加下列参数用于从上一个检查点恢复Lora模型继续训练:
```bash
--load ${ORIGIN_CHECKPOINT} \
--lora-load ${LORA_CHECKPOINT} \
```
启动Lora微调脚本: examples/gemma/tune_gemma_7b_ptd.sh
```shell
bash examples/gemma/tune_gemma_7b_ptd.sh
```
### 性能
#### 吞吐
Gemma-7B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | tokens吞吐 (tokens/s/p) |
|:------:|:-------:|:---------------------:|
| NPUs | Gemma-7B | 2938 |
| 参考 | Gemma-7B | 2607 |
## 推理
配置 Gemma-7B 推理脚本examples/gemma/generate_gemma_7b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Gemma-7B/"
```
配置 Gemma-7B lora推理脚本: examples/gemma/generate_gemma_7b_lora_ptd.sh
```bash
# 修改lora权重路径
CHECKPOINT_LORA="your lora model directory path"
```
启动Gemma-7B推理脚本
```bash
bash examples/gemma/generate_gemma_7b_ptd.sh
```
启动Gemma-7B lora推理脚本
```bash
bash examples/gemma/generate_gemma_7b_lora_ptd.sh
```
Lora推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/gemma/gemma-7b-lora-inference.jpg)
## 评估
使用[MMLU数据集](https://huggingface.co/datasets/cais/mmlu)评估模型.
配置Gemma-7B评估脚本: examples/gemma/evaluate_gemma_7b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
TOKENIZER_PATH="./model_from_hf/Gemma-7B/" #词表路径
CHECKPOINT="./model_weights/Gemma-7B-v0.1-tp8-pp1/" #模型路径
# 配置任务和数据集路径
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
启动评估
```bash
bash examples/gemma/evaluate_gemma_7b_ptd.sh
```
| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
|:---:|:---:|:---:|:-----:|:------:|
| MMLU | 57 | 14042 | 52.2 | 52.2 |

View File

@ -1,564 +0,0 @@
# Gemma $\color{black}{\rm\tiny{【model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
# Contents
- [Gemma-2B](#Gemma-2B)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference](#Inference)
- [Evaluation](#Evaluation)
- [Gemma-7B](#Gemma-7B)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference](#Inference)
- [Evaluation](#Evaluation)
# Gemma-2B
## Training
Here's a hardware summary of pre-training Gemma-2B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.2.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights and tokenizer
Download the Gemma-2B checkpoint from [here](https://huggingface.co/google/gemma-2b/tree/main)
```bash
mkdir ./model_from_hf/Gemma-2B/
cd ./model_from_hf/Gemma-2B/
wget https://huggingface.co/google/gemma-2b/resolve/main/config.json
wget https://huggingface.co/google/gemma-2b/resolve/main/generation_config.json
wget https://huggingface.co/google/gemma-2b/resolve/main/model-00001-of-00002.safetensors
wget https://huggingface.co/google/gemma-2b/resolve/main/model-00002-of-00002.safetensors
wget https://huggingface.co/google/gemma-2b/resolve/main/model.safetensors.index.json
wget https://huggingface.co/google/gemma-2b/resolve/main/special_tokens_map.json
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer.json
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer.model
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer_config.json
cd ../../
```
4. Weights convert
Convert weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader gemma_hf \
--saver megatron \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/Gemma-2B/ \
--save-dir ./model_weights/Gemma-2B-v0.1-tp1-pp2/ \
--tokenizer-model ./model_from_hf/Gemma-2B/tokenizer.model
```
Convert Megatron weights of any parallel slicing strategy into HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_gemma \
--load-dir ./model_weights/Gemma-2B-v0.1-tp1-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Gemma-2B/ # Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Gemma-2B/mg2hg/
```
5. Prepare dataset
Download the Gemma-2B datasets from [here](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json
cd ..
# process datasets
mkdir ./dataset/Gemma-2B/
python ./tools/preprocess_data.py \
--input ./dataset/wikipedia-cn-20230720-filtered.json \
--output-prefix ./dataset/Gemma-2B/wikipedia_cn \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-2B/ \
--json-key completion \
--workers 16 \
--log-interval 1000
```
6. pre-training
Config Gemma-2B pre-training script: examples/gemma/pretrain_gemma_2b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Gemma-2B/"
TOKENIZER_MODEL="./model_from_hf/Gemma-2B/" #tokenizer path
DATA_PATH="./dataset/Gemma-2B/wikipedia_cn_completion_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
```
Launch Gemma-2B pre-training script: examples/gemma/pretrain_gemma_2b_ptd.sh
```shell
bash examples/gemma/pretrain_gemma_2b_ptd.sh
```
**Note**: If using multi-machine training without shared storage across the machines, add the parameter `--no-shared-storage`. With this parameter set, each non-master node determines from the distributed parameters whether it needs to load the data, and checks the corresponding cache or generates the data itself.
7. fine-tuning
7.1 Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/fnlp/moss-003-sft-data/tree/main)
```bash
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/resolve/main/moss-003-sft-no-tools.jsonl.zip --no-check-certificate
unzip moss-003-sft-no-tools.jsonl.zip
cd ..
# process datasets
mkdir ./finetune_dataset/Gemma-2B/
python tools/preprocess_data.py \
--input ./finetune_dataset/moss-003-sft-no-tools.jsonl \
--output-prefix ./finetune_dataset/Gemma-2B/moss \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-2B/ \
--tokenizer-not-use-fast \
--handler-name MOSSInstructionHandler
```
7.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_gemma_2b_ptd.sh. *The differences are the dataset and the additional training parameter `--is-instruction-dataset`.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
CKPT_SAVE_DIR="./ckpt/Gemma-2B/"
DATA_PATH="./finetune_dataset/Gemma-2B/moss"
TOKENIZER_PATH="./model_from_hf/Gemma-2B/"
CKPT_LOAD_DIR="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
### Performance
#### Machine performance
The performance of Gemma-2B in **Ascend NPU** and **Reference**:
| Device | Model | throughput rate (tokens/s/p) |
|:---------:|:--------:|:----------------------------:|
| NPUs | Gemma-2B | 6821 |
| Reference | Gemma-2B | 7602 |
## Inference
Config Gemma-2B inference script: examples/gemma/generate_gemma_2b_ptd.sh
```bash
# ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
TOKENIZER_PATH="./model_from_hf/Gemma-2B/"
```
Launch Gemma-2B inference script: examples/gemma/generate_gemma_2b_ptd.sh
```bash
bash examples/gemma/generate_gemma_2b_ptd.sh
```
## Evaluation
We use the [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
Config Gemma-2b evaluation script: examples/gemma/evaluate_gemma_2b_ptd.sh
```bash
# ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model parameter path and vocabulary path
TOKENIZER_PATH="./model_from_hf/Gemma-2B/" # vocabulary path
CHECKPOINT="./model_weights/Gemma-2B-v0.1-tp1-pp2/" # parameter path
# Configure the task type and dataset path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
Launch Gemma-2B evaluation
```bash
bash examples/gemma/evaluate_gemma_2b_ptd.sh
```
| Task | Subset | Question | OpenSource | NPU |
|:---:|:---:|:---:|:----------:|:----:|
| MMLU | 57 | 14042 | 39.7 | 39.4 |
# Gemma-7B
## Training
Here's a hardware summary of pre-training Gemma-7B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.2.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights and tokenizer
Download the Gemma-7B checkpoint from [here](https://huggingface.co/google/gemma-7b/tree/main)
```bash
mkdir ./model_from_hf/Gemma-7B/
cd ./model_from_hf/Gemma-7B/
wget https://huggingface.co/google/gemma-7b/resolve/main/config.json
wget https://huggingface.co/google/gemma-7b/resolve/main/generation_config.json
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00001-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00002-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00003-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00004-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model.safetensors.index.json
wget https://huggingface.co/google/gemma-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer.json
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer.model
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. Weights convert
Convert weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader gemma_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Gemma-7B/ \
--save-dir ./model_weights/Gemma-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Gemma-7B/tokenizer.model
```
Convert Megatron weights of any parallel slicing strategy into HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_gemma \
--load-dir ./model_weights/Gemma-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Gemma-7B/ # Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Gemma-7B/mg2hg/
```
5. Prepare dataset
Download the Gemma-7B datasets from [here](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json
cd ..
# process datasets
mkdir ./dataset/Gemma-7B/
python ./tools/preprocess_data.py \
--input ./dataset/wikipedia-cn-20230720-filtered.json \
--output-prefix ./dataset/Gemma-7B/wikipedia_cn \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-7B/ \
--json-key completion \
--workers 16 \
--log-interval 1000
```
6. pre-training
Config Gemma-7B pre-training script: examples/gemma/pretrain_gemma_7b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Gemma-7B/"
TOKENIZER_MODEL="./model_from_hf/Gemma-7B/" #tokenizer path
DATA_PATH="./dataset/Gemma-7B/wikipedia_cn_completion_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
```
Launch Gemma-7B pre-training script: examples/gemma/pretrain_gemma_7b_ptd.sh
```shell
bash examples/gemma/pretrain_gemma_7b_ptd.sh
```
**Note**: If using multi-machine training without shared storage across the machines, add the parameter `--no-shared-storage`. With this parameter set, each non-master node determines from the distributed parameters whether it needs to load the data, and checks the corresponding cache or generates the data itself.
7. fine-tuning
7.1 Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/fnlp/moss-003-sft-data/tree/main)
```bash
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/resolve/main/moss-003-sft-no-tools.jsonl.zip --no-check-certificate
unzip moss-003-sft-no-tools.jsonl.zip
cd ..
# process datasets
mkdir ./finetune_dataset/Gemma-7B/
python tools/preprocess_data.py \
--input ./finetune_dataset/moss-003-sft-no-tools.jsonl \
--output-prefix ./finetune_dataset/Gemma-7B/moss \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-7B/ \
--tokenizer-not-use-fast \
--handler-name MOSSInstructionHandler
```
7.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_gemma_7b_ptd.sh. *The differences are the dataset and the additional training parameter `--is-instruction-dataset`.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
CKPT_SAVE_DIR="./ckpt/Gemma-7B/"
DATA_PATH="./finetune_dataset/Gemma-7B/moss"
TOKENIZER_PATH="./model_from_hf/Gemma-7B/"
CKPT_LOAD_DIR="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
7.3 Lora Fine-Tuning
The Lora fine-tuning script is configured by adding the following lora parameters to the pretrain_gemma_7b_ptd.sh script:
```bash
--lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \
--lora-r 16 \
--lora-alpha 32 \
```
If the model's vocabulary has been changed, add the following parameters (not recommended if the vocabulary is unchanged):
```bash
--lora-modules-to-save word_embeddings output_layer \
```
Add the following parameters to resume Lora training from a previous checkpoint:
```bash
--load ${ORIGIN_CHECKPOINT} \
--lora-load ${LORA_CHECKPOINT} \
```
Launch Gemma-7B lora fine-tuning script: examples/gemma/tune_gemma_7b_ptd.sh
```shell
bash examples/gemma/tune_gemma_7b_ptd.sh
```
### Performance
#### Machine performance
The performance of Gemma-7B in **Ascend NPU** and **Reference**:
| Device | Model | throughput rate (tokens/s/p) |
|:---------:|:-------:|:----------------------------:|
| NPUs | Gemma-7B | 2938 |
| Reference | Gemma-7B | 2607 |
## Inference
Config Gemma-7B inference script: examples/gemma/generate_gemma_7b_ptd.sh
```bash
# ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Gemma-7B/"
```
Config Gemma-7B lora inference script: examples/gemma/generate_gemma_7b_lora_ptd.sh
```bash
# modify lora model path
CHECKPOINT_LORA="your lora model directory path"
```
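If the Lora weights were produced by the fine-tuning step above, the path would typically be the CKPT_SAVE_DIR used there; an assumed example:
```bash
# assumes the Lora fine-tuning run saved its checkpoints to ./ckpt/Gemma-7B/
CHECKPOINT_LORA="./ckpt/Gemma-7B/"
```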
Launch Gemma-7B inference script: examples/gemma/generate_gemma_7b_ptd.sh
```bash
bash examples/gemma/generate_gemma_7b_ptd.sh
```
Launch Gemma-7B lora inference script: examples/gemma/generate_gemma_7b_lora_ptd.sh
```bash
bash examples/gemma/generate_gemma_7b_lora_ptd.sh
```
Some lora inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/gemma/gemma-7b-lora-inference.jpg)
## Evaluation
We use the [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
Config Gemma-7B evaluation script: examples/gemma/evaluate_gemma_7b_ptd.sh
```bash
# ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model parameter path and vocabulary path
TOKENIZER_PATH="./model_from_hf/Gemma-7B/" # vocabulary path
CHECKPOINT="./model_weights/Gemma-7B-v0.1-tp8-pp1/" # parameter path
# Configure the task type and dataset path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
Launch Gemma-7B evaluation
```bash
bash examples/gemma/evaluate_gemma_7b_ptd.sh
```
| Task | Subset | Question | OpenSource | NPU |
|:---:|:---:|:---:|:----------:|:----:|
| MMLU | 57 | 14042 | 52.2 | 52.2 |
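The single number reported above summarises all 14042 test questions across the 57 subsets. A sketch of one common way to aggregate (pooling questions across subsets), with hypothetical per-subset counts:
```python
# Hypothetical per-subset results: subset -> (num_correct, num_questions).
results = {
    "abstract_algebra": (31, 100),
    "anatomy": (74, 135),
    # ... one entry per MMLU subset, 57 in total
}

total_correct = sum(correct for correct, _ in results.values())
total_questions = sum(total for _, total in results.values())
print(f"overall accuracy: {100.0 * total_correct / total_questions:.1f}%")
```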

View File

@ -1,59 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
# distributed config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=2
# modify script model path and tokenizer path
TOKENIZER_PATH="your tokenizer directory path"
CHECKPOINT="your model directory path"
# configure task and data path
DATA_PATH="/../mmlu/test/"
TASK="mmlu"
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--load ${CHECKPOINT} \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--num-layers 18 \
--hidden-size 2048 \
--ffn-hidden-size 16384 \
--num-attention-heads 8 \
--group-query-attention \
--num-query-groups 1 \
--kv-channels 256 \
--max-position-embeddings 8192 \
--seq-length 8192 \
--max-new-tokens 1 \
--geglu \
--position-embedding-type rope \
--disable-bias-linear \
--normalization RMSNorm \
--add-rmsnorm-offset \
--input-embeds-norm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--norm-epsilon 1e-06 \
--evaluation-batch-size 1 \
--micro-batch-size 1 \
--no-masked-softmax-fusion \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--vocab-size 256000 \
--make-vocab-size-divisible-by 1 \
--bf16 \
--seed 42 | tee logs/evaluation_gemma_2b_${TASK}.log

View File

@ -1,58 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
# distributed config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
# modify script model path and tokenizer path
TOKENIZER_PATH="your tokenizer directory path"
CHECKPOINT="your model directory path"
# configure task and data path
DATA_PATH="/../mmlu/test/"
TASK="mmlu"
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--load ${CHECKPOINT} \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 28 \
--hidden-size 3072 \
--ffn-hidden-size 24576 \
--num-attention-heads 16 \
--kv-channels 256 \
--max-position-embeddings 8192 \
--seq-length 8192 \
--max-new-tokens 1 \
--geglu \
--position-embedding-type rope \
--disable-bias-linear \
--normalization RMSNorm \
--add-rmsnorm-offset \
--input-embeds-norm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--norm-epsilon 1e-06 \
--evaluation-batch-size 1 \
--micro-batch-size 1 \
--use-fused-rmsnorm \
--no-masked-softmax-fusion \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--vocab-size 256000 \
--make-vocab-size-divisible-by 1 \
--bf16 \
--seed 42 | tee logs/evaluation_gemma_7b_${TASK}.log

View File

@ -1,56 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export WITHOUT_JIT_COMPILE=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=2
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--load ${CHECKPOINT} \
--num-layers 18 \
--hidden-size 2048 \
--kv-channels 256 \
--group-query-attention \
--num-query-groups 1 \
--ffn-hidden-size 16384 \
--num-attention-heads 8 \
--position-embedding-type rope \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--geglu \
--input-embeds-norm \
--micro-batch-size 1 \
--norm-epsilon 1e-06 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--normalization RMSNorm \
--add-rmsnorm-offset \
--disable-bias-linear \
--hidden-dropout 0 \
--attention-dropout 0 \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 1 \
--vocab-size 256000 \
--bf16 \
--seed 42 \
| tee logs/generate_gemma_2b.log

View File

@ -1,63 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CHECKPOINT="your model directory path"
CHECKPOINT_LORA="your lora model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--load ${CHECKPOINT} \
--num-layers 28 \
--hidden-size 3072 \
--kv-channels 256 \
--ffn-hidden-size 24576 \
--num-attention-heads 16 \
--position-embedding-type rope \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--geglu \
--input-embeds-norm \
--micro-batch-size 1 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--normalization RMSNorm \
--add-rmsnorm-offset \
--norm-epsilon 1e-06 \
--disable-bias-linear \
--hidden-dropout 0 \
--attention-dropout 0 \
--attention-softmax-in-fp32 \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--no-gradient-accumulation-fusion \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 1 \
--vocab-size 256000 \
--bf16 \
--seed 42 \
--lora-load ${CHECKPOINT_LORA} \
--lora-r 16 \
--lora-alpha 32 \
--lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \
--inference-prompt-type 'alpaca' \
| tee logs/generate_gemma_7b.log

View File

@ -1,57 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--load ${CHECKPOINT} \
--num-layers 28 \
--hidden-size 3072 \
--kv-channels 256 \
--ffn-hidden-size 24576 \
--num-attention-heads 16 \
--position-embedding-type rope \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--geglu \
--input-embeds-norm \
--micro-batch-size 1 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--normalization RMSNorm \
--add-rmsnorm-offset \
--norm-epsilon 1e-06 \
--disable-bias-linear \
--hidden-dropout 0 \
--attention-dropout 0 \
--attention-softmax-in-fp32 \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--no-gradient-accumulation-fusion \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 1 \
--vocab-size 256000 \
--bf16 \
--seed 42 \
| tee logs/generate_gemma_7b.log

View File

@ -1,95 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=1
PP=2
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--use-mc2 \
--use-fused-rmsnorm \
--num-layers 18 \
--hidden-size 2048 \
--ffn-hidden-size 16384 \
--num-attention-heads 8 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--seq-length 8192 \
--max-position-embeddings 8192 \
--micro-batch-size 1 \
--global-batch-size 256 \
--kv-channels 256 \
--group-query-attention \
--num-query-groups 1 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--normalization RMSNorm \
--add-rmsnorm-offset \
--geglu \
--input-embeds-norm \
--use-flash-attn \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--use-distributed-optimizer \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 1000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_gemma_2b.log

View File

@ -1,95 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--use-mc2 \
--use-fused-rmsnorm \
--use-fused-rotary-pos-emb \
--num-layers 28 \
--hidden-size 3072 \
--ffn-hidden-size 24576 \
--num-attention-heads 16 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--seq-length 8192 \
--max-position-embeddings 8192 \
--micro-batch-size 2 \
--global-batch-size 64 \
--kv-channels 256 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--normalization RMSNorm \
--add-rmsnorm-offset \
--norm-epsilon 1e-06 \
--geglu \
--input-embeds-norm \
--use-flash-attn \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--vocab-size 256000 \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 1000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_gemma_7b.log

View File

@ -1,103 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
LORA_CHECKPOINT="your lora ckpt path"
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--use-mc2 \
--use-fused-rmsnorm \
--use-fused-rotary-pos-emb \
--num-layers 28 \
--hidden-size 3072 \
--ffn-hidden-size 24576 \
--num-attention-heads 16 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--seq-length 8192 \
--max-position-embeddings 8192 \
--micro-batch-size 2 \
--global-batch-size 64 \
--kv-channels 256 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--normalization RMSNorm \
--add-rmsnorm-offset \
--norm-epsilon 1e-06 \
--geglu \
--input-embeds-norm \
--use-flash-attn \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--vocab-size 256000 \
--finetune \
--is-instruction-dataset \
--lora-r 16 \
--lora-alpha 32 \
--lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 1000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--lora-load ${LORA_CHECKPOINT} \
--save ${CKPT_SAVE_DIR} \
| tee logs/tune_gemma_7b.log

View File

@ -1,88 +0,0 @@
#!/bin/bash
# Runs the "175B" parameter model in deminishing layers for single machine
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export AZUREML_EXPERIMENT_ID=0
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))
DATA_PATH="your dataset path"
VOCAB_FILE="vocab file for gpt"
MERGE_FILE="merge file for gpt"
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NUM_NODES \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
--node_rank $NODE_RANK
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--num-layers 8 \
--hidden-size 12288 \
--num-attention-heads 96 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--transformer-impl local \
--micro-batch-size 1 \
--global-batch-size 64 \
--train-iters 2000 \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--init-method-std 0.006 \
--clip-grad 1.0 \
--fp16 \
--lr 6.0e-5 \
--lr-decay-style cosine \
--min-lr 6.0e-6 \
--lr-warmup-fraction .001 \
--lr-decay-iters 430000 \
--no-load-optim \
--no-load-rng \
--no-gradient-accumulation-fusion \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--use-flash-attn \
--no-bias-gelu-fusion \
--use-mc2
"
DATA_ARGS="
--data-path $DATA_PATH
--vocab-file $VOCAB_FILE
--merge-file $MERGE_FILE
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 1
--eval-interval 5000
--eval-iters 1
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--jit-compile \
    --distributed-backend nccl 2>&1 \
| tee ./logs/pretrain_gpt3_175B_8layers.log

View File

@ -1,89 +0,0 @@
#!/bin/bash
# Runs the "175B" parameter model in full layers.
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export AZUREML_EXPERIMENT_ID=0
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NUM_NODES=16
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))
DATA_PATH="your dataset path"
VOCAB_FILE="vocab file for gpt training"
MERGE_FILE="merge file for gpt training"
TP=8
PP=16
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NUM_NODES \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
--node_rank $NODE_RANK
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--num-layers-per-virtual-pipeline-stage 2 \
--sequence-parallel \
--num-layers 96 \
--hidden-size 12288 \
--num-attention-heads 96 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--transformer-impl local \
--micro-batch-size 2 \
--global-batch-size 1024 \
--train-iters 2000 \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--init-method-std 0.006 \
--clip-grad 1.0 \
--fp16 \
--lr 6.0e-5 \
--lr-decay-style cosine \
--min-lr 6.0e-6 \
--lr-warmup-fraction .001 \
--lr-decay-iters 430000 \
--no-load-optim \
--no-load-rng \
--no-gradient-accumulation-fusion \
--no-masked-softmax-fusion \
--no-bias-gelu-fusion \
--attention-softmax-in-fp32 \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--use-flash-attn \
--overlap-grad-reduce \
--use-mc2
"
DATA_ARGS="
--data-path $DATA_PATH
--vocab-file $VOCAB_FILE
--merge-file $MERGE_FILE
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 1
--eval-interval 5000
--eval-iters 1
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--jit-compile \
    --distributed-backend nccl 2>&1 \
| tee ./logs/pretrain_gpt3_175B.log

View File

@ -1,136 +0,0 @@
# GPT3 $\color{black}{\bf\tiny{【社区贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README.md">English</a> </b>
</p>
# 目录
- [GPT3](#GPT3)
- [目录](#目录)
- [GPT3-175B](#GPT3-175B)
- [训练-175B](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
# GPT3-175B
## 训练
GPT3-175B 训练的硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 128 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir vocab_file
mkdir dataset
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装 MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其他依赖
pip install -r requirements.txt
```
3. 准备数据、词表来拉起模型
3.1 准备数据
可以从 [这里](https://huggingface.co/datasets/wikipedia/tree/main/data/20220301.en) 下载原始数据
```shell
# 下载 enwiki 数据
# 总共有 41 个文件,我们可以选择部分来制作数据
cd ./dataset
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00000-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00001-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00002-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00003-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00004-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00005-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00006-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00007-of-00041.parquet
cd ..
# 下载 vocab file 和 merge table
cd vocab_file
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cd ..
# 处理成训练数据
python ./tools/preprocess_data.py \
--input ./dataset/ \
--output-prefix ./dataset/gpt_text_sentence \
--tokenizer-type GPT2BPETokenizer \
--vocab-file ./vocab_file/gpt2-vocab.json \
--merge-file ./vocab_file/gpt2-merges.txt \
--append-eod \
--workers 4 \
--log-interval 1000
```
3.2 用 ptd 模式进行预训练
配置 GPT3-175B PTD 预训练脚本: examples/gpt3/pretrain_gpt3_175B.sh
```shell
# 请根据真实情况配置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 请根据真实存放路径配置以下参数
VOCAB_FILE="./vocab_file/gpt2-vocab.json" # 词表
MERGE_FILE="./vocab_file/gpt2-merges.txt" # BPE 合并表
DATA_PATH="./dataset/gpt_text_sentence" # 数据路径
```
拉起 GPT3-175B PTD 预训练脚本: examples/gpt3/pretrain_gpt3_175B.sh
```shell
bash examples/gpt3/pretrain_gpt3_175B.sh
```
### 性能
#### 吞吐
GPT3-175B 在 **昇腾芯片**上的性能数据:
| 设备 | 模型 | tokens吞吐 (tokens/s/p) |
| :--: | :--------: |:---------------------:|
| NPUs | GPT3-175B | 153.1 |

View File

@ -1,136 +0,0 @@
# GPT3 $\color{black}{\rm\tiny{【model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Community】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
# Contents
- [GPT3](#GPT3)
- [Contents](#contents)
- [GPT3-175B](#GPT3-175B)
- [Training-175B](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
# GPT3-175B
## Training
Here is a hardware summary of pre-training GPT3-175B:
| Hardware | Value |
| :--: | :-------------: |
| NPU | 128 x Ascend NPUs |
### Script
1. Clone repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir vocab_file
mkdir dataset
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare dataset and vocab file for pretrain
3.1 Prepare dataset
Download the GPT raw dataset from [here](https://huggingface.co/datasets/wikipedia/tree/main/data/20220301.en)
```shell
# download enwiki raw data
# There are 41 files in total; a subset of them is enough to build the dataset.
cd ./dataset
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00000-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00001-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00002-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00003-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00004-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00005-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00006-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00007-of-00041.parquet
cd ..
# download vocab file and merge table
cd vocab_file
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cd ..
# process formal dataset
python ./tools/preprocess_data.py \
--input ./dataset/ \
--output-prefix ./dataset/gpt_text_sentence \
--tokenizer-type GPT2BPETokenizer \
--vocab-file ./vocab_file/gpt2-vocab.json \
--merge-file ./vocab_file/gpt2-merges.txt \
--append-eod \
--workers 4 \
--log-interval 1000
```
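preprocess_data.py writes an indexed binary dataset under the chosen `--output-prefix`. A quick, illustrative check that the expected files appeared before wiring DATA_PATH into the training script (the exact file names also encode the json key and document level):
```python
import glob

# The prefix below matches --output-prefix in the preprocessing command above;
# expect a .bin/.idx pair among the results.
produced = sorted(glob.glob("./dataset/gpt_text_sentence*"))
print(produced)
```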
3.2 pre-training in ptd mode
Config GPT3-175B PTD pre-training script: examples/gpt3/pretrain_gpt3_175B.sh
```shell
# modify ascend-toolkit path according to your own config
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
VOCAB_FILE="./vocab_file/gpt2-vocab.json" # vocab file for training
MERGE_FILE="./vocab_file/gpt2-merges.txt" # BPE merge file for training
DATA_PATH="./dataset/gpt_text_sentence" # dataset path
```
Launch GPT3-175B PTD pre-training script: examples/gpt3/pretrain_gpt3_175B.sh
```shell
bash examples/gpt3/pretrain_gpt3_175B.sh
```
### Performance
#### Machine performance
The performance of GPT3-175B on **Ascend NPU**:
| Device | Model | Throughput (tokens/s/p) |
| :--: | :--------: |:---------------------:|
| NPUs | GPT3-175B | 153.1 |
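As a sanity check on the model size, here is a back-of-the-envelope parameter count from the script's num-layers 96 and hidden-size 12288, assuming the ~50k GPT-2 BPE vocabulary used above:
```python
def approx_gpt_params(num_layers, hidden_size, vocab_size, seq_length):
    # ~12 * L * h^2 covers attention (4h^2) and MLP (8h^2) weights per layer,
    # ignoring biases and layer norms; embeddings add (V + S) * h.
    transformer = 12 * num_layers * hidden_size ** 2
    embeddings = (vocab_size + seq_length) * hidden_size
    return transformer + embeddings

print(f"{approx_gpt_params(96, 12288, 50257, 2048) / 1e9:.1f}B")  # ~174.6B, i.e. the "175B" model
```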

View File

@ -1,4 +1,4 @@
# Intern-LM
# Intern-LM $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
@ -35,141 +35,145 @@ InternLM-7B 训练的硬件配置如下:
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 Internlm-7B [词表文件](https://huggingface.co/internlm/internlm-7b/tree/main)
```shell
mkdir ./model_from_hf/internlm-7b/
cd ./model_from_hf/internlm-7b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/internlm-7b/
cd ./model_from_hf/internlm-7b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. 下载 Internlm-7B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-7b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-7b/ \
--output-prefix ./dataset/internlm-7b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-7b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-7b/ \
--output-prefix ./dataset/internlm-7b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
5. 权重格式转换
将模型权重从 huggingface 格式转换为 ModelLink 可以处理的格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
将模型权重从 huggingface 格式转换为 ModelLink 可以处理的格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/internlm-7b/ \
--save-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/internlm-7b/tokenizer.model \
--add-qkv-bias \
--add-dense-bias
```
```shell
mkdir model_weights
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/internlm-7b/ \
--save-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/internlm-7b/tokenizer.model \
--add-qkv-bias \
--add-dense-bias
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--add-dense-bias \
--save-dir ./model_from_hf/internlm-7b/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/internlm-7b/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--add-dense-bias \
--save-dir ./model_from_hf/internlm-7b/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/internlm-7b/mg2hg/
```
6. 配置 Internlm-7B 预训练脚本
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改数据集,词表,权重等路径
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
```
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改数据集,词表,权重等路径
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
```
7. 启动 Internlm-7B 预训练脚本
```shell
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
```shell
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
### 性能
@ -184,25 +188,23 @@ Internlm-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
#### 推理
推理脚本: examples/intern/generate_internlm_7b_ptd.sh
推理脚本:
tasks/inference/generate_lnternlm_7b_ptd.sh
```
bash ./examples/intern/generate_internlm_7b_ptd.sh
bash ./tasks/inference/generate_lnternlm_7b_ptd.sh
```
推理举例:
![Internlm-7b-inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/intern/intern_7B_inference.png)
![Internlm-7b-inference](../../sources/images/intern/intern_7B_inference.png)
#### 评估
使用MMLU数据集评估模型。数据集[下载](https://huggingface.co/datasets/cais/mmlu)
评估脚本:
examples/intern/evaluate_internlm_7B_ptd.sh
评估脚本:
tasks/evaluation/evaluate_internlm_7B_ptd.sh
```
bash examples/intern/evaluate_internlm_7B_ptd.sh
bash tasks/evaluation/evaluate_internlm_7B_ptd.sh
```
InternLM-7B在**Ascend NPU**中的评测表现:
@ -226,102 +228,104 @@ InternLM-65B 训练的硬件配置如下:
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 [词表文件](https://huggingface.co/internlm/internlm-7b/tree/main)
```shell
mkdir ./model_from_hf/internlm-65b/
cd ./model_from_hf/internlm-65b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/internlm-65b/
cd ./model_from_hf/internlm-65b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. 下载 Internlm-65B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-65b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-65b/ \
--output-prefix ./dataset/internlm-65b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-65b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-65b/ \
--output-prefix ./dataset/internlm-65b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
5. 配置 Internlm-65B 预训练脚本
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改数据集,词表,权重等路径
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_PATH="./model_from_hf/internlm-65b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
```
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改数据集,词表,权重等路径
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
```
6. 启动 Internlm-65B 预训练脚本
```shell
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
```shell
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
### 性能

View File

@ -1,4 +1,4 @@
# Intern-LM
# Intern-LM $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
@ -36,141 +36,144 @@ Here's a hardware summary of pre-training InternLM-7B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Download the Internlm-7B tokenizer model and file from [here](https://huggingface.co/internlm/internlm-7b/tree/main)
```shell
mkdir ./model_from_hf/internlm-7b/
cd ./model_from_hf/internlm-7b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/internlm-7b/
cd ./model_from_hf/internlm-7b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. Prepare dataset. Download the Internlm-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-7b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-7b/ \
--output-prefix ./dataset/internlm-7b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-7b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-7b/ \
--output-prefix ./dataset/internlm-7b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
5. Weights convert
In order to adapt to the internlm-7B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
In order to adapt to the internlm-7B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/internlm-7b/ \
--save-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/internlm-7b/tokenizer.model \
--add-qkv-bias \
--add-dense-bias
```
```shell
mkdir model_weights
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/internlm-7b/ \
--save-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/internlm-7b/tokenizer.model \
--add-qkv-bias \
--add-dense-bias
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--add-dense-bias \
--save-dir ./model_from_hf/internlm-7b/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/internlm-7b/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--add-dense-bias \
--save-dir ./model_from_hf/internlm-7b/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/internlm-7b/mg2hg/
```
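After the conversion back to HuggingFace format, the weights written to `./model_from_hf/internlm-7b/mg2hg/` can be sanity-checked by loading them with `transformers`. This is an illustrative sketch, assuming `transformers` is installed; `trust_remote_code=True` is passed because the original InternLM repository ships custom modelling code.
```python
from transformers import AutoModelForCausalLM

# Converted weights land in the mg2hg/ sub-directory, per the comment above.
model = AutoModelForCausalLM.from_pretrained(
    "./model_from_hf/internlm-7b/mg2hg/", trust_remote_code=True
)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")
```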
6. Config Internlm-7B pre-training script.
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer and checkpoint paths below according to your own setup
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
```
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer and checkpoint paths below according to your own setup
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
```
7. Launch Internlm-7B pre-training script.
```shell
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes use the distributed parameters to decide whether they need to load data, and check and generate the corresponding data cache.
```shell
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes use the distributed parameters to decide whether they need to load data, and check and generate the corresponding data cache.
### Performance
@ -185,25 +188,23 @@ The performance of Internlm-7B in **Ascend NPU** and **Reference**:
#### Inference
Inference script:
examples/intern/generate_lnternlm_7b_ptd.sh
Inference script:
tasks/inference/generate_lnternlm_7b_ptd.sh
```
bash ./examples/intern/generate_lnternlm_7b_ptd.sh
bash ./tasks/inference/generate_lnternlm_7b_ptd.sh
```
Inference case:
![Internlm-7b-inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/intern/intern_7B_inference.png)
![Internlm-7b-inference](../../sources/images/intern/intern_7B_inference.png)
#### Evaluation
Evaluate the model using the MMLU dataset ([download](https://huggingface.co/datasets/cais/mmlu)).
Evaluation script: examples/intern/evaluate_internlm_7B_ptd.sh
Evaluation script:
tasks/evaluation/evaluate_internlm_7B_ptd.sh
```
bash examples/intern/evaluate_internlm_7B_ptd.sh
bash tasks/evaluation/evaluate_internlm_7B_ptd.sh
```
The evaluation performance of InternLM-7B on **Ascend NPU**:
@ -226,102 +227,105 @@ Here's a hardware summary of pre-training InternLM-65B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Download tokenizer model and file from [here](https://huggingface.co/internlm/internlm-7b/tree/main)
```shell
mkdir ./model_from_hf/internlm-65b/
cd ./model_from_hf/internlm-65b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/internlm-65b/
cd ./model_from_hf/internlm-65b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. Prepare dataset. Download the Internlm-65B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-65b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-65b/ \
--output-prefix ./dataset/internlm-65b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-65b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-65b/ \
--output-prefix ./dataset/internlm-65b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
5. Config Internlm-65B pre-training script.
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer and checkpoint paths below according to your own setup
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_PATH="./model_from_hf/internlm-65b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
```
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer and checkpoint paths below according to your own setup
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
```
6. Launch Internlm-65B pre-training script.
```shell
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes use the distributed parameters to decide whether they need to load data, and check and generate the corresponding data cache.
```shell
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes use the distributed parameters to decide whether they need to load data, and check and generate the corresponding data cache.
### Performance

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -89,7 +90,6 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_internlm_65B.log

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -90,6 +91,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_internlm_7b.log

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@ -86,6 +86,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save ${SAVE_CHECKPOINT_PATH} \
| tee logs/train_llama_13b.log

View File

@ -87,6 +87,5 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_llama_33b.log

View File

@ -87,7 +87,6 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save ${SAVE_CHECKPOINT_PATH} \
| tee logs/train_llama_65b.log

View File

@ -86,6 +86,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save ${SAVE_CHECKPOINT_PATH} \
| tee logs/train_llama_7b.log

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@ -1,5 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -35,8 +37,8 @@ GPT_ARGS="
--tokenizer-model ${TOKENIZER_MODEL} \
--seq-length 4096 \
--max-position-embeddings 4096 \
--micro-batch-size 4 \
--global-batch-size 512 \
--micro-batch-size 2 \
--global-batch-size 16 \
--make-vocab-size-divisible-by 1 \
--lr 1e-6 \
--train-iters 5000 \
@ -64,9 +66,6 @@ GPT_ARGS="
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
--no-load-rng \
--use-fused-swiglu \
--use-fused-rotary-pos-emb \
--use-mc2 \
--bf16
"
@ -87,6 +86,5 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_llama2_13b.log

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -67,10 +68,7 @@ GPT_ARGS="
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
--no-load-rng \
--bf16 \
--use-fused-swiglu \
--use-fused-rotary-pos-emb \
--use-mc2 \
--bf16
"
DATA_ARGS="

View File

@ -1,10 +1,11 @@
#!/bin/bash
export NPU_ASD_ENABLE=0
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=4
NNODES=8
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
@ -13,7 +14,7 @@ DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=4
PP=8
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
@ -35,7 +36,7 @@ GPT_ARGS="
--tokenizer-model ${TOKENIZER_MODEL} \
--seq-length 4096 \
--max-position-embeddings 4096 \
--micro-batch-size 1 \
--micro-batch-size 2 \
--global-batch-size 1024 \
--make-vocab-size-divisible-by 1 \
--lr 1.0e-6 \
@ -67,11 +68,7 @@ GPT_ARGS="
--group-query-attention \
--num-query-groups 8 \
--lr-warmup-fraction 0.01 \
--bf16 \
--use-fused-swiglu \
--use-fused-rotary-pos-emb \
--num-layers-per-virtual-pipeline-stage 5 \
--use-mc2 \
--bf16
"
DATA_ARGS="

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -13,8 +14,8 @@ CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=1
PP=2
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
@ -36,8 +37,8 @@ GPT_ARGS="
--tokenizer-model ${TOKENIZER_MODEL} \
--seq-length 4096 \
--max-position-embeddings 4096 \
--micro-batch-size 1 \
--global-batch-size 256 \
--micro-batch-size 4 \
--global-batch-size 16 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 5000 \
@ -64,23 +65,19 @@ GPT_ARGS="
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--use-distributed-optimizer \
--use-fused-swiglu \
--use-fused-rotary-pos-emb \
--overlap-grad-reduce \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 949,50,1
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
@ -88,7 +85,6 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--load $CKPT_LOAD_DIR \
--save $CKPT_SAVE_DIR \
| tee logs/train_llama2_7b.log

View File

@ -1,577 +0,0 @@
# LLaMA3 $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
# 目录
- [LLaMA3](#llama3)
- [目录](#目录)
- [LLAMA3-8B](#llama3-8b)
- [训练](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理-8B](#推理-8b)
- [评估-8B](#评估-8b)
- [LLAMA3-70B](#llama3-70b)
- [训练](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理-70B](#推理-70b)
- [评估-70B](#评估-70b)
# LLAMA3-8B
## 训练
LLAMA3-8B 训练的硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 8 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 LLAMA3-8B 的 [预训练权重和词表](https://huggingface.co/unsloth/llama-3-8B/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/llama-3-8b-hf/
cd ./model_from_hf/llama-3-8b-hf/
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/config.json
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/generation_config.json
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/model-00001-of-00004.safetensors
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/model-00002-of-00004.safetensors
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/model-00003-of-00004.safetensors
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/model-00004-of-00004.safetensors
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/model.safetensors.index.json
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/special_tokens_map.json
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/tokenizer.json
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/tokenizer_config.json
cd ../../
```
4. 权重转换
4.1 将权重从 huggingface 格式转化为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```bash
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 权重格式转换
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/llama-3-8b-hf/ \
--save-dir ./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/llama-3-8b-hf/tokenizer.json
```
4.2 任意并行切分策略的 Megatron 权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/llama-3-8b-hf/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/llama-3-8b-hf/mg2hg/
```
权重转换适用于预训练、微调、推理和评估,根据任务不同调整参数 `target-tensor-parallel-size``target-pipeline-parallel-size`
5. 预训练
5.1 准备数据集
下载 LLaMA3-8B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/blob/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/llama-3-8b-hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-8b-hf/ \
--output-prefix ./dataset/llama-3-8b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 预训练
配置llama3-8B 预训练脚本: examples/llama3/pretrain_llama3_8b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
CKPT_SAVE_DIR="./ckpt/llama-3-8b-hf/"
TOKENIZER_MODEL="./model_from_hf/llama-3-8b-hf/" #词表路径
DATA_PATH="./dataset/llama-3-8b-hf/alpaca_text_document" #数据集路径
CKPT_LOAD_DIR="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/" #权重路径
```
多机运行增加参数--overlap-grad-reduce
启动 LLaMA3-8B 预训练脚本: examples/llama3/pretrain_llama3_8b_ptd.sh
```shell
bash examples/llama3/pretrain_llama3_8b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/llama-3-8b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-8b-hf/ \
--output-prefix ./finetune_dataset/llama-3-8b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数--is-instruction-dataset*
增加微调参数--finetune增加预训练权重加载参数--load使微调从第一步开始。更改为以下参数
```bash
DATA_PATH="./finetune_dataset/llama-3-8b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/"
CKPT_PATH="./ckpt/llama-3-8b-hf/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
### 性能
#### 吞吐
LLaMA3-8B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数 | tokens吞吐 (tokens/s/p) |
| :--: | :-------: | :----: | :---------------------: |
| NPUs | LLaMA3-8B | 1000 | 2483 |
| 参考 | LLaMA3-8B | 1000 | 2674 |
## 推理-8B
配置llama3-8B 推理脚本: examples/llama3/generate_llama3_8b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1"
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/"
```
启动llama3-8B 推理脚本
```bash
bash examples/llama3/generate_llama3_8b_ptd.sh
```
**如果想要使用大模型聊天功能请下载对应的Instruct/Chat权重并转换为Megatron格式配置相关路径
启动聊天脚本: examples/llama3/generate_llama3_8b_chat_ptd.sh**
LLAMA3-8B-Instruct 权重[下载](https://huggingface.co/unsloth/llama-3-8b-Instruct)
聊天示例如下:
```
You >> 你好
ModelLink:
你好!我是你的中文语言模型,很高兴和你交流!有什么想聊的题目或问题,请随时问我!
You >> 介绍一下llama3
ModelLink:
LLaMA 3 is a large-scale language model developed by Meta AI, a leading artificial intelligence research organization. It's a significant upgrade to the previous LLaMA models, offering improved performance and capabilities.
Here are some key features and facts about LLaMA 3:
Scale: LLaMA 3 is a massive model, consisting of 13.5 billion parameters, making it one of the largest language models in the world.
Architecture: It's based on the Transformer architecture, which is a type of neural network designed specifically for natural language processing tasks.
Training data: LLaMA 3 was trained on a massive dataset of text from the internet, including books, articles, and websites. This training data allows the model to learn about various topics, styles, and language patterns.
Language understanding: LLaMA 3 is capable of understanding natural language, including nuances, idioms, and figurative language. It can also recognize and generate text in multiple languages.
Conversational capabilities: The model can engage in conversations, answering questions, and responding to prompts. It can also generate text based on a given topic or theme.
Creative writing: LLaMA
You >> 能用中文回答吗
ModelLink:
LLaMA 3 是 Meta AI 开发的一种大规模语言模型,是之前 LLama 模型的升级版本,提供了改进的性能和能力。
以下是 LLama 3 的一些关键特点和事实:
规模LLaMA 3 是一个庞大的模型,具有 13.5 亿参数,世界上最大的语言模型之一。
架构:它基于 Transformer 架构,这是一种专门为自然语言处理任务设计的神经网络。
训练数据LLaMA 3 在互联网上的大量文本数据上进行了训练,包括书籍、文章和网站。这使得模型能够学习各种主题、风格和语言模式。
语言理解LLaMA 3 能够理解自然语言,包括 nuance、idiom 和 figurative language。它也能识别和生成多种语言的文本。
对话能力:模型能够参与对话,回答问题和响应提示。它也能基于给定的主题或主题生成文本。
创作写作LLa
```
## 评估-8B
使用 MMLU数据集评估模型. 数据集下载路径 [这里](https://huggingface.co/datasets/cais/mmlu).
配置llama3-8B 评估脚本: examples/llama3/evaluate_llama3_8b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/" #词表路径
CHECKPOINT="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1" #模型路径
# 配置任务和数据集路径
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
启动评估
```bash
bash examples/llama3/evaluate_llama3_8b_ptd.sh
```
评估结果如下
| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
| :----: | :------: | :------: | :--------: | :-------: |
| MMLU | 57 | 14042 | 0.666 | 0.653 |
# LLAMA3-70B
## 训练
LLAMA3-70B 训练的硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 64 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 LLAMA3-70B 的 [预训练权重和词表](https://huggingface.co/v2ray/Llama-3-70B/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/llama-3-70b-hf/
cd ./model_from_hf/llama-3-70b-hf/
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/config.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/generation_config.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00001-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00002-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00003-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00004-of-00030.safetensors
...
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00030-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model.safetensors.index.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/special_tokens_map.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/tokenizer.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/tokenizer_config.json
cd ../../
```
4. 权重转换
4.1 将权重从 huggingface 格式转化为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```bash
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 权重格式转换
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 8 \
--load-dir ./model_from_hf/llama-3-70b-hf/ \
--save-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/ \
--tokenizer-model ./model_from_hf/llama-3-70b-hf/tokenizer.json
```
4.2 任意并行切分策略的 Megatron 权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/llama-3-70b-hf/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/llama-3-70b-hf/mg2hg/
```
权重转换适用于预训练、微调、推理和评估,根据任务不同调整参数 `target-tensor-parallel-size``target-pipeline-parallel-size`
5. 预训练
5.1 准备数据集
下载 LLaMA3-70B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/llama-3-70b-hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-70b-hf/ \
--output-prefix ./dataset/llama-3-70b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 预训练
配置llama3-70B 预训练脚本: examples/llama3/pretrain_llama3_70b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
CKPT_SAVE_DIR="./ckpt/llama-3-70b-hf/"
TOKENIZER_MODEL="./model_from_hf/llama-3-70b-hf/" #词表路径
DATA_PATH="./dataset/llama-3-70b-hf/alpaca_text_document" #数据集路径
CKPT_LOAD_DIR="./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/" #权重路径
```
多机运行增加参数--overlap-grad-reduce
启动 LLaMA3-70B 预训练脚本: examples/llama3/pretrain_llama3_70b_ptd.sh
```shell
bash examples/llama3/pretrain_llama3_70b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/llama-3-70b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-70b-hf/ \
--output-prefix ./finetune_dataset/llama-3-70b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数--is-instruction-dataset*
增加微调参数--finetune增加预训练权重加载参数--load使微调从第一步开始。更改为以下参数
```bash
DATA_PATH="./finetune_dataset/llama-3-70b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/"
CKPT_PATH="./ckpt/llama-3-70b-hf/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
### 性能
#### 吞吐
LLaMA3-70B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数 | tokens吞吐 (tokens/s/p) |
| :--: | :-------: | :----: | :---------------------: |
| NPUs | LLaMA3-70B | 1000 | 283 |
| 参考 | LLaMA3-70B | - | 355 |
## 推理-70B
配置llama3-70B 推理脚本: examples/llama3/generate_llama3_70b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/llama-3-70b-hf-v0.1-tp8-pp1"
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/"
```
启动llama3-70B 推理脚本
```bash
bash examples/llama3/generate_llama3_70b_ptd.sh
```
## 评估-70B
使用 MMLU数据集评估模型. 数据集下载路径 [这里](https://huggingface.co/datasets/cais/mmlu).
配置llama3-70B 评估脚本: examples/llama3/evaluate_llama3_70b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/" #词表路径
CHECKPOINT="./model_weights/llama-3-70b-hf-v0.1-tp8-pp1" #模型路径
# 配置任务和数据集路径
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
启动评估
```bash
bash examples/llama3/evaluate_llama3_70b_ptd.sh
```
评估结果如下
| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
| :----: | :------: | :------: | :--------: | :-------: |
| MMLU | 57 | 14042 | 0.795 | 0.783 |

View File

@ -1,608 +0,0 @@
# LLaMA3 $\color{black}{\rm\tiny{【model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
# Contents
- [LLaMA3](#llama)
- [Contents](#contents)
- [LLAMA3-8B](#llama3-8b)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference-8B](#inference-8b)
- [Evaluation-8B](#evaluation-8b)
- [Contents](#contents)
- [LLAMA3-70B](#llama3-70b)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference-70B](#inference-70b)
- [Evaluation-70B](#evaluation-70b)
# LLAMA3-8B
## Training
Here's a hardware summary of pre-training LLAMA3-8B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
*Note that if you want to train with weights from huggingface, first fix a DeepSpeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`*
```text
# original deepspeed/runtime/engine.py, about #Lines2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
return False
# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
return False
```
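One way to apply this change from the shell, assuming DeepSpeed is installed in the active environment (the `sed` expression below is only a sketch of the textual edit; keep the backup in case the pattern matches more than the intended line):
```bash
# Locate the installed engine.py and apply the one-line change described above.
ENGINE_PY=$(python -c "import deepspeed, os; print(os.path.join(os.path.dirname(deepspeed.__file__), 'runtime', 'engine.py'))")
cp "$ENGINE_PY" "$ENGINE_PY.bak"
sed -i 's/if zero_sd_list is None:/if zero_sd_list is None or len(zero_sd_list) == 0:/' "$ENGINE_PY"
```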
3. Prepare pretrained weights and tokenizer
Download the LLAMA3-8B checkpoint from [here](https://huggingface.co/unsloth/llama-3-8B/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/llama-3-8b-hf/
cd ./model_from_hf/llama-3-8b-hf/
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/config.json
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/generation_config.json
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/model-00001-of-00004.safetensors
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/model-00002-of-00004.safetensors
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/model-00003-of-00004.safetensors
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/model-00004-of-00004.safetensors
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/model.safetensors.index.json
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/special_tokens_map.json
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/tokenizer.json
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/tokenizer_config.json
cd ../../
```
4. weight conversion in ptd mode
*Note that if you want to use the weight from huggingface, please run the weight conversion script first. The following uses llama-3-8b model weight conversion in ptd as an example.*
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert to ptd weights
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/llama-3-8b-hf/ \
--save-dir ./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/llama-3-8b-hf/tokenizer.json
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/llama-3-8b-hf/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/llama-3-8b-hf/mg2hg/
```
Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters `target-tensor-parallel-size` and `target-pipeline-parallel-size` according to different tasks.
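For example, a hedged sketch of re-running the HuggingFace-to-Megatron conversion with a hypothetical tp4/pp2 layout (the target sizes and output directory name are illustrative; everything else matches the conversion command above):
```bash
# Illustrative only: same converter, different target parallel layout.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 4 \
    --target-pipeline-parallel-size 2 \
    --load-dir ./model_from_hf/llama-3-8b-hf/ \
    --save-dir ./model_weights/llama-3-8b-hf-v0.1-tp4-pp2/ \
    --tokenizer-model ./model_from_hf/llama-3-8b-hf/tokenizer.json
```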
5. pre-training
5.1 Prepare dataset
Download the LLAMA3-8B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/llama-3-8b-hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-8b-hf/ \
--output-prefix ./dataset/llama-3-8b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
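A quick sanity check, assuming `--output-prefix` produces the usual Megatron `.bin`/`.idx` pair that the later `DATA_PATH` (`./dataset/llama-3-8b-hf/alpaca_text_document`) expects:
```bash
# Expect alpaca_text_document.bin and alpaca_text_document.idx under the output directory.
ls -lh ./dataset/llama-3-8b-hf/
```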
5.2 pre-training using ptd mode
Config LLAMA3-8B pre-training script: examples/llama3/pretrain_llama3_8b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/llama-3-8b-hf/"
TOKENIZER_MODEL="./model_from_hf/llama-3-8b-hf/" #tokenizer path
DATA_PATH="./dataset/llama-3-8b-hf/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/" #weight path
```
Multi-machine training requires the addition of parameter --overlap-grad-reduce
Launch LLAMA3-8B pre-training script: examples/llama3/pretrain_llama3_8b_ptd.sh
```shell
bash examples/llama3/pretrain_llama3_8b_ptd.sh
```
**Note**: If using multi-machine training without a shared-storage configuration across the machines, it is necessary to add the parameter `--no-shared-storage`. This parameter determines, based on the distributed parameters, whether non-master nodes need to load data, and checks the corresponding cache and generated data.
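For a multi-machine run, the per-node settings can be sketched as below; this assumes the same `DISTRIBUTED_ARGS` layout used by the pretraining scripts in this repository, and the master IP address is a placeholder:
```bash
# Hypothetical two-node example: run the launch script on every node with its own NODE_RANK.
GPUS_PER_NODE=8
MASTER_ADDR=192.168.0.1      # placeholder: IP of the rank-0 node
MASTER_PORT=6000
NNODES=2
NODE_RANK=0                  # set to 1 on the second node
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# then extend the training arguments as described above:
#   --overlap-grad-reduce --no-shared-storage
```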
6. fine-tuning
6.1 Prepare fine-tuning dataset
Download the LLAMA3-8B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/llama-3-8b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-8b-hf/ \
--output-prefix ./finetune_dataset/llama-3-8b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_llama3_8b_ptd.sh. *The differences are the dataset and the added training parameter `--is-instruction-dataset`.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
DATA_PATH="./finetune_dataset/llama-3-8b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/"
CKPT_PATH="./ckpt/llama-3-8b-hf/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
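A minimal sketch of how these lines fit into a copy of pretrain_llama3_8b_ptd.sh; `DISTRIBUTED_ARGS`, `GPT_ARGS`, `DATA_ARGS`, `OUTPUT_ARGS` and `CKPT_SAVE_DIR` come from that script, and the log file name is illustrative:
```bash
# Illustrative fine-tuning launch: point DATA_PATH at the instruction dataset
# (as set above) and add the fine-tuning flags to the pretraining command.
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --load ${CKPT_PATH} \
    --finetune \
    --is-instruction-dataset \
    --tokenizer-not-use-fast \
    --distributed-backend nccl \
    --save ${CKPT_SAVE_DIR} \
    | tee logs/tune_llama3_8b.log
```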
### Performance
#### Machine performance
The performance of LLaMA3-8B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (tokens/s/p) |
| :------: | :-----------: |:-------------------: | :-------------------------: |
| NPUs | LLaMA3-8B | 1000 | 2483 |
| Reference | LLaMA3-8B | 1000 | 2674 |
## Inference-8B
Config llama3-8B inference script: examples/llama3/generate_llama3_8b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1"
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/"
```
Launch llama3-8B inference script: examples/llama3/generate_llama3_8b_ptd.sh
```bash
bash examples/llama3/generate_llama3_8b_ptd.sh
```
**When you want to chat with Llama3, please download the Instruct/Chat weights and convert them to Megatron format. Then launch the chat script: examples/llama3/generate_llama3_8b_chat_ptd.sh**
Download the LLAMA3-8B-Instruct weight from [here](https://huggingface.co/unsloth/llama-3-8b-Instruct)
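The Instruct checkpoint is converted with the same tool and layout as the base model; a hedged example follows (the local directory names are illustrative):
```bash
# Illustrative: convert the Instruct weights to tp8/pp1, then point CHECKPOINT in
# generate_llama3_8b_chat_ptd.sh at the converted directory.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 1 \
    --load-dir ./model_from_hf/llama-3-8b-instruct-hf/ \
    --save-dir ./model_weights/llama-3-8b-instruct-hf-v0.1-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/llama-3-8b-instruct-hf/tokenizer.json
```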
Chat example
```
You >> 你好
ModelLink:
你好!我是你的中文语言模型,很高兴和你交流!有什么想聊的题目或问题,请随时问我!
You >> 介绍一下llama3
ModelLink:
LLaMA 3 is a large-scale language model developed by Meta AI, a leading artificial intelligence research organization. It's a significant upgrade to the previous LLaMA models, offering improved performance and capabilities.
Here are some key features and facts about LLaMA 3:
Scale: LLaMA 3 is a massive model, consisting of 13.5 billion parameters, making it one of the largest language models in the world.
Architecture: It's based on the Transformer architecture, which is a type of neural network designed specifically for natural language processing tasks.
Training data: LLaMA 3 was trained on a massive dataset of text from the internet, including books, articles, and websites. This training data allows the model to learn about various topics, styles, and language patterns.
Language understanding: LLaMA 3 is capable of understanding natural language, including nuances, idioms, and figurative language. It can also recognize and generate text in multiple languages.
Conversational capabilities: The model can engage in conversations, answering questions, and responding to prompts. It can also generate text based on a given topic or theme.
Creative writing: LLaMA
You >> 能用中文回答吗
ModelLink:
LLaMA 3 是 Meta AI 开发的一种大规模语言模型,是之前 LLama 模型的升级版本,提供了改进的性能和能力。
以下是 LLama 3 的一些关键特点和事实:
规模LLaMA 3 是一个庞大的模型,具有 13.5 亿参数,世界上最大的语言模型之一。
架构:它基于 Transformer 架构,这是一种专门为自然语言处理任务设计的神经网络。
训练数据LLaMA 3 在互联网上的大量文本数据上进行了训练,包括书籍、文章和网站。这使得模型能够学习各种主题、风格和语言模式。
语言理解LLaMA 3 能够理解自然语言,包括 nuance、idiom 和 figurative language。它也能识别和生成多种语言的文本。
对话能力:模型能够参与对话,回答问题和响应提示。它也能基于给定的主题或主题生成文本。
创作写作LLa
```
## Evaluation-8B
We use MMLU benchmark to evaluate our model. Benchmark Download [here](https://huggingface.co/datasets/cais/mmlu).
Config llama3-8B evaluation script: examples/llama3/evaluate_llama3_8b_ptd.sh
```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/" #tokenizer path
CHECKPOINT="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1" #model path
# configure task and data path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
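The evaluation expects the extracted benchmark under `DATA_PATH`; a quick check, assuming the usual MMLU layout of one `*_test.csv` file per subject, is:
```bash
# Sanity check (layout assumed): 57 subjects should give 57 test CSV files.
ls ./mmlu/data/test/ | head
ls ./mmlu/data/test/*_test.csv | wc -l
```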
Launch llama3-8B evaluation script:
```bash
bash examples/llama3/evaluate_llama3_8b_ptd.sh
```
Evaluation results
| dataset | subject_num | question_num | reference_acc |NPU acc|
|:---:|:-----------:|:------------:|:-------------:|:---:|
| MMLU | 57 | 14042 | 0.666 |0.653|
# LLAMA3-70B
## Training
Here's a hardware summary of pre-training LLAMA3-70B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 64 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
*Note that if you want to train with weights from huggingface, first fix a DeepSpeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`*
```text
# original deepspeed/runtime/engine.py, about #Lines2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
return False
# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
return False
```
3. Prepare pretrained weights and tokenizer
Download the LLAMA3-70B checkpoint from [here](https://huggingface.co/v2ray/Llama-3-70B/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/llama-3-70b-hf/
cd ./model_from_hf/llama-3-70b-hf/
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/config.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/generation_config.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00001-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00002-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00003-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00004-of-00030.safetensors
...
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00030-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model.safetensors.index.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/special_tokens_map.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/tokenizer.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/tokenizer_config.json
cd ../../
```
4. weight conversion in ptd mode
*Note that if you want to use the weight from huggingface, please run the weight conversion script first. The following uses llama-3-70b model weight conversion in ptd as an example.*
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert to ptd weights
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 8 \
--load-dir ./model_from_hf/llama-3-70b-hf/ \
--save-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/ \
--tokenizer-model ./model_from_hf/llama-3-70b-hf/tokenizer.json
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/llama-3-70b-hf/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/llama-3-70b-hf/mg2hg/
```
Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters `target-tensor-parallel-size` and `target-pipeline-parallel-size` according to different tasks.
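Note that the conversion above produces a tp8/pp8 checkpoint, while the inference and evaluation sections below reference a tp8-pp1 directory. A hedged sketch of re-slicing is shown here; it assumes the converter also supports Megatron-to-Megatron conversion when `--save-model-type` is omitted:
```bash
# Illustrative re-slice from tp8/pp8 to tp8/pp1 (Megatron-to-Megatron support assumed).
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader megatron \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 1 \
    --load-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/ \
    --save-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp1/
```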
5. pre-training
5.1 Prepare dataset
Download the LLAMA3-70B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/llama-3-70b-hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-70b-hf/ \
--output-prefix ./dataset/llama-3-70b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 pre-training using ptd mode
Config LLAMA3-70B pre-training script: examples/llama3/pretrain_llama3_70b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/llama-3-70b-hf/"
TOKENIZER_MODEL="./model_from_hf/llama-3-70b-hf/" #tokenizer path
DATA_PATH="./dataset/llama-3-70b-hf/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/" #weight path
```
Multi-machine training requires the addition of parameter --overlap-grad-reduce
Launch LLAMA3-70B pre-training script: examples/llama3/pretrain_llama3_70b_ptd.sh
```shell
bash examples/llama3/pretrain_llama3_70b_ptd.sh
```
**Note**: If using multi-machine training without a shared-storage configuration across the machines, it is necessary to add the parameter `--no-shared-storage`. This parameter determines, based on the distributed parameters, whether non-master nodes need to load data, and checks the corresponding cache and generated data.
6. fine-tuning
6.1 Prepare fine-tuning dataset
Download the LLAMA3-70B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/llama-3-70b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-70b-hf/ \
--output-prefix ./finetune_dataset/llama-3-70b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_llama3_70b_ptd.sh. *The differences are the dataset and the added training parameter `--is-instruction-dataset`.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
DATA_PATH="./finetune_dataset/llama-3-70b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/"
CKPT_PATH="./ckpt/llama-3-70b-hf/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
### Performance
#### Machine performance
The performance of LLaMA3-70B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (tokens/s/p) |
| :------: | :-----------: |:-------------------: | :-------------------------: |
| NPUs | LLaMA3-70B | 1000 | 283 |
| Reference | LLaMA3-70B | - | 355 |
## Inference-70B
Config llama3-70B inference script: examples/llama3/generate_llama3_70b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/llama-3-70b-hf-v0.1-tp8-pp1"
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/"
```
Launch llama3-70B inference script: examples/llama3/generate_llama3_70b_ptd.sh
```bash
bash examples/llama3/generate_llama3_70b_ptd.sh
```
## Evaluation-70B
We use MMLU benchmark to evaluate our model. Benchmark Download [here](https://huggingface.co/datasets/cais/mmlu).
Config llama3-70B evaluation script: examples/llama3/evaluate_llama3_70b_ptd.sh
```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/" #tokenizer path
CHECKPOINT="./model_weights/llama-3-70b-hf-v0.1-tp8-pp1" #model path
# configure task and data path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
Launch llama3-70B evaluation script:
```bash
bash examples/llama3/evaluate_llama3_70b_ptd.sh
```
Evaluation results
| dataset | subject_num | question_num | reference_acc |NPU acc|
|:---:|:-----------:|:------------:|:-------------:|:-------:|
| MMLU | 57 | 14042 | 0.795 | 0.783 |

View File

@ -1,60 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
# modify script model path and tokenizer path
TOKENIZER_PATH="your tokenizer directory path"
CHECKPOINT="your model directory path"
# configure task and data path
DATA_PATH="/../mmlu/test/"
TASK="mmlu"
# distributed config
MASTER_ADDR=localhost
MASTER_PORT=6011
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--load ${CHECKPOINT} \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--max-new-tokens 1 \
--evaluation-batch-size 1 \
--micro-batch-size 1 \
--use-fused-rmsnorm \
--no-masked-softmax-fusion \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--num-layers 80 \
--hidden-size 8192 \
--ffn-hidden-size 28672 \
--num-attention-heads 64 \
--group-query-attention \
--num-query-groups 8 \
--swiglu \
--disable-bias-linear \
--position-embedding-type rope \
--rotary-base 500000 \
--normalization RMSNorm \
--untie-embeddings-and-output-weights \
--make-vocab-size-divisible-by 16032 \
--bf16 \
--seed 42 | tee logs/evaluation_llama3_70b_${TASK}.log

View File

@ -1,60 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
# modify script model path and tokenizer path
TOKENIZER_PATH="your tokenizer directory path"
CHECKPOINT="your model directory path"
# configure task and data path
DATA_PATH="/../mmlu/test/"
TASK="mmlu"
# distributed config
MASTER_ADDR=localhost
MASTER_PORT=6011
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--load ${CHECKPOINT} \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--max-new-tokens 1 \
--evaluation-batch-size 1 \
--micro-batch-size 1 \
--use-fused-rmsnorm \
--no-masked-softmax-fusion \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--swiglu \
--disable-bias-linear \
--position-embedding-type rope \
--rotary-base 500000 \
--normalization RMSNorm \
--untie-embeddings-and-output-weights \
--make-vocab-size-divisible-by 16032 \
--bf16 \
--seed 42 | tee logs/evaluation_llama3_8b_${TASK}.log

View File

@ -1,59 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export WITHOUT_JIT_COMPILE=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--use-fused-swiglu \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--load ${CHECKPOINT} \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--num-layers 80 \
--hidden-size 8192 \
--ffn-hidden-size 28672 \
--position-embedding-type rope \
--rotary-base 500000 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--group-query-attention \
--num-query-groups 8 \
--micro-batch-size 1 \
--num-attention-heads 64 \
--swiglu \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--hidden-dropout 0 \
--attention-dropout 0 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 16032 \
--bf16 \
--seed 42 \
| tee logs/generate_llama3_70b.log

View File

@ -1,64 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export WITHOUT_JIT_COMPILE=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--task chat \
--hf-chat-template \
--add-eos-token '<|eot_id|>' \
--top-p 0.9 \
--temperature 0.6 \
--use-fused-swiglu \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--load ${CHECKPOINT} \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--position-embedding-type rope \
--rotary-base 500000 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--group-query-attention \
--num-query-groups 8 \
--micro-batch-size 1 \
--num-attention-heads 32 \
--swiglu \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--hidden-dropout 0 \
--attention-dropout 0 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 16032 \
--bf16 \
--seed 42 \
| tee logs/generate_llama3_8b.log

View File

@ -1,59 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export WITHOUT_JIT_COMPILE=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--use-fused-swiglu \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--load ${CHECKPOINT} \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--position-embedding-type rope \
--rotary-base 500000 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--group-query-attention \
--num-query-groups 8 \
--micro-batch-size 1 \
--num-attention-heads 32 \
--swiglu \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--hidden-dropout 0 \
--attention-dropout 0 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 16032 \
--bf16 \
--seed 42 \
| tee logs/generate_llama3_8b.log

View File

@ -1,96 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=8
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=8
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--micro-batch-size 1 \
--global-batch-size 512 \
--sequence-parallel \
--use-flash-attn \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--use-fused-rmsnorm \
--use-fused-swiglu \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--num-layers 80 \
--hidden-size 8192 \
--ffn-hidden-size 28672 \
--num-attention-heads 64 \
--group-query-attention \
--num-query-groups 8 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--make-vocab-size-divisible-by 16032 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--rotary-base 500000 \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--swiglu \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--lr 1.25e-6 \
--train-iters 1000 \
--lr-decay-style cosine \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 10000 \
--eval-interval 10000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_llama3_70b.log

View File

@ -1,96 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--micro-batch-size 2 \
--global-batch-size 64 \
--sequence-parallel \
--use-flash-attn \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--use-fused-rmsnorm \
--use-fused-swiglu \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--make-vocab-size-divisible-by 16032 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--rotary-base 500000 \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--swiglu \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--lr 1.25e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 10000 \
--eval-interval 10000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_llama3_8b.log

View File

@ -1,301 +0,0 @@
# Mistral-7B-32K【昇腾贡献模型】
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
# 目录
- [Mistral](#mistral)
- [目录](#目录)
- [7B](#7B-32K)
- [硬件要求](#硬件要求)
- [准备工作](#准备工作)
- [模型训练](#模型训练)
- [模型性能](#模型性能)
- [吞吐](#吞吐)
- [模型推理](#模型推理)
- [模型评估](#模型评估)
# 7B-32K
## 硬件要求
训练的推荐硬件配置:
| 硬件 | 配置 |
| :--: | :--------------: |
| NPU | 8 x Ascend NPUs |
推理的推荐硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 8 x Ascend NPUs |
## 准备工作
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 Mistral-7B 的 [预训练权重和词表](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/tree/main)*建议仅下载使用safetensors格式的权重*
```shell
#!/bin/bash
cd ./model_from_hf/
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
cd ..
```
4. 权重转换
HuggingFace权重 --> 任意并行切分策略的Megatron权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```bash
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# HF 转 tp8-pp1
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--load-dir ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--save-dir ./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Mistral-7B-Instruct-v0.2/tokenizer.model \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1
```
任意并行切分策略的Megatron权重 --> HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```bash
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp8-pp1 转 HF
python tools/checkpoint/convert_ckpt.py \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/ \
--save-dir ./model_from_hf/Mistral-7B-Instruct-v0.2/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Mistral-7B-Instruct-v0.2/mg2hg/
```
## 模型训练
准备数据集
下载Alpaca[数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Mistral-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--output-prefix ./dataset/Mistral-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
配置 Mistral-7B 预训练脚本:***examples/mistral/pretrain_mistral_7b_ptd.sh***
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
DATA_PATH="./dataset/Mistral-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
CKPT_SAVE_DIR="./ckpt/Mistral-7B-Instruct-v0.2-tp8-pp1/"
# 根据分布式集群实际情况配置分布式参数
GPUS_PER_NODE=8
MASTER_ADDR="your master node IP"
MASTER_PORT=6000
NNODES=1
NODE_RANK="current node id"
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))
# 训练并行策略
TP=8
PP=1
```
启动 Mistral-7B 预训练脚本: ***examples/pretrain_mistral_7b_ptd.sh***
```shell
bash examples/mistral/pretrain_mistral_7b_ptd.sh
```
**注意**
1. 如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加 `--no-shared-storage` 参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据该参数的添加方式参见下方示例
2. pretrain_mistral_7b_ptd.sh 脚本里的训练超参需要根据实际情况调整例如global-batch-size在预训练中需要设置得更大比如256才能达到更好的效果。
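注意 1 中的 `--no-shared-storage` 直接追加到 pretrain_mistral_7b_ptd.sh 的 GPT_ARGS 即可,示意如下:

```bash
# 非共享存储的多机训练:为 GPT_ARGS 追加 --no-shared-storage示意写法
GPT_ARGS="$GPT_ARGS \
    --no-shared-storage"
```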
微调
下载微调数据集 [这里](https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese/blob/main/Alpaca_data_gpt4_zh.jsonl)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese/blob/main/Alpaca_data_gpt4_zh.jsonl
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/Mistral-7B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/Alpaca_data_gpt4_zh.jsonl \
--output-prefix ./finetune_dataset/Mistral-7B/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--append-eod \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler \
--workers 4
```
指令微调
微调的配置脚本与预训练脚本基本一致。*区别是数据集,以及增加训练参数 `--is-instruction-dataset`。*
增加微调参数 `--finetune`,并增加预训练权重加载参数 `--load`,使微调从第一步开始:
```bash
DATA_PATH="./finetune_dataset/Mistral-7B/alpaca"
CKPT_PATH="./ckpt/Mistral-7B-Instruct-v0.2-tp8-pp1/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset
```
## 模型性能
### 吞吐
Mistral-7B-32K(**开启SWA 4096**)在单机8卡上(tp8 pp1) **昇腾芯片** 和 **参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/s) | tokens吞吐 (tokens/s/p) | 单步迭代时间 (s/step) | 显存占用/p |
| :--: | :----------: | :----: | :---------------------: | :---------------------: | :-------------------: | :-------------------: |
| NPUs | Mistral-7B 32K | 1000 | 0.69 | 2806 | 46.7 | ~44642MB |
| 参考 | Mistral-7B 32K | 1000 | 0.67 | 2734 | 48.0 | ~65500MB |
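表中吞吐可按预训练脚本中的配置seq-length 32768、global-batch-size 32、8 卡)与单步耗时粗略换算,下面的命令仅用于演示这一换算关系:

```bash
# 按 seq-length=32768、global-batch-size=32、8 卡、46.7 s/step 估算吞吐
awk 'BEGIN {
    seq = 32768; gbs = 32; npus = 8; step = 46.7
    printf "samples/s=%.2f  tokens/s/p=%.0f\n", gbs/step, seq*gbs/(step*npus)
}'
# 预期输出约为 samples/s=0.69  tokens/s/p=2807与上表数据基本一致
```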
## 模型推理
首先需要配置推理脚本: ***examples/mistral/generate_mistral_7b_ptd.sh***
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
# 根据实际加载的模型权重修改并行配置
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
TP=8
PP=1
# 注意:该模型经过指令遵从训练,需要配合对话模板使用;基本操作同上,仅需在脚本中增加如下参数:
--inference-prompt-type mixtral
```
然后可直接启动
```bash
bash examples/mistral/generate_mistral_7b_ptd.sh
```
推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/mistral/generate_demo.png)
## 模型评估
使用 MMLU 数据集评估模型。数据集下载路径见[这里](https://huggingface.co/datasets/cais/mmlu)。
配置评估脚本: examples/mistral/evaluate_mistral_7b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
CHECKPOINT="./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
# 配置任务和数据集路径
DATA_PATH="./mmlu/test/"
TASK="mmlu"
```
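MMLU 测试集按学科拆分为若干 csv 文件DATA_PATH 指向存放这些 csv 的目录即可。可用如下命令粗略检查数据是否就绪(路径与文件名以实际下载结果为准):

```bash
# 粗略检查 MMLU 测试集MMLU 共 57 个学科,正常应能看到 57 个 csv 文件
ls ./mmlu/test/*.csv | head -n 5
ls ./mmlu/test/*.csv | wc -l
```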
启动评估
```bash
bash examples/mistral/evaluate_mistral_7b_ptd.sh
```
评估结果如下
| 数据集 | 总问题数 | 参考准确率 | NPU准确率 |
| :----: | :------: | :--------: | :-------: |
| MMLU | 14042 | 0.563 | 0.563 |

View File

@ -1,301 +0,0 @@
# Mistral-7B-32K
<p align="left">
<b><a href="README.md">简体中文</a> </b> |
<b>English</b>
</p>
# Table of Contents
- [Mistral](#mistral)
- [Table of Contents](#table-of-contents)
- [7B](#7B-32K)
- [Hardware-Requirements](#hardware-requirements)
- [Preparation](#preparation)
- [Model-Training](#model-training)
- [Model-Performance](#model-performance)
- [Throughput](#throughput)
- [Model-Inference](#model-inference)
- [Model-Evaluation](#model-evaluation)
# 7B-32K
## Hardware-Requirements
Minimum hardware requirements for training:
| Hardware | Configuration |
| :------: |:---------------:|
| NPU | 8 x Ascend NPUs |
Recommended hardware configuration for inference:
| Hardware | Configuration |
| :------: | :-------------: |
| NPU | 8 x Ascend NPUs |
## Preparation
1. Clone the code repository to the local server
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Set up the environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# Install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Download the pre-trained weights and vocabulary for Mistral-7B from [here](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/tree/main). (It is recommended to only download weights in safetensors format)
```shell
#!/bin/bash
cd ./model_from_hf/
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
cd ..
```
4. Weight conversion
HuggingFace weights --> Megatron weights with any parallel slicing strategy
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```bash
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# HF to tp8-pp1
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--load-dir ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--save-dir ./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Mistral-7B-Instruct-v0.2/tokenizer.model \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```bash
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp8-pp1 to HF
python tools/checkpoint/convert_ckpt.py \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/ \
--save-dir ./model_from_hf/Mistral-7B-Instruct-v0.2/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Mistral-7B-Instruct-v0.2/mg2hg/
```
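As noted in the script comment above, the converted HuggingFace weights are written to the mg2hg/ sub-directory of the original model path; a quick check (for illustration only) could look like this:

```bash
# Confirm that the mg2hg directory was created and inspect the exported weights
ls -lh ./model_from_hf/Mistral-7B-Instruct-v0.2/mg2hg/
```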
## Model-Training
Prepare dataset
Download the datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet), save to ModelLink/dataset/ directory.
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Mistral-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--output-prefix ./dataset/Mistral-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
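After preprocessing, the --output-prefix typically yields a .bin/.idx pair whose name should match the DATA_PATH used in the pre-training script; the following check is only an example:

```bash
# Confirm the preprocessed outputs match the DATA_PATH prefix used by the pre-training script
ls -lh ./dataset/Mistral-7B/alpaca_text_document.bin \
       ./dataset/Mistral-7B/alpaca_text_document.idx
```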
Configure Mistral-7B pre-training script: ***examples/mistral/pretrain_mistral_7b_ptd.sh***
```shell
# Set the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Configure according to the actual vocabulary, dataset, and model parameter save path
DATA_PATH="./dataset/Mistral-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
CKPT_SAVE_DIR="./ckpt/Mistral-7B-Instruct-v0.2-tp8-pp1/"
# Configure distributed parameters according to the actual distributed cluster
GPUS_PER_NODE=8
MASTER_ADDR="your master node IP"
MASTER_PORT=6000
NNODES=1
NODE_RANK="current node id"
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))
# Training parallel strategy
TP=8
PP=1
```
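For multi-machine training, the distributed variables must be set per node. A hypothetical two-node, 16-card example is sketched below (the IP address and node count are placeholders):

```bash
# Example distributed variables for a 2-node, 16-card setup (placeholder values, adjust to your cluster)
GPUS_PER_NODE=8
MASTER_ADDR="192.168.0.10"    # every node points to the master node IP
MASTER_PORT=6000
NNODES=2
NODE_RANK=0                   # 0 on the master node, 1 on the second machine
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))
```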
Start Mistral-7B pre-training script: ***examples/mistral/pretrain_mistral_7b_ptd.sh***
```shell
bash examples/mistral/pretrain_mistral_7b_ptd.sh
```
**Note**:
1. For multi-machine training, it is necessary to set up data sharing across machines so that non-primary nodes can read the data prepared by the primary node; alternatively, copy the data generated by the master node to the non-master nodes directly.
2. The hyperparameters for training in the pretrain_mistral_7b_ptd.sh script need to be adjusted according to actual situations. For example, the global-batch-size needs to be set larger during pre-training to achieve better results, such as 256.
Fine-Tuning
Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese/blob/main/Alpaca_data_gpt4_zh.jsonl)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese/blob/main/Alpaca_data_gpt4_zh.jsonl
cd ..
# process datasets
mkdir ./finetune_dataset/Mistral-7B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/Alpaca_data_gpt4_zh.jsonl \
--output-prefix ./finetune_dataset/Mistral-7B/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--append-eod \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler \
--workers 4
```
Supervised Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as the pre-training script. *The difference is the dataset and the additional training parameter `--is-instruction-dataset`.*
Add the fine-tuning parameter `--finetune` and the pretrained-weight load parameter `--load`, so that fine-tuning starts from the first step.
```shell
DATA_PATH="./finetune_dataset/Mistral-7B/alpaca"
CKPT_PATH="./ckpt/Mistral-7B-Instruct-v0.2-tp8-pp1/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset
```
## Model-Performance
### Throughput
Comparison of Mistral-7B-32K (**SWA 4096**) performance on 1 node and 8 chips with tp8 pp1:
| Device | Model | Iterations | Sample Throughput (samples/s) | Tokens Throughput (tokens/s/p) | Single-Step Iteration Time (s/step) | Memory Usage/p |
| :--: | :----------: | :----: | :---------------------: | :---------------------: | :-------------------: | :-------------------: |
| NPUs | Mistral-7B 32K | 1000 | 0.69 | 2806 | 46.7 | ~44642MB |
| Reference | Mistral-7B 32K | 1000 | 0.67 | 2734 | 48.0 | ~65500MB |
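The throughput figures can be roughly reproduced from the pre-training configuration (seq-length 32768, global-batch-size 32, 8 NPUs) and the step time; the snippet below only illustrates that arithmetic:

```bash
# Estimate throughput from seq-length=32768, global-batch-size=32, 8 devices, 46.7 s/step
awk 'BEGIN {
    seq = 32768; gbs = 32; npus = 8; step = 46.7
    printf "samples/s=%.2f  tokens/s/p=%.0f\n", gbs/step, seq*gbs/(step*npus)
}'
# Expected output is roughly samples/s=0.69 and tokens/s/p=2807, close to the table above
```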
## Model-Inference
First, configure the inference script: ***examples/mistral/generate_mistral_7b_ptd.sh***
```bash
# Execute set_env.sh according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model weight path and tokenizer path
CHECKPOINT="./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
# Modify according to the actual loaded model weight the parallel configuration
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
TP=8
PP=1
# Note: the model used here is an instruction-tuned (L1) model and needs to be used with a chat template. The basic operations are the same as above; only the following argument is added:
--inference-prompt-type mixtral
```
Then you can start it directly
```bash
bash examples/mistral/generate_mistral_7b_ptd.sh
```
An example of inference is as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/mistral/generate_demo.png)
## Model-Evaluation
Evaluate the model using the MMLU dataset. Dataset download path [here](https://huggingface.co/datasets/cais/mmlu).
Configure the evaluation script: ***examples/mistral/evaluate_mistral_7b_ptd.sh***
```bash
# Ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model parameter path and tokenizer path
CHECKPOINT="./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
# Configure tasks and dataset paths
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
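The MMLU test set is split into per-subject csv files; DATA_PATH should point at the directory holding them. A rough readiness check (paths depend on how the dataset was downloaded) might be:

```bash
# Rough check of the MMLU test set: 57 subjects, so 57 csv files are expected
ls ./mmlu/data/test/*.csv | head -n 5
ls ./mmlu/data/test/*.csv | wc -l
```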
Start the evaluation
```bash
bash examples/mistral/evaluate_mistral_7b_ptd.sh
```
The evaluation results are as follows
| Dataset | Total Questions | Reference Accuracy | NPU Accuracy |
| :-----: | :-----: | :------------: | :---: |
| MMLU | 14042 | 0.563 | 0.563 |

View File

@ -1,66 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export TOKENIZERS_PARALLELISM=false
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
TP=8
PP=1
CHECKPOINT="Your ckpt file path"
TOKENIZER_PATH="Your vocab file path"
DATA_PATH="Your data path (such as ./mmlu/test/)"
TASK="mmlu"
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--task $TASK \
--task-data-path $DATA_PATH \
--max-new-tokens 1 \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--seq-length 4096 \
--max-position-embeddings 32768 \
--micro-batch-size 1 \
--make-vocab-size-divisible-by 1 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--position-embedding-type rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--swiglu \
--no-gradient-accumulation-fusion \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--load ${CHECKPOINT} \
--no-load-optim \
--no-load-rng \
--bf16 \
--seed 42
"
torchrun $DISTRIBUTED_ARGS evaluation.py \
$GPT_ARGS \
--distributed-backend nccl | tee logs/evaluation_mixtral_${TASK}.log

View File

@ -1,65 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CHECKPOINT="your model ckpt path"
TOKENIZER_MODEL="your tokenizer path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--num-layers 32 \
--hidden-size 4096 \
--sliding-window 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--seq-length 4096 \
--max-position-embeddings 32768 \
--micro-batch-size 1 \
--make-vocab-size-divisible-by 1 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--position-embedding-type rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--swiglu \
--no-gradient-accumulation-fusion \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--load ${CHECKPOINT} \
--no-load-optim \
--no-load-rng \
--bf16
"
torchrun $DISTRIBUTED_ARGS inference.py \
$GPT_ARGS \
--distributed-backend nccl \
--inference-prompt-type mixtral \
| tee logs/generate_mixtral.log

View File

@ -1,102 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR="your master node IP"
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
echo "NODE_RANK ${NODE_RANK}"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_SAVE_DIR="your model save ckpt path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=1
NUM_LAYERS=32
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--sliding-window 4096 \
--num-layers ${NUM_LAYERS} \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--seq-length 32768 \
--max-position-embeddings 32768 \
--micro-batch-size 1 \
--global-batch-size 32 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 1000 \
--lr-decay-style cosine \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--use-fused-swiglu \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--use-mc2 \
--swiglu \
--use-flash-attn \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--initial-loss-scale 65536 \
--adam-beta2 0.95 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--use-distributed-optimizer \
--overlap-grad-reduce \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0 \
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 1000 \
--eval-interval 1000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
| tee logs/train_mistral.log

View File

@ -42,10 +42,11 @@
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -70,7 +71,7 @@
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
@ -99,7 +100,7 @@
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# HF 转 tp8-pp4-ep1
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_hf \
--saver mixtral \
@ -118,8 +119,8 @@
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp1-pp8-ep2 转 tp1-pp8-ep1
python tools/checkpoint/convert_ckpt.py \
# tp8-pp4-ep1 转 tp8-pp1-ep1
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_mg \
--saver mixtral \
@ -138,7 +139,7 @@
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp8-pp4-ep1 转 HF
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_mg \
--saver mixtral \
@ -194,7 +195,7 @@
EP=1
```
启动 Mixtral-8x7B 预训练脚本: ***examples/pretrain_mixtral_8x7b_ptd.sh***
启动 Mixtral-8x7B 预训练脚本: ***examples/mixtral/pretrain_mixtral_8x7b_ptd.sh***
```shell
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
@ -254,7 +255,7 @@ Mixtral-8x7B 在四机32卡上(tp8 pp4) **昇腾芯片** 和 **参考芯片**
## 模型推理
首先需要配置推理脚本: ***examples/mixtral/generate_mixtral_8x7b_ptd.sh***
首先需要配置推理脚本: ***tasks/inference/generate_mixtral_8x7b_ptd.sh***
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
@ -283,16 +284,16 @@ PP=1
然后可直接启动
```bash
bash examples/mixtral/generate_mixtral_8x7b_ptd.sh
bash tasks/inference/generate_mixtral_8x7b_ptd.sh
```
推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/mixtral/generate_demo.png)
![Inference](../../sources/images/mixtral/generate_demo.png)
## 模型评估
使用 MMLU数据集评估模型. 数据集下载路径 [这里](https://huggingface.co/datasets/cais/mmlu).
配置评估脚本: examples/mixtral/evaluate_mixtral_8x7b_ptd.sh
配置评估脚本: tasks/evaluation/evaluate_mixtral_8x7b_ptd.sh
```bash
# ascend-toolkit 路径
@ -309,7 +310,7 @@ TASK="mmlu"
启动评估
```bash
bash examples/mixtral/evaluate_mixtral_8x7b_ptd.sh
bash tasks/evaluation/evaluate_mixtral_8x7b_ptd.sh
```
评估结果如下

View File

@ -42,10 +42,11 @@ Recommended hardware configuration for inference:
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -70,7 +71,7 @@ Recommended hardware configuration for inference:
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
@ -100,12 +101,12 @@ Recommended hardware configuration for inference:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# HF to tp8-pp4-ep1
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_hf \
--saver mixtral \
--load-dir ./model_from_hf/Mixtral-8x7B/ \
--save-dir ./model_weights/Mixtral-8x7B-v0.1-tp8-pp4-ep1/ \
--save-dir ./model_weights/Mixtral-8x7B-v0.1-tp1-tp8-pp4/ \
--tokenizer-model ./model_from_hf/Mixtral-8x7B/tokenizer.model \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 4 \
@ -113,14 +114,14 @@ Recommended hardware configuration for inference:
```
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
***(This scenario is generally used to reconfigure the sliced model weights, such as training on a four-node 32-card TP8-PP4 strategy, and then wanting to infer on a single-node 8-card TP8)***
***(This scenario is generally used to reconfigure the sliced model weights, such as training on a dual-node 16-card EP2-PP8 strategy, and then wanting to infer on a single-node 8-card TP8)***
```bash
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp8-pp4-ep1 to tp8-pp1-ep1
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_mg \
--saver mixtral \
@ -139,7 +140,7 @@ Recommended hardware configuration for inference:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp8-pp4-ep1 to HF
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_mg \
--saver mixtral \
@ -195,7 +196,7 @@ Recommended hardware configuration for inference:
EP=1
```
Start Mixtral-8x7B pre-training script: ***examples/pretrain_mixtral_8x7b_ptd.sh***
Start Mixtral-8x7B pre-training script: ***examples/mixtral/pretrain_mixtral_8x7b_ptd.sh***
```shell
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
@ -255,7 +256,7 @@ Comparison of Mixtral-8x7B performance on 4 nodes and 32 chips with tp8 pp4:
## Model-Inference
First, configure the inference script: ***examples/mixtral/generate_mixtral_8x7b_ptd.sh***
First, configure the inference script: ***tasks/inference/generate_mixtral_8x7b_ptd.sh***
```bash
# Execute set_env.sh according to your own ascend-toolkit path
@ -284,16 +285,16 @@ torchrun $DISTRIBUTED_ARGS inference.py
Then you can start it directly
```bash
bash examples/mixtral/generate_mixtral_8x7b_ptd.sh
bash tasks/inference/generate_mixtral_8x7b_ptd.sh
```
An example of inference is as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/mixtral/generate_demo.png)
![Inference](../../sources/images/mixtral/generate_demo.png)
## Model-Evaluation
Evaluate the model using the MMLU dataset. Dataset download path [here](https://huggingface.co/datasets/cais/mmlu).
Configure the evaluation script: ***examples/mixtral/evaluate_mixtral_8x7b_ptd.sh***
Configure the evaluation script: ***tasks/evaluation/evaluate_mixtral_8x7b_ptd.sh***
```bash
# Ascend-toolkit path
@ -311,7 +312,7 @@ TASK="mmlu"
Start the evaluation
```bash
bash examples/mixtral/evaluate_mixtral_8x7b_ptd.sh
bash tasks/evaluation/evaluate_mixtral_8x7b_ptd.sh
```
The evaluation results are as follows

View File

@ -1,5 +1,6 @@
#!/bin/bash
export ASCEND_LAUNCH_BLOCKING=1
export WITHOUT_JIT_COMPILE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
@ -37,7 +38,7 @@ MOE_ARGS="
--moe-router-load-balancing-type aux_loss \
--moe-aux-loss-coeff 0.01 \
--moe-train-capacity-factor 1.1 \
--noisy-gate-policy RSample
--noisy_gate_policy RSample
"
GPT_ARGS="
@ -58,7 +59,7 @@ GPT_ARGS="
--seq-length 32768 \
--max-position-embeddings 32768 \
--micro-batch-size 1 \
--global-batch-size 16 \
--global-batch-size 8 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 2000 \

File diff suppressed because it is too large.

View File

@ -1,4 +1,4 @@
# Qwen
# Qwen $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
@ -48,10 +48,11 @@ Here's a hardware summary of pre-training Qwen-7B:
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -63,73 +64,73 @@ Here's a hardware summary of pre-training Qwen-7B:
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights and tokenizer
Download the Qwen-7B checkpoint from [here](https://huggingface.co/Qwen/Qwen-7B/tree/main)
Download the Qwen-7B checkpoint from [here](https://huggingface.co/Qwen/Qwen-7B/tree/main)
```bash
mkdir ./model_from_hf/Qwen-7B/
cd ./model_from_hf/Qwen-7B/
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cache_autogptq_cuda_256.cpp
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cache_autogptq_cuda_kernel_256.cu
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/config.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/configuration_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cpp_kernels.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/generation_config.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00001-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00002-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00003-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00004-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00005-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00006-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00007-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00008-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model.safetensors.index.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/modeling_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/qwen.tiktoken
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/qwen_generation_utils.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/tokenization_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/tokenizer_config.json
cd ../../
```
```bash
mkdir ./model_from_hf/Qwen-7B/
cd ./model_from_hf/Qwen-7B/
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cache_autogptq_cuda_256.cpp
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cache_autogptq_cuda_kernel_256.cu
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/config.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/configuration_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cpp_kernels.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/generation_config.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00001-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00002-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00003-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00004-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00005-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00006-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00007-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00008-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model.safetensors.index.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/modeling_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/qwen.tiktoken
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/qwen_generation_utils.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/tokenization_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/tokenizer_config.json
cd ../../
```
Modify line 39 in the modelling_qwen.py file, changing:
Modify line 39 in the modelling_qwen.py file, changing:
```python
SUPPORT_FP16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 7
```
```python
SUPPORT_FP16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 7
```
to
to
```python
SUPPORT_FP16 = True
```
```python
SUPPORT_FP16 = True
```
4. Weights convert
Convert weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
Convert weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader qwen_hf \
--saver megatron \
@ -140,13 +141,13 @@ Here's a hardware summary of pre-training Qwen-7B:
--add-qkv-bias
```
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
@ -157,7 +158,7 @@ Here's a hardware summary of pre-training Qwen-7B:
--add-qkv-bias \
--save-dir ./model_from_hf/Qwen-7B/ # Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Qwen-7B/mg2hg/
```
5. Prepare dataset
1. Prepare dataset
Download the Qwen-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
@ -166,7 +167,7 @@ Here's a hardware summary of pre-training Qwen-7B:
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Qwen-7B/
python ./tools/preprocess_data.py \
@ -177,28 +178,28 @@ Here's a hardware summary of pre-training Qwen-7B:
--seq-length 8192 \
--workers 4 \
--log-interval 1000
```
6. pre-training
```
1. pre-training
Config Qwen-7B pre-training script: examples/qwen/pretrain_qwen_7b_ptd.sh
Config Qwen-7B pre-training script: examples/qwen/pretrain_qwen_7b_ptd.sh
```shell
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Qwen-7B/"
TOKENIZER_MODEL="./model_from_hf/Qwen-7B/" #tokenizer path
DATA_PATH="./dataset/Qwen-7B/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Qwen-7B-v0.1-tp8-pp1/"
```
```
Config Qwen-7B pre-training script: examples/qwen/pretrain_qwen_7b_ptd.sh
Config Qwen-7B pre-training script: examples/qwen/pretrain_qwen_7b_ptd.sh
```shell
```shell
bash examples/qwen/pretrain_qwen_7b_ptd.sh
```
**Note**: If using multi machine training, and no data sharing configuration on the mechines, it's necessary to add the parameter `--no-shared-storage`. This parameter will determine whether non master nodes need to load data based on distributed parameters, and check the corresponding cache and generated data.
```
**Note**: If using multi machine training, and no data sharing configuration on the mechines, it's necessary to add the parameter `--no-shared-storage`. This parameter will determine whether non master nodes need to load data based on distributed parameters, and check the corresponding cache and generated data.
### Performance
@ -212,8 +213,7 @@ The performance of Qwen-7B in **Ascend NPU** and **Reference**:
| Reference | Qwen-7B | 2867 |
## Inference
Config qwen-7b inference script: examples/qwen/generate_qwen_7b_ptd.sh
Config qwen-7b inference script: tasks/inference/generate_qwen_7b_ptd.sh
```bash
# ascend-toolkit path
@ -224,22 +224,19 @@ CHECKPOINT="./model_weights/Qwen-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Qwen-7B/"
```
Launch qwen-7b inference script: examples/qwen/generate_qwen_7b_ptd.sh
Launch qwen-7b inference script: tasks/inference/generate_qwen_7b_ptd.sh
```bash
bash examples/qwen/generate_qwen_7b_ptd.sh
bash tasks/inference/generate_qwen_7b_ptd.sh
```
**Note**: If using multi machine training, it is necessary to set up multi machine data sharing, and non primary nodes can read the primary node data through data sharing. Alternatively, directly copy the data generated by the master node to non master nodes.
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/qwen/qwen_7b_inference.png)
![Inference](../../sources/images/qwen/qwen_7b_inference.png)
## Evaluation
We use the [CEval benchmark](https://huggingface.co/datasets/ceval/ceval-exam) and [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
Config qwen-7b evaluation script: examples/qwen/evaluate_qwen_7b_ptd.sh
Config qwen-7b evaluation script: tasks/evaluation/evaluate_qwen_7b_ptd.sh
```bash
# ascend-toolkit path
@ -257,7 +254,7 @@ TASK="mmlu" # "ceval" for ceval task
Launch qwen-7b evaluation
```bash
bash examples/qwen/evaluate_qwen_7b_ptd.sh
bash ./tasks/evaluation/evaluate_qwen_7b_ptd.sh
```
| Task | Subset | Question | OpenSource | NPU |
@ -283,10 +280,11 @@ Here's a hardware summary of pre-training Qwen-14B:
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -298,20 +296,20 @@ Here's a hardware summary of pre-training Qwen-14B:
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
@ -370,8 +368,8 @@ Here's a hardware summary of pre-training Qwen-14B:
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader qwen_hf \
--saver megatron \
@ -388,7 +386,7 @@ Here's a hardware summary of pre-training Qwen-14B:
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
@ -399,7 +397,7 @@ Here's a hardware summary of pre-training Qwen-14B:
--add-qkv-bias \
--save-dir ./model_from_hf/Qwen-14B/ # Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Qwen-14B/mg2hg/
```
5. Prepare dataset
1. Prepare dataset
Download the Qwen-14B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
@ -408,7 +406,7 @@ Here's a hardware summary of pre-training Qwen-14B:
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Qwen-14B/
python ./tools/preprocess_data.py \
@ -419,15 +417,15 @@ Here's a hardware summary of pre-training Qwen-14B:
--seq-length 2048 \
--workers 4 \
--log-interval 1000
```
6. pre-training
```
1. pre-training
Config Qwen-14B pre-training script: examples/qwen/pretrain_qwen_14b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Qwen-14B/"
TOKENIZER_MODEL="./model_from_hf/Qwen-14B/" #tokenizer path
@ -440,7 +438,7 @@ Here's a hardware summary of pre-training Qwen-14B:
```shell
bash examples/qwen/pretrain_qwen_14b_ptd.sh
```
**Note**: If using multi machine training, and no data sharing configuration on the mechines, it's necessary to add the parameter `--no-shared-storage`. This parameter will determine whether non master nodes need to load data based on distributed parameters, and check the corresponding cache and generated data.
### Performance
#### Machine performance
@ -454,7 +452,7 @@ The performance of Qwen-14B in **Ascend NPU** and **Reference**:
## Inference
Config qwen-14b inference script: examples/qwen/generate_qwen_14b_ptd.sh
Config qwen-14b inference script: tasks/inference/generate_qwen_14b_ptd.sh
```bash
# ascend-toolkit path
@ -465,21 +463,20 @@ CHECKPOINT="./model_weights/Qwen-14B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Qwen-14B/"
```
Launch qwen-14b inference script: examples/qwen/generate_qwen_14b_ptd.sh
Launch qwen-14b inference script: tasks/inference/generate_qwen_14b_ptd.sh
```bash
bash examples/qwen/generate_qwen_7b_ptd.sh
bash tasks/inference/generate_qwen_7b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/qwen/qwen_14b_inference.png)
![Inference](../../sources/images/qwen/qwen_14b_inference.png)
## Evaluation
We use the [CEval benchmark](https://huggingface.co/datasets/ceval/ceval-exam) and [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
Config qwen-14b evaluation script: examples/qwen/evaluate_qwen_14b_ptd.sh
Config qwen-14b evaluation script: tasks/evaluation/evaluate_qwen_14b_ptd.sh
```bash
# ascend-toolkit path
@ -497,7 +494,7 @@ TASK="mmlu" # "ceval" for ceval task
Launch qwen-14b evaluation
```bash
bash examples/qwen/evaluate_qwen_14b_ptd.sh
bash ./tasks/evaluation/evaluate_qwen_14b_ptd.sh
```
| Task | Subset | Question | OpenSource | NPU |
@ -524,10 +521,11 @@ Here's a hardware summary of pre-training Qwen-72B:
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -539,20 +537,20 @@ Here's a hardware summary of pre-training Qwen-72B:
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
@ -592,8 +590,8 @@ Here's a hardware summary of pre-training Qwen-72B:
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader qwen_hf \
--saver megatron \
@ -610,7 +608,7 @@ Here's a hardware summary of pre-training Qwen-72B:
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
@ -621,7 +619,8 @@ Here's a hardware summary of pre-training Qwen-72B:
--add-qkv-bias \
--save-dir ./model_from_hf/Qwen-72B/ # Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Qwen-72B/mg2hg/
```
5. Prepare dataset
1. Prepare dataset
Download the Qwen-72B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
@ -648,31 +647,28 @@ Here's a hardware summary of pre-training Qwen-72B:
Config Qwen-72B pre-training script: examples/qwen/pretrain_qwen_72b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Qwen-72B/"
TOKENIZER_MODEL="./model_from_hf/Qwen-72B/" #tokenizer path
DATA_PATH="./dataset/Qwen-72B/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Qwen-72B-v0.1-tp8-pp1/"
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Qwen-72B/"
TOKENIZER_MODEL="./model_from_hf/Qwen-72B/" #tokenizer path
DATA_PATH="./dataset/Qwen-72B/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Qwen-72B-v0.1-tp8-pp1/"
```
To use a 32K sequence, turn on the re-computation feature and change the value of seq-length to 32768. The parameter configuration is as follows:
```shell
--seq-length 32768 \
--recompute-granularity full \
--recompute-method block \
--recompute-num-layers 80 \
```
--recompute-granularity full \
--recompute-method block \
--recompute-num-layers 80 \
```
Launch Qwen-72B pre-training script: examples/qwen/pretrain_qwen_72b_ptd.sh
```shell
Launch Qwen-72B pre-training script: examples/qwen/pretrain_qwen_72b_ptd.sh
```shell
bash examples/qwen/pretrain_qwen_72b_ptd.sh
```
**Note**: If using multi machine training, and no data sharing configuration on the mechines, it's necessary to add the parameter `--no-shared-storage`. This parameter will determine whether non master nodes need to load data based on distributed parameters, and check the corresponding cache and generated data.
```
### Performance
@ -687,8 +683,7 @@ The performance of Qwen-72B in **Ascend NPU** and **Reference**:
## Inference
Config qwen-72b inference script: examples/qwen/generate_qwen_72b_ptd.sh
Config qwen-72b inference script: tasks/inference/generate_qwen_72b_ptd.sh
```bash
# ascend-toolkit path
@ -699,20 +694,19 @@ CHECKPOINT="./model_weights/Qwen-72B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Qwen-72B/"
```
Launch qwen-72b inference script: examples/qwen/generate_qwen_72b_ptd.sh
Launch qwen-72b inference script: tasks/inference/generate_qwen_72b_ptd.sh
```bash
bash examples/qwen/generate_qwen_72b_ptd.sh
bash tasks/inference/generate_qwen_72b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/qwen/qwen_72b_inference.png)
![Inference](../../sources/images/qwen/qwen_72b_inference.png)
## Evaluation
We use the [CEval benchmark](https://huggingface.co/datasets/ceval/ceval-exam) and [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
Config qwen-72b evaluation script: examples/qwen/evaluate_qwen_72b_ptd.sh
Config qwen-72b evaluation script: tasks/evaluation/evaluate_qwen_72b_ptd.sh
```bash
# ascend-toolkit path
@ -730,7 +724,7 @@ TASK="mmlu" # "ceval" for ceval task
Launch qwen-72b evaluation
```bash
bash examples/qwen/evaluate_qwen_72b_ptd.sh
bash ./tasks/evaluation/evaluate_qwen_72b_ptd.sh
```
| Task | Subset | Question | OpenSource | NPU |

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -88,6 +89,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$OUTPUT_ARGS \
--tokenizer-kwargs 'eos_token' '<|endoftext|>' 'pad_token' '<|extra_0|>' \
--distributed-backend nccl \
--jit-compile \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_qwen_14b.log

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -89,6 +90,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$OUTPUT_ARGS \
--tokenizer-kwargs 'eos_token' '<|endoftext|>' 'pad_token' '<|extra_0|>' \
--distributed-backend nccl \
--jit-compile \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_qwen_72b.log

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -88,6 +89,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$OUTPUT_ARGS \
--tokenizer-kwargs 'eos_token' '<|endoftext|>' 'pad_token' '<|extra_0|>' \
--distributed-backend nccl \
--jit-compile \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_qwen_7b.log

File diff suppressed because it is too large.

File diff suppressed because it is too large.

View File

@ -1,67 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=1
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
# please fill these path configurations
CHECKPOINT="your model ckpt path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="your data path"
TASK="mmlu"
TP=1
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK} \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--seq-length 8192 \
--max-new-tokens 1 \
--max-position-embeddings 8192 \
--num-layers 24 \
--hidden-size 1024 \
--ffn-hidden-size 2816 \
--num-attention-heads 16 \
--disable-bias-linear \
--swiglu \
--position-embedding-type rope \
--load $CHECKPOINT \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--micro-batch-size 1 \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--add-qkv-bias \
--make-vocab-size-divisible-by 1 \
--padded-vocab-size 151936 \
--rotary-base 1000000 \
--no-gradient-accumulation-fusion \
--attention-softmax-in-fp32 \
--seed 42 \
--no-chat-template \
| tee logs/eval_qwen15_0point5b_${TASK}.log

View File

@ -1,69 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
# please fill these path configurations
CHECKPOINT="your model ckpt path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="your data path"
TASK="mmlu"
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK} \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--seq-length 8192 \
--max-new-tokens 1 \
--max-position-embeddings 8192 \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13696 \
--num-attention-heads 40 \
--disable-bias-linear \
--swiglu \
--position-embedding-type rope \
--load $CHECKPOINT \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--micro-batch-size 1 \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--untie-embeddings-and-output-weights \
--add-qkv-bias \
--make-vocab-size-divisible-by 16 \
--padded-vocab-size 152064 \
--rotary-base 1000000 \
--no-gradient-accumulation-fusion \
--attention-softmax-in-fp32 \
--seed 42 \
--bf16 \
--no-chat-template \
| tee logs/eval_qwen15_14b_${TASK}.log

View File

@ -1,69 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=1
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
# please fill these path configurations
CHECKPOINT="your model ckpt path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="your data path"
TASK="mmlu"
TP=1
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK} \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--seq-length 8192 \
--max-new-tokens 1 \
--max-position-embeddings 8192 \
--num-layers 24 \
--hidden-size 2048 \
--ffn-hidden-size 5504 \
--num-attention-heads 16 \
--disable-bias-linear \
--swiglu \
--position-embedding-type rope \
--load $CHECKPOINT \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--micro-batch-size 1 \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--untie-embeddings-and-output-weights \
--add-qkv-bias \
--make-vocab-size-divisible-by 1 \
--padded-vocab-size 151936 \
--rotary-base 1000000 \
--no-gradient-accumulation-fusion \
--attention-softmax-in-fp32 \
--seed 42 \
--bf16 \
--no-chat-template \
| tee logs/eval_qwen15_1point8b_${TASK}.log

View File

@ -1,60 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1800
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
NPU_PER_NODE=8
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPU_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPU_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK \
--master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="./mmlu/data/test"
TASK="mmlu"
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task $TASK \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 64 \
--hidden-size 5120 \
--num-attention-heads 40 \
--ffn-hidden-size 27392 \
--max-position-embeddings 8192 \
--seq-length 8192 \
--padded-vocab-size 152064 \
--rotary-base 1000000 \
--make-vocab-size-divisible-by 1 \
--untie-embeddings-and-output-weights \
--micro-batch-size 1 \
--swiglu \
--disable-bias-linear \
--add-qkv-bias \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--load ${CHECKPOINT} \
--normalization RMSNorm \
--position-embedding-type rope \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--tokenizer-not-use-fast \
--max-new-tokens 1 \
--bf16 \
--group-query-attention \
--num-query-groups 8 \
--no-chat-template \
--seed 42 \
| tee logs/eval_qwen15_32b_${TASK}.log

View File

@ -1,57 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=2
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT="your model ckpt path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="your data path"
TASK="mmlu"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--seq-length 8192 \
--max-new-tokens 1 \
--max-position-embeddings 8192 \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--num-layers 40 \
--hidden-size 2560 \
--ffn-hidden-size 6912 \
--num-attention-heads 20 \
--disable-bias-linear \
--swiglu \
--position-embedding-type rope \
--load $CHECKPOINT \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--bf16 \
--micro-batch-size 1 \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--untie-embeddings-and-output-weights \
--add-qkv-bias \
--make-vocab-size-divisible-by 1 \
--seed 42 \
--rotary-base 5000000 \
--no-chat-template \
--padded-vocab-size 151936 | tee ./logs/eval_qwen15_4b_${TASK}.log

View File

@ -1,58 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1800
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
NPU_PER_NODE=8
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPU_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPU_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK \
--master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="./mmlu/data/test"
TASK="mmlu"
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task $TASK \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 64 \
--hidden-size 8192 \
--num-attention-heads 64 \
--ffn-hidden-size 24576 \
--max-position-embeddings 8192 \
--seq-length 8192 \
--padded-vocab-size 152064 \
--rotary-base 1000000 \
--make-vocab-size-divisible-by 1 \
--untie-embeddings-and-output-weights \
--micro-batch-size 1 \
--swiglu \
--disable-bias-linear \
--add-qkv-bias \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--load ${CHECKPOINT} \
--normalization RMSNorm \
--position-embedding-type rope \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--tokenizer-not-use-fast \
--max-new-tokens 1 \
--bf16 \
--no-chat-template \
--seed 42 \
| tee logs/eval_qwen15_72b_${TASK}.log

Some files were not shown because too many files have changed in this diff.