!1362 Add data-cache detection and generation on non-master nodes for pre-training without shared storage

Merge pull request !1362 from glhyy/1.0.0
This commit is contained in:
glhyy 2024-06-24 07:11:13 +00:00 committed by i-robot
parent 8f4f1079c9
commit c9fa42a3b3
27 changed files with 440 additions and 37 deletions

View File

@ -34,6 +34,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -164,7 +165,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -34,6 +34,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -159,7 +160,7 @@ Start pre-training Aquila-7B model:
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -48,6 +48,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -178,7 +179,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -276,6 +277,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -414,7 +416,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -48,6 +48,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -177,7 +178,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
@ -278,6 +279,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -418,7 +420,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -45,6 +45,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -178,7 +179,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -275,6 +276,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -407,7 +409,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -45,6 +45,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -177,7 +178,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -272,6 +273,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -403,7 +405,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -30,6 +30,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -152,7 +153,7 @@ CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -242,6 +243,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -368,7 +370,7 @@ DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
bash examples/bloom/pretrain_bloom_176b.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
## Performance

View File

@ -30,6 +30,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -151,7 +152,7 @@ CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -241,6 +242,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -369,7 +371,7 @@ Run the examples/bloom/pretrain_bloom_176b.sh on all nodes in the cluster.
bash examples/bloom/pretrain_bloom_176b.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
## Performance

View File

@ -43,6 +43,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -171,7 +172,7 @@ DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -235,6 +236,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -323,7 +325,7 @@ DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -44,6 +44,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -172,7 +173,7 @@ DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -234,6 +235,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -322,7 +324,7 @@ DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -46,6 +46,7 @@ LLaMA-7B/13B 训练的硬件配置如下:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -242,7 +243,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-13b-hf/"
5.3 Launch the LLaMA-7B/13B pre-training script
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
LLaMA-7B
@ -482,6 +483,7 @@ LLaMA-33B/65B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -677,7 +679,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-65b-hf/"
5.3 Launch the pre-training script:
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
Launch the llama-33B pre-training script: ./examples/llama/pretrain_llama_33B_ptd_32p.sh

View File

@ -45,6 +45,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -237,7 +238,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-13b-hf/"
5.3 Launch LLaMA-7B/13B pre-training script.
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
LLaMA-7B
@ -466,6 +467,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -659,7 +661,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-65b-hf/"
5.3 Launch pre-training script:
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
Launch llama-33B pre-training script : ModelLink/examples/llama/pretrain_llama_33B_ptd_32p.sh

View File

@ -54,6 +54,7 @@ LLAMA2-7B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -183,7 +184,7 @@ python tools/checkpoint/util.py
```shell
bash examples/llama2/pretrain_llama2_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
6. Fine-tuning
6.1 Prepare the fine-tuning dataset
@ -406,6 +407,7 @@ LLaMA2-13B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -530,7 +532,7 @@ python tools/checkpoint/util.py \
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
6. Fine-tuning
6.1 Prepare the fine-tuning dataset
@ -696,6 +698,7 @@ LLaMA2-34B/70B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -957,7 +960,7 @@ python tools/checkpoint/util.py \
bash examples/llama2/pretrain_llama2_70b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
6. Fine-tuning
6.1 Prepare the fine-tuning dataset

View File

@ -52,6 +52,7 @@ Here's a hardware summary of pre-training LLAMA2-7B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -194,7 +195,7 @@ Here's a hardware summary of pre-training LLAMA2-7B:
bash examples/llama2/pretrain_llama2_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
6. fine-tuning
@ -415,7 +416,8 @@ Here's a hardware summary of pre-training LLaMA2-13B:
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -559,7 +561,7 @@ Here's a hardware summary of pre-training LLaMA2-13B:
```shell
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
```shell
# download datasets
@ -746,6 +748,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -990,7 +993,7 @@ pip install -r requirements.txt
Launch pre-training script
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
LLaMA2-34B: examples/llama2/pretrain_llama2_34B_ptd_16p.sh

View File

@ -46,6 +46,7 @@
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -200,7 +201,7 @@
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
```
**Note**: For multi-machine training without shared data storage, synchronize the data generated on the first node to the other nodes.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
2. Fine-tuning

View File

@ -46,6 +46,7 @@ Recommended hardware configuration for inference:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -201,7 +202,7 @@ Recommended hardware configuration for inference:
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
```
**Note**: For multi-machine training without shared data storage, synchronize the data generated on the first node to the other nodes.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
2. Fine-Tuning

View File

@ -54,6 +54,7 @@ Qwen-7B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -202,7 +203,7 @@ Qwen-7B 训练的硬件配置:
bash examples/qwen/pretrain_qwen_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -289,6 +290,7 @@ Qwen-14B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -442,7 +444,7 @@ Qwen-14B 训练的硬件配置:
bash examples/qwen/pretrain_qwen_14b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -529,6 +531,7 @@ Qwen-72B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -673,7 +676,7 @@ Qwen-72B 训练的硬件配置:
bash examples/qwen/pretrain_qwen_72b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -52,6 +52,7 @@ Here's a hardware summary of pre-training Qwen-7B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -198,7 +199,7 @@ Here's a hardware summary of pre-training Qwen-7B:
```shell
bash examples/qwen/pretrain_qwen_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -283,6 +284,7 @@ Here's a hardware summary of pre-training Qwen-14B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -523,6 +525,7 @@ Here's a hardware summary of pre-training Qwen-72B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

View File

@ -43,6 +43,7 @@ def process_args(parser):
    parser = _add_data_args(parser)
    parser = _add_moe_args(parser)
    parser = _add_num_layer_allocation(parser)
    parser = _add_dataset_args(parser)
    return parser
@ -122,3 +123,13 @@ def _add_network_size_args(parser):
                       help='set padded vocab size'
                       )
    return parser


def _add_dataset_args(parser):
    group = parser.add_argument_group(title='dataset_args')
    group.add_argument('--no-shared-storage',
                       action='store_true',
                       default=False,
                       help='Set this flag if there is no shared storage across nodes.'
                       )
    return parser
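For orientation, a minimal standalone sketch (plain argparse rather than the full ModelLink parser) of how the flag surfaces downstream: argparse converts the dashes, so later code in this PR reads it as `args.no_shared_storage`.

```python
import argparse

# Minimal sketch only: the real parser is assembled by Megatron/ModelLink.
parser = argparse.ArgumentParser()
group = parser.add_argument_group(title='dataset_args')
group.add_argument('--no-shared-storage',
                   action='store_true',
                   default=False,
                   help='Set this flag if there is no shared storage across nodes.')

args = parser.parse_args(['--no-shared-storage'])
assert args.no_shared_storage is True  # '--no-shared-storage' becomes args.no_shared_storage
```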

View File

@ -19,3 +19,5 @@ from .parallel_state import (initialize_model_parallel_decorator, destroy_model_
                             get_expert_model_parallel_world_size, get_expert_parallel_group,
                             get_expert_parallel_rank, get_expert_parallel_world_size,
                             set_expert_model_parallel_world_size, set_expert_model_parallel_rank)
from .datasets.blended_megatron_dataset_builder import _build_generic_dataset
from .datasets.gpt_dataset import _build_document_sample_shuffle_indices

View File

View File

@ -0,0 +1,82 @@
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
import logging
from typing import Any, Optional, Type, Union

import torch

from megatron import get_args
from megatron.core.datasets.blended_dataset import BlendedDataset
from megatron.core.datasets.indexed_dataset import MMapIndexedDataset
from megatron.core.datasets.megatron_dataset import MegatronDataset

logger = logging.getLogger(__name__)

DistributedDataset = Union[BlendedDataset, MegatronDataset, MMapIndexedDataset]

from ..parallel_state import get_pipeline_model_parallel_node_info


def need_to_build_dataset():
    """Decide whether the current rank should build (and cache) the dataset indices.

    With shared storage, only global rank 0 builds. Without it, on each node the rank
    sitting on the first local device that hosts a pipeline-parallel edge stage
    (first or last stage of its group) builds the indices.
    """
    args = get_args()
    share_save = not args.no_shared_storage
    rank = torch.distributed.get_rank()
    if share_save:
        return rank == 0
    gpus_per_node = torch.cuda.device_count()
    node_pp_group_info = get_pipeline_model_parallel_node_info()
    flag = False
    num_edge_ranks = sum([x != 1 for x in node_pp_group_info])
    if num_edge_ranks >= 1:
        first_idx = node_pp_group_info.index([x for x in node_pp_group_info if x != 1][0])
        flag = (first_idx == rank % gpus_per_node)
    return flag


def _build_generic_dataset(
    self, cls: Type[DistributedDataset], *args: Any,
) -> Optional[DistributedDataset]:
    """Build the DistributedDataset

    Return None if and only if the underlying MegatronDataset class is not built on the current
    rank and torch.distributed is initialized.

    Args:
        cls (Type[DistributedDataset]): The DistributedDataset class to be built

        args (Tuple[Any]): The positional arguments used to build the provided
        DistributedDataset class

    Raises:
        Exception: When the dataset constructor raises an OSError

    Returns:
        Optional[DistributedDataset]: The DistributedDataset instantiation or None
    """
    if torch.distributed.is_initialized():
        dataset = None

        # First, build on the ranks that host the first or last stage of a
        # pipeline-parallel group (or rank 0 when storage is shared)
        to_build_flag = need_to_build_dataset()
        if to_build_flag and getattr(self.config, "is_built_on_rank")():
            try:
                dataset = cls(*args)
            except OSError as err:
                log = (
                    f"Failed to write dataset materials to the data cache directory. "
                    + f"Please supply a directory to which you have write access via "
                    + f"the path_to_cache attribute in BlendedMegatronDatasetConfig and "
                    + f"retry. Refer to the preserved traceback above for more information."
                )
                raise Exception(log) from err

        torch.distributed.barrier()

        # Then, build on the remaining ranks, which now find the cache on disk
        if not to_build_flag and getattr(self.config, "is_built_on_rank")():
            dataset = cls(*args)

        return dataset

    return cls(*args)
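The control flow above reduces to a build-then-barrier-then-load pattern: one designated rank per shared filesystem writes the index cache while the others wait, then the waiting ranks construct the dataset from the cache that now exists. A hypothetical, simplified sketch of that pattern (the `is_built_on_rank` check from the real code is omitted):

```python
import torch.distributed as dist


def build_or_load(cls, should_build, *args):
    # `should_build` plays the role of need_to_build_dataset(): True on exactly
    # one rank per shared filesystem, False everywhere else.
    if not dist.is_initialized():
        return cls(*args)
    dataset = cls(*args) if should_build else None  # writer rank materializes the cache
    dist.barrier()                                  # readers wait for the cache to hit disk
    return dataset if dataset is not None else cls(*args)
```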

View File

@ -0,0 +1,233 @@
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
import logging
import os
import time
from typing import Tuple

import numpy

from megatron.core.datasets.utils import log_single_rank
from megatron.core.datasets.gpt_dataset import (
    _get_num_tokens_per_epoch,
    _get_num_epochs,
    _build_document_index,
    _build_shuffle_index,
)
from modellink.error_utils import GPTDatasetSampleIndexError
from .blended_megatron_dataset_builder import need_to_build_dataset

logger = logging.getLogger(__name__)


def _build_document_sample_shuffle_indices(
    self,
) -> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]:
    """Build the document index, the sample index, and the shuffle index

    The document index:
        -- 1-D
        -- An ordered array of document ids

    The sample index:
        -- 2-D
        -- The document indices and offsets which mark the start of every sample

    The shuffle index:
        -- 1-D
        -- A random permutation of index range of the sample index

    Returns:
        Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]: The document index, the sample
        index, and the shuffle index

    TODO: Explain the 80% threshold
    """
    path_to_cache = getattr(self.config, "path_to_cache")
    if path_to_cache is None:
        path_to_cache = os.path.join(
            self.indexed_dataset.path_prefix, "cache", f"{type(self).__name__}_indices"
        )

    get_path_to = lambda suffix: os.path.join(
        path_to_cache, f"{self.unique_description_hash}-{type(self).__name__}-{suffix}"
    )
    path_to_description = get_path_to("description.txt")
    path_to_document_index = get_path_to("document_index.npy")
    path_to_sample_index = get_path_to("sample_index.npy")
    path_to_shuffle_index = get_path_to("shuffle_index.npy")
    cache_hit = all(
        map(
            os.path.isfile,
            [
                path_to_description,
                path_to_document_index,
                path_to_sample_index,
                path_to_shuffle_index,
            ],
        )
    )

    num_tokens_per_epoch = _get_num_tokens_per_epoch(self.indexed_dataset, self.indexed_indices)

    sequence_length = getattr(self.config, "sequence_length")

    num_epochs = _get_num_epochs(num_tokens_per_epoch, sequence_length, self.num_samples)

    # A rank that hosts the first or last stage of its pipeline_model_parallel_group
    # needs to build the dataset indices itself.
    if not cache_hit and need_to_build_dataset():

        log_single_rank(
            logger,
            logging.INFO,
            f"Build and save the {type(self).__name__} {self.index_split.name} indices",
        )

        if num_epochs == 1:
            separate_final_epoch = False
        else:
            # Get the number of samples for the last epoch
            num_samples_sans_final_epoch = (
                (num_epochs - 1) * num_tokens_per_epoch - 1
            ) // sequence_length
            num_samples_from_final_epoch = self.num_samples - num_samples_sans_final_epoch
            num_samples_per_epoch = (num_tokens_per_epoch - 1) // sequence_length

            # num_samples_from_final_epoch should be non-negative
            assert num_samples_from_final_epoch >= 0

            # num_samples_from_final_epoch should not exceed max value
            assert num_samples_from_final_epoch <= num_samples_per_epoch + 1

            # Separate the final epoch if it falls below the threshold
            threshold = 0.80
            separate_final_epoch = num_samples_from_final_epoch < int(
                threshold * num_samples_per_epoch
            )

            log_single_rank(
                logger,
                logging.DEBUG,
                f"> num_samples_from_final_epoch: {num_samples_from_final_epoch}",
            )
            log_single_rank(logger, logging.DEBUG, f"> threshold: {threshold}")
            log_single_rank(
                logger, logging.DEBUG, f"> num_samples_per_epoch: {num_samples_per_epoch}"
            )
            log_single_rank(
                logger, logging.DEBUG, f"> separate_final_epoch: {separate_final_epoch}"
            )

        numpy_random_state = numpy.random.RandomState(getattr(self.config, "random_seed"))

        os.makedirs(path_to_cache, exist_ok=True)

        # Write the description
        with open(path_to_description, "wt") as writer:
            writer.write(self.unique_description)

        # Build the document index
        log_single_rank(
            logger,
            logging.INFO,
            f"\tBuild and save the document index to {os.path.basename(path_to_document_index)}",
        )
        t_beg = time.time()
        document_index = _build_document_index(
            self.indexed_indices, num_epochs, numpy_random_state, separate_final_epoch
        )
        numpy.save(path_to_document_index, document_index, allow_pickle=True)
        t_end = time.time()
        log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

        # Build the sample index
        log_single_rank(
            logger,
            logging.INFO,
            f"\tBuild and save the sample index to {os.path.basename(path_to_sample_index)}",
        )
        t_beg = time.time()
        from megatron.core.datasets import helpers

        assert document_index.dtype == numpy.int32
        assert self.indexed_dataset.sequence_lengths.dtype == numpy.int32
        sample_index = helpers.build_sample_idx(
            self.indexed_dataset.sequence_lengths,
            document_index,
            sequence_length,
            num_epochs,
            num_tokens_per_epoch,
        )
        if any(sample_index[:, 0] < 0):
            _url = "https://gitee.com/ascend/ModelLink/wikis/megatron%20data%20helpers%E5%8F%AF%E8%83%BD%E5%BC%95%E5%85%A5%E7%9A%84%E9%97%AE%E9%A2%98"
            raise GPTDatasetSampleIndexError(f"Bad sample index. Visit {_url} for more information")
        numpy.save(path_to_sample_index, sample_index, allow_pickle=True)
        t_end = time.time()
        log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

        # Build the shuffle index
        log_single_rank(
            logger,
            logging.INFO,
            f"\tBuild and save the shuffle index to {os.path.basename(path_to_shuffle_index)}",
        )
        t_beg = time.time()
        if separate_final_epoch:
            shuffle_index = _build_shuffle_index(
                num_samples_sans_final_epoch, sample_index.shape[0] - 1, numpy_random_state
            )
        else:
            shuffle_index = _build_shuffle_index(
                sample_index.shape[0] - 1, sample_index.shape[0] - 1, numpy_random_state
            )
        numpy.save(path_to_shuffle_index, shuffle_index, allow_pickle=True)
        t_end = time.time()
        log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger, logging.INFO, f"Load the {type(self).__name__} {self.index_split.name} indices"
    )

    log_single_rank(
        logger,
        logging.INFO,
        f"\tLoad the document index from {os.path.basename(path_to_document_index)}",
    )
    t_beg = time.time()
    document_index = numpy.load(path_to_document_index, allow_pickle=True, mmap_mode='r')
    t_end = time.time()
    log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger,
        logging.INFO,
        f"\tLoad the sample index from {os.path.basename(path_to_sample_index)}",
    )
    t_beg = time.time()
    sample_index = numpy.load(path_to_sample_index, allow_pickle=True, mmap_mode='r')
    if any(sample_index[:, 0] < 0):
        _url = "https://gitee.com/ascend/ModelLink/wikis/megatron%20data%20helpers%E5%8F%AF%E8%83%BD%E5%BC%95%E5%85%A5%E7%9A%84%E9%97%AE%E9%A2%98"
        raise GPTDatasetSampleIndexError(f"Bad sample index. Visit {_url} for more information")
    t_end = time.time()
    log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger,
        logging.INFO,
        f"\tLoad the shuffle index from {os.path.basename(path_to_shuffle_index)}",
    )
    t_beg = time.time()
    shuffle_index = numpy.load(path_to_shuffle_index, allow_pickle=True, mmap_mode='r')
    t_end = time.time()
    log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger, logging.INFO, f"> total number of samples: {sample_index.shape[0] - 1}"
    )
    log_single_rank(logger, logging.INFO, f"> total number of epochs: {num_epochs}")

    return document_index, sample_index, shuffle_index
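The `separate_final_epoch` decision above is easier to follow with concrete numbers. A standalone sketch with purely illustrative values (1,000,000 tokens per epoch, sequence length 4096, 700 requested samples):

```python
# Purely illustrative values, not taken from any real dataset.
num_tokens_per_epoch = 1_000_000
sequence_length = 4096
num_samples = 700  # 700 samples need 700 * 4096 + 1 tokens, i.e. 3 passes over 1,000,000 tokens
num_epochs = 3

num_samples_per_epoch = (num_tokens_per_epoch - 1) // sequence_length                            # 244
num_samples_sans_final_epoch = ((num_epochs - 1) * num_tokens_per_epoch - 1) // sequence_length  # 488
num_samples_from_final_epoch = num_samples - num_samples_sans_final_epoch                        # 212

threshold = 0.80
# 212 >= int(0.80 * 244) = 195, so the final epoch is not shuffled separately in this case
separate_final_epoch = num_samples_from_final_epoch < int(threshold * num_samples_per_epoch)
print(separate_final_epoch)  # False
```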

View File

@ -23,6 +23,7 @@ import megatron
_EXPERT_PARALLEL_GROUP = None
_MPU_EXPERT_MODEL_PARALLEL_RANK = None
_MPU_EXPERT_MODEL_PARALLEL_WORLD_SIZE = None
_PIPELINE_MODEL_PARALLEL_NODE_INFO = None
def initialize_model_parallel_decorator(initialize_model_parallel):
@ -114,6 +115,22 @@ def initialize_model_parallel_decorator(initialize_model_parallel):
print_rank_0(f"all tp gourps {all_tp_groups}")
print_rank_0(f"all ep groups {all_ep_groups}")
print_rank_0(f"all dp groups {all_data_parallel_group_ranks}")
gpus_per_node = torch.cuda.device_count()
# 0: Start of the pipeline_model_parallel_group
# 2: End of the pipeline_model_parallel_group
# 1: Other
global _PIPELINE_MODEL_PARALLEL_NODE_INFO
_PIPELINE_MODEL_PARALLEL_NODE_INFO = [1] * gpus_per_node
node_id = rank // gpus_per_node
for i in range(num_pipeline_model_parallel_groups):
ranks = range(i, world_size, num_pipeline_model_parallel_groups)
# When on the same node
if ranks[0] // gpus_per_node == node_id:
_PIPELINE_MODEL_PARALLEL_NODE_INFO[ranks[0] % gpus_per_node] = 0
if ranks[-1] // gpus_per_node == node_id:
_PIPELINE_MODEL_PARALLEL_NODE_INFO[ranks[-1] % gpus_per_node] = 2
return wrapper
@ -199,3 +216,7 @@ def destroy_model_parallel_decorator(destroy_model_parallel):
        _MPU_EXPERT_MODEL_PARALLEL_WORLD_SIZE = None

    return wrapper
def get_pipeline_model_parallel_node_info():
    return _PIPELINE_MODEL_PARALLEL_NODE_INFO
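To make the marking scheme concrete, consider a hypothetical job with 16 ranks, 8 devices per node, and pipeline-model-parallel size 2, so each pipeline group is a pair of ranks (i, i + 8). A standalone sketch of the same bookkeeping:

```python
world_size = 16                     # hypothetical topology: 2 nodes x 8 devices
gpus_per_node = 8
pipeline_model_parallel_size = 2
num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size  # 8


def node_info_for(rank):
    # 0 = hosts the first stage of a pipeline group, 2 = the last stage, 1 = neither
    info = [1] * gpus_per_node
    node_id = rank // gpus_per_node
    for i in range(num_pipeline_model_parallel_groups):
        ranks = range(i, world_size, num_pipeline_model_parallel_groups)  # e.g. (0, 8)
        if ranks[0] // gpus_per_node == node_id:
            info[ranks[0] % gpus_per_node] = 0
        if ranks[-1] // gpus_per_node == node_id:
            info[ranks[-1] % gpus_per_node] = 2
    return info


print(node_info_for(0))   # node 0: [0, 0, 0, 0, 0, 0, 0, 0]  (all first stages)
print(node_info_for(8))   # node 1: [2, 2, 2, 2, 2, 2, 2, 2]  (all last stages)
```

With `--no-shared-storage`, `need_to_build_dataset` then picks the first local rank whose entry is not 1, so in this example local rank 0 on each node builds the dataset cache.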

View File

@ -159,3 +159,15 @@ class IsNotValidError(Exception):
def ensure_valid(expression, error_message=None):
    if not expression:
        raise IsNotValidError(error_message)


class GPTDatasetSampleIndexError(Exception):
    def __init__(self, error_message):
        super().__init__()
        self._error_message = error_message

    def __repr__(self):
        if self._error_message:
            return self._error_message
        else:
            return "Bad sample index."

View File

@ -24,7 +24,8 @@ from .core import (vocab_embedding_wrapper, initialize_model_parallel_decorator,
                   destroy_model_parallel_decorator, get_expert_parallel_group,
                   get_expert_parallel_rank, get_expert_model_parallel_rank,
                   get_expert_parallel_world_size, get_expert_model_parallel_world_size,
                   set_expert_model_parallel_rank, set_expert_model_parallel_world_size)
                   set_expert_model_parallel_rank, set_expert_model_parallel_world_size,
                   _build_generic_dataset, _build_document_sample_shuffle_indices)
from .data import build_pretraining_data_loader
from .tokenizer import build_tokenizer
from .arguments import parse_args_decorator
@ -79,6 +80,11 @@ def exe_adaptor():
        megatron.checkpointing._load_base_checkpoint)
    megatron.training.load_checkpoint = load_checkpoint_wrapper(
        megatron.checkpointing.load_checkpoint)

    from megatron.core.datasets.blended_megatron_dataset_builder import BlendedMegatronDatasetBuilder
    from megatron.core.datasets.gpt_dataset import GPTDataset
    GPTDataset._build_document_sample_shuffle_indices = _build_document_sample_shuffle_indices
    BlendedMegatronDatasetBuilder._build_generic_dataset = _build_generic_dataset
def set_moe_attr():

View File

@ -55,7 +55,7 @@ class TestConvertCkptFromHuggingface(unittest.TestCase):
        # encoder has a common final_norm and each one has the following six layers
        weight_common_content['encoder'].pop('final_norm.weight')
        self.assertEqual(len(weight_common_content['encoder']) / 6, 32)
        self.assertEqual(len(weight_common_content['encoder']) / 10, 32)
        self.assertEqual(weight_common_content['encoder']['layers.0.self_attention.query_key_value.weight'].size(), torch.Size([1536, 4096]))
        self.assertEqual(weight_common_content['encoder']['layers.0.self_attention.dense.weight'].size(), torch.Size([4096, 512]))
        self.assertEqual(weight_common_content['encoder']['layers.0.mlp.dense_h_to_4h.weight'].size(), torch.Size([2752, 4096]))