!1362 Add data-cache detection and generation on non-master nodes for pre-training without shared storage

Merge pull request !1362 from glhyy/1.0.0
This commit is contained in:
glhyy 2024-06-24 07:11:13 +00:00 committed by i-robot
parent 8f4f1079c9
commit c9fa42a3b3
27 changed files with 440 additions and 37 deletions

View File

@ -34,6 +34,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -164,7 +165,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -34,6 +34,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -159,7 +160,7 @@ Start pre-training Aquila-7B model:
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -48,6 +48,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -178,7 +179,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -276,6 +277,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -414,7 +416,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -48,6 +48,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -177,7 +178,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
@ -278,6 +279,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -418,7 +420,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -45,6 +45,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -178,7 +179,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -275,6 +276,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -407,7 +409,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -45,6 +45,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -177,7 +178,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -272,6 +273,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -403,7 +405,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -30,6 +30,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -152,7 +153,7 @@ CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -242,6 +243,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -368,7 +370,7 @@ DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
bash examples/bloom/pretrain_bloom_176b.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
## Performance

View File

@ -30,6 +30,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -151,7 +152,7 @@ CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -241,6 +242,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -369,7 +371,7 @@ Run the examples/bloom/pretrain_bloom_176b.sh on all nodes in the cluster.
bash examples/bloom/pretrain_bloom_176b.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
## Performance

View File

@ -43,6 +43,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -171,7 +172,7 @@ DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -235,6 +236,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -323,7 +325,7 @@ DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -44,6 +44,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -172,7 +173,7 @@ DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -234,6 +235,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -322,7 +324,7 @@ DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -46,6 +46,7 @@ LLaMA-7B/13B 训练的硬件配置如下:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -242,7 +243,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-13b-hf/"
5.3 Launch the LLaMA-7B/13B pre-training script
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
LLaMA-7B
@ -482,6 +483,7 @@ LLaMA-33B/65B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -677,7 +679,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-65b-hf/"
5.3 Launch the pre-training script:
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
Launch the llama-33B pre-training script: ./examples/llama/pretrain_llama_33B_ptd_32p.sh

View File

@ -45,6 +45,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -237,7 +238,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-13b-hf/"
5.3 Launch LLaMA-7B/13B pre-training script.
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
LLaMA-7B
@ -466,6 +467,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -659,7 +661,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-65b-hf/"
5.3 Launch pre-training script:
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
Launch llama-33B pre-training script : ModelLink/examples/llama/pretrain_llama_33B_ptd_32p.sh

View File

@ -54,6 +54,7 @@ LLAMA2-7B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -183,7 +184,7 @@ python tools/checkpoint/util.py
```shell
bash examples/llama2/pretrain_llama2_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
6. Fine-tuning
6.1 Prepare the fine-tuning dataset
@ -406,6 +407,7 @@ LLaMA2-13B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -530,7 +532,7 @@ python tools/checkpoint/util.py \
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
6. Fine-tuning
6.1 Prepare the fine-tuning dataset
@ -696,6 +698,7 @@ LLaMA2-34B/70B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -957,7 +960,7 @@ python tools/checkpoint/util.py \
bash examples/llama2/pretrain_llama2_70b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
6. Fine-tuning
6.1 Prepare the fine-tuning dataset

View File

@ -52,6 +52,7 @@ Here's a hardware summary of pre-training LLAMA2-7B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -194,7 +195,7 @@ Here's a hardware summary of pre-training LLAMA2-7B:
bash examples/llama2/pretrain_llama2_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
6. fine-tuning
@ -415,7 +416,8 @@ Here's a hardware summary of pre-training LLaMA2-13B:
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -559,7 +561,7 @@ Here's a hardware summary of pre-training LLaMA2-13B:
```shell
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
```shell
# download datasets
@ -746,6 +748,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -990,7 +993,7 @@ pip install -r requirements.txt
Launch pre-training script
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
LLaMA2-34B: examples/llama2/pretrain_llama2_34B_ptd_16p.sh

View File

@ -46,6 +46,7 @@
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -200,7 +201,7 @@
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
```
**Note**: For multi-machine training without shared data storage, synchronize the data generated on the first node to the other nodes.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
2. Fine-tuning

View File

@ -46,6 +46,7 @@ Recommended hardware configuration for inference:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -201,7 +202,7 @@ Recommended hardware configuration for inference:
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
```
**Note**: For multi-machine training without shared data storage, synchronize the data generated on the first node to the other nodes.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
2. Fine-Tuning

View File

@ -54,6 +54,7 @@ Qwen-7B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -202,7 +203,7 @@ Qwen-7B 训练的硬件配置:
bash examples/qwen/pretrain_qwen_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -289,6 +290,7 @@ Qwen-14B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -442,7 +444,7 @@ Qwen-14B 训练的硬件配置:
bash examples/qwen/pretrain_qwen_14b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -529,6 +531,7 @@ Qwen-72B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -673,7 +676,7 @@ Qwen-72B 训练的硬件配置:
bash examples/qwen/pretrain_qwen_72b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance

View File

@ -52,6 +52,7 @@ Here's a hardware summary of pre-training Qwen-7B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -198,7 +199,7 @@ Here's a hardware summary of pre-training Qwen-7B:
```shell
bash examples/qwen/pretrain_qwen_7b_ptd.sh
```
**Note**: For multi-machine training, shared data storage must be configured so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated on the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared data storage, add the `--no-shared-storage` argument to the training launch script. With this argument set, the framework uses the distributed configuration to decide whether each non-master node needs to load data, checks the corresponding cache, and generates the data if it is missing.
### Performance
@ -283,6 +284,7 @@ Here's a hardware summary of pre-training Qwen-14B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -523,6 +525,7 @@ Here's a hardware summary of pre-training Qwen-72B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

View File

@ -43,6 +43,7 @@ def process_args(parser):
    parser = _add_data_args(parser)
    parser = _add_moe_args(parser)
    parser = _add_num_layer_allocation(parser)
    parser = _add_dataset_args(parser)
    return parser
@ -122,3 +123,13 @@ def _add_network_size_args(parser):
                       help='set padded vocab size'
                       )
    return parser


def _add_dataset_args(parser):
    group = parser.add_argument_group(title='dataset_args')
    group.add_argument('--no-shared-storage',
                       action='store_true',
                       default=False,
                       help='Set this flag if there is no shared storage across nodes.'
                       )
    return parser
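For orientation, a minimal standalone sketch (plain argparse rather than the full ModelLink parser) of how the flag surfaces downstream: argparse converts the dashes, so later code in this PR reads it as `args.no_shared_storage`.

```python
import argparse

# Minimal sketch only: the real parser is assembled by Megatron/ModelLink.
parser = argparse.ArgumentParser()
group = parser.add_argument_group(title='dataset_args')
group.add_argument('--no-shared-storage',
                   action='store_true',
                   default=False,
                   help='Set this flag if there is no shared storage across nodes.')

args = parser.parse_args(['--no-shared-storage'])
assert args.no_shared_storage is True  # '--no-shared-storage' becomes args.no_shared_storage
```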

View File

@ -19,3 +19,5 @@ from .parallel_state import (initialize_model_parallel_decorator, destroy_model_
                             get_expert_model_parallel_world_size, get_expert_parallel_group,
                             get_expert_parallel_rank, get_expert_parallel_world_size,
                             set_expert_model_parallel_world_size, set_expert_model_parallel_rank)
from .datasets.blended_megatron_dataset_builder import _build_generic_dataset
from .datasets.gpt_dataset import _build_document_sample_shuffle_indices

View File

View File

@ -0,0 +1,82 @@
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
import logging
from typing import Any, Optional, Type, Union

import torch

from megatron import get_args
from megatron.core.datasets.blended_dataset import BlendedDataset
from megatron.core.datasets.indexed_dataset import MMapIndexedDataset
from megatron.core.datasets.megatron_dataset import MegatronDataset

logger = logging.getLogger(__name__)

DistributedDataset = Union[BlendedDataset, MegatronDataset, MMapIndexedDataset]

from ..parallel_state import get_pipeline_model_parallel_node_info


def need_to_build_dataset():
    """Decide whether the current rank should build (and cache) the dataset indices.

    With shared storage, only global rank 0 builds. Without it, on each node the rank
    sitting on the first local device that hosts a pipeline-parallel edge stage
    (first or last stage of its group) builds the indices.
    """
    args = get_args()
    share_save = not args.no_shared_storage
    rank = torch.distributed.get_rank()
    if share_save:
        return rank == 0
    gpus_per_node = torch.cuda.device_count()
    node_pp_group_info = get_pipeline_model_parallel_node_info()
    flag = False
    num_edge_ranks = sum([x != 1 for x in node_pp_group_info])
    if num_edge_ranks >= 1:
        first_idx = node_pp_group_info.index([x for x in node_pp_group_info if x != 1][0])
        flag = (first_idx == rank % gpus_per_node)
    return flag


def _build_generic_dataset(
    self, cls: Type[DistributedDataset], *args: Any,
) -> Optional[DistributedDataset]:
    """Build the DistributedDataset

    Return None if and only if the underlying MegatronDataset class is not built on the current
    rank and torch.distributed is initialized.

    Args:
        cls (Type[DistributedDataset]): The DistributedDataset class to be built

        args (Tuple[Any]): The positional arguments used to build the provided
        DistributedDataset class

    Raises:
        Exception: When the dataset constructor raises an OSError

    Returns:
        Optional[DistributedDataset]: The DistributedDataset instantiation or None
    """
    if torch.distributed.is_initialized():
        dataset = None

        # First, build on the ranks that host the first or last stage of a
        # pipeline-parallel group (or rank 0 when storage is shared)
        to_build_flag = need_to_build_dataset()
        if to_build_flag and getattr(self.config, "is_built_on_rank")():
            try:
                dataset = cls(*args)
            except OSError as err:
                log = (
                    f"Failed to write dataset materials to the data cache directory. "
                    + f"Please supply a directory to which you have write access via "
                    + f"the path_to_cache attribute in BlendedMegatronDatasetConfig and "
                    + f"retry. Refer to the preserved traceback above for more information."
                )
                raise Exception(log) from err

        torch.distributed.barrier()

        # Then, build on the remaining ranks, which now find the cache on disk
        if not to_build_flag and getattr(self.config, "is_built_on_rank")():
            dataset = cls(*args)

        return dataset

    return cls(*args)
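The control flow above reduces to a build-then-barrier-then-load pattern: one designated rank per shared filesystem writes the index cache while the others wait, then the waiting ranks construct the dataset from the cache that now exists. A hypothetical, simplified sketch of that pattern (the `is_built_on_rank` check from the real code is omitted):

```python
import torch.distributed as dist


def build_or_load(cls, should_build, *args):
    # `should_build` plays the role of need_to_build_dataset(): True on exactly
    # one rank per shared filesystem, False everywhere else.
    if not dist.is_initialized():
        return cls(*args)
    dataset = cls(*args) if should_build else None  # writer rank materializes the cache
    dist.barrier()                                  # readers wait for the cache to hit disk
    return dataset if dataset is not None else cls(*args)
```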

View File

@ -0,0 +1,233 @@
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
import logging
import os
import time
from typing import Tuple

import numpy

from megatron.core.datasets.utils import log_single_rank
from megatron.core.datasets.gpt_dataset import (
    _get_num_tokens_per_epoch,
    _get_num_epochs,
    _build_document_index,
    _build_shuffle_index,
)
from modellink.error_utils import GPTDatasetSampleIndexError
from .blended_megatron_dataset_builder import need_to_build_dataset

logger = logging.getLogger(__name__)


def _build_document_sample_shuffle_indices(
    self,
) -> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]:
    """Build the document index, the sample index, and the shuffle index

    The document index:
        -- 1-D
        -- An ordered array of document ids

    The sample index:
        -- 2-D
        -- The document indices and offsets which mark the start of every sample

    The shuffle index:
        -- 1-D
        -- A random permutation of index range of the sample index

    Returns:
        Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]: The document index, the sample
        index, and the shuffle index

    TODO: Explain the 80% threshold
    """
    path_to_cache = getattr(self.config, "path_to_cache")
    if path_to_cache is None:
        path_to_cache = os.path.join(
            self.indexed_dataset.path_prefix, "cache", f"{type(self).__name__}_indices"
        )

    get_path_to = lambda suffix: os.path.join(
        path_to_cache, f"{self.unique_description_hash}-{type(self).__name__}-{suffix}"
    )
    path_to_description = get_path_to("description.txt")
    path_to_document_index = get_path_to("document_index.npy")
    path_to_sample_index = get_path_to("sample_index.npy")
    path_to_shuffle_index = get_path_to("shuffle_index.npy")
    cache_hit = all(
        map(
            os.path.isfile,
            [
                path_to_description,
                path_to_document_index,
                path_to_sample_index,
                path_to_shuffle_index,
            ],
        )
    )

    num_tokens_per_epoch = _get_num_tokens_per_epoch(self.indexed_dataset, self.indexed_indices)

    sequence_length = getattr(self.config, "sequence_length")

    num_epochs = _get_num_epochs(num_tokens_per_epoch, sequence_length, self.num_samples)

    # A rank that hosts the first or last stage of its pipeline_model_parallel_group
    # needs to build the dataset indices itself.
    if not cache_hit and need_to_build_dataset():

        log_single_rank(
            logger,
            logging.INFO,
            f"Build and save the {type(self).__name__} {self.index_split.name} indices",
        )

        if num_epochs == 1:
            separate_final_epoch = False
        else:
            # Get the number of samples for the last epoch
            num_samples_sans_final_epoch = (
                (num_epochs - 1) * num_tokens_per_epoch - 1
            ) // sequence_length
            num_samples_from_final_epoch = self.num_samples - num_samples_sans_final_epoch
            num_samples_per_epoch = (num_tokens_per_epoch - 1) // sequence_length

            # num_samples_from_final_epoch should be non-negative
            assert num_samples_from_final_epoch >= 0

            # num_samples_from_final_epoch should not exceed max value
            assert num_samples_from_final_epoch <= num_samples_per_epoch + 1

            # Separate the final epoch if it falls below the threshold
            threshold = 0.80
            separate_final_epoch = num_samples_from_final_epoch < int(
                threshold * num_samples_per_epoch
            )

            log_single_rank(
                logger,
                logging.DEBUG,
                f"> num_samples_from_final_epoch: {num_samples_from_final_epoch}",
            )
            log_single_rank(logger, logging.DEBUG, f"> threshold: {threshold}")
            log_single_rank(
                logger, logging.DEBUG, f"> num_samples_per_epoch: {num_samples_per_epoch}"
            )
            log_single_rank(
                logger, logging.DEBUG, f"> separate_final_epoch: {separate_final_epoch}"
            )

        numpy_random_state = numpy.random.RandomState(getattr(self.config, "random_seed"))

        os.makedirs(path_to_cache, exist_ok=True)

        # Write the description
        with open(path_to_description, "wt") as writer:
            writer.write(self.unique_description)

        # Build the document index
        log_single_rank(
            logger,
            logging.INFO,
            f"\tBuild and save the document index to {os.path.basename(path_to_document_index)}",
        )
        t_beg = time.time()
        document_index = _build_document_index(
            self.indexed_indices, num_epochs, numpy_random_state, separate_final_epoch
        )
        numpy.save(path_to_document_index, document_index, allow_pickle=True)
        t_end = time.time()
        log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

        # Build the sample index
        log_single_rank(
            logger,
            logging.INFO,
            f"\tBuild and save the sample index to {os.path.basename(path_to_sample_index)}",
        )
        t_beg = time.time()
        from megatron.core.datasets import helpers

        assert document_index.dtype == numpy.int32
        assert self.indexed_dataset.sequence_lengths.dtype == numpy.int32
        sample_index = helpers.build_sample_idx(
            self.indexed_dataset.sequence_lengths,
            document_index,
            sequence_length,
            num_epochs,
            num_tokens_per_epoch,
        )
        if any(sample_index[:, 0] < 0):
            _url = "https://gitee.com/ascend/ModelLink/wikis/megatron%20data%20helpers%E5%8F%AF%E8%83%BD%E5%BC%95%E5%85%A5%E7%9A%84%E9%97%AE%E9%A2%98"
            raise GPTDatasetSampleIndexError(f"Bad sample index. Visit {_url} for more information")
        numpy.save(path_to_sample_index, sample_index, allow_pickle=True)
        t_end = time.time()
        log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

        # Build the shuffle index
        log_single_rank(
            logger,
            logging.INFO,
            f"\tBuild and save the shuffle index to {os.path.basename(path_to_shuffle_index)}",
        )
        t_beg = time.time()
        if separate_final_epoch:
            shuffle_index = _build_shuffle_index(
                num_samples_sans_final_epoch, sample_index.shape[0] - 1, numpy_random_state
            )
        else:
            shuffle_index = _build_shuffle_index(
                sample_index.shape[0] - 1, sample_index.shape[0] - 1, numpy_random_state
            )
        numpy.save(path_to_shuffle_index, shuffle_index, allow_pickle=True)
        t_end = time.time()
        log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger, logging.INFO, f"Load the {type(self).__name__} {self.index_split.name} indices"
    )

    log_single_rank(
        logger,
        logging.INFO,
        f"\tLoad the document index from {os.path.basename(path_to_document_index)}",
    )
    t_beg = time.time()
    document_index = numpy.load(path_to_document_index, allow_pickle=True, mmap_mode='r')
    t_end = time.time()
    log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger,
        logging.INFO,
        f"\tLoad the sample index from {os.path.basename(path_to_sample_index)}",
    )
    t_beg = time.time()
    sample_index = numpy.load(path_to_sample_index, allow_pickle=True, mmap_mode='r')
    if any(sample_index[:, 0] < 0):
        _url = "https://gitee.com/ascend/ModelLink/wikis/megatron%20data%20helpers%E5%8F%AF%E8%83%BD%E5%BC%95%E5%85%A5%E7%9A%84%E9%97%AE%E9%A2%98"
        raise GPTDatasetSampleIndexError(f"Bad sample index. Visit {_url} for more information")
    t_end = time.time()
    log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger,
        logging.INFO,
        f"\tLoad the shuffle index from {os.path.basename(path_to_shuffle_index)}",
    )
    t_beg = time.time()
    shuffle_index = numpy.load(path_to_shuffle_index, allow_pickle=True, mmap_mode='r')
    t_end = time.time()
    log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger, logging.INFO, f"> total number of samples: {sample_index.shape[0] - 1}"
    )
    log_single_rank(logger, logging.INFO, f"> total number of epochs: {num_epochs}")

    return document_index, sample_index, shuffle_index
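The `separate_final_epoch` decision above is easier to follow with concrete numbers. A standalone sketch with purely illustrative values (1,000,000 tokens per epoch, sequence length 4096, 700 requested samples):

```python
# Purely illustrative values, not taken from any real dataset.
num_tokens_per_epoch = 1_000_000
sequence_length = 4096
num_samples = 700  # 700 samples need 700 * 4096 + 1 tokens, i.e. 3 passes over 1,000,000 tokens
num_epochs = 3

num_samples_per_epoch = (num_tokens_per_epoch - 1) // sequence_length                            # 244
num_samples_sans_final_epoch = ((num_epochs - 1) * num_tokens_per_epoch - 1) // sequence_length  # 488
num_samples_from_final_epoch = num_samples - num_samples_sans_final_epoch                        # 212

threshold = 0.80
# 212 >= int(0.80 * 244) = 195, so the final epoch is not shuffled separately in this case
separate_final_epoch = num_samples_from_final_epoch < int(threshold * num_samples_per_epoch)
print(separate_final_epoch)  # False
```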

View File

@ -23,6 +23,7 @@ import megatron
_EXPERT_PARALLEL_GROUP = None
_MPU_EXPERT_MODEL_PARALLEL_RANK = None
_MPU_EXPERT_MODEL_PARALLEL_WORLD_SIZE = None
_PIPELINE_MODEL_PARALLEL_NODE_INFO = None
def initialize_model_parallel_decorator(initialize_model_parallel):
@ -114,6 +115,22 @@ def initialize_model_parallel_decorator(initialize_model_parallel):
print_rank_0(f"all tp gourps {all_tp_groups}")
print_rank_0(f"all ep groups {all_ep_groups}")
print_rank_0(f"all dp groups {all_data_parallel_group_ranks}")
gpus_per_node = torch.cuda.device_count()
# 0: Start of the pipeline_model_parallel_group
# 2: End of the pipeline_model_parallel_group
# 1: Other
global _PIPELINE_MODEL_PARALLEL_NODE_INFO
_PIPELINE_MODEL_PARALLEL_NODE_INFO = [1] * gpus_per_node
node_id = rank // gpus_per_node
for i in range(num_pipeline_model_parallel_groups):
ranks = range(i, world_size, num_pipeline_model_parallel_groups)
# When on the same node
if ranks[0] // gpus_per_node == node_id:
_PIPELINE_MODEL_PARALLEL_NODE_INFO[ranks[0] % gpus_per_node] = 0
if ranks[-1] // gpus_per_node == node_id:
_PIPELINE_MODEL_PARALLEL_NODE_INFO[ranks[-1] % gpus_per_node] = 2
return wrapper
@ -199,3 +216,7 @@ def destroy_model_parallel_decorator(destroy_model_parallel):
        _MPU_EXPERT_MODEL_PARALLEL_WORLD_SIZE = None

    return wrapper
def get_pipeline_model_parallel_node_info():
    return _PIPELINE_MODEL_PARALLEL_NODE_INFO
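To make the marking scheme concrete, consider a hypothetical job with 16 ranks, 8 devices per node, and pipeline-model-parallel size 2, so each pipeline group is a pair of ranks (i, i + 8). A standalone sketch of the same bookkeeping:

```python
world_size = 16                     # hypothetical topology: 2 nodes x 8 devices
gpus_per_node = 8
pipeline_model_parallel_size = 2
num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size  # 8


def node_info_for(rank):
    # 0 = hosts the first stage of a pipeline group, 2 = the last stage, 1 = neither
    info = [1] * gpus_per_node
    node_id = rank // gpus_per_node
    for i in range(num_pipeline_model_parallel_groups):
        ranks = range(i, world_size, num_pipeline_model_parallel_groups)  # e.g. (0, 8)
        if ranks[0] // gpus_per_node == node_id:
            info[ranks[0] % gpus_per_node] = 0
        if ranks[-1] // gpus_per_node == node_id:
            info[ranks[-1] % gpus_per_node] = 2
    return info


print(node_info_for(0))   # node 0: [0, 0, 0, 0, 0, 0, 0, 0]  (all first stages)
print(node_info_for(8))   # node 1: [2, 2, 2, 2, 2, 2, 2, 2]  (all last stages)
```

With `--no-shared-storage`, `need_to_build_dataset` then picks the first local rank whose entry is not 1, so in this example local rank 0 on each node builds the dataset cache.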

View File

@ -159,3 +159,15 @@ class IsNotValidError(Exception):
def ensure_valid(expression, error_message=None):
    if not expression:
        raise IsNotValidError(error_message)


class GPTDatasetSampleIndexError(Exception):
    def __init__(self, error_message):
        super().__init__()
        self._error_message = error_message

    def __repr__(self):
        if self._error_message:
            return self._error_message
        else:
            return "Bad sample index."

View File

@ -24,7 +24,8 @@ from .core import (vocab_embedding_wrapper, initialize_model_parallel_decorator,
                   destroy_model_parallel_decorator, get_expert_parallel_group,
                   get_expert_parallel_rank, get_expert_model_parallel_rank,
                   get_expert_parallel_world_size, get_expert_model_parallel_world_size,
                   set_expert_model_parallel_rank, set_expert_model_parallel_world_size)
                   set_expert_model_parallel_rank, set_expert_model_parallel_world_size,
                   _build_generic_dataset, _build_document_sample_shuffle_indices)
from .data import build_pretraining_data_loader
from .tokenizer import build_tokenizer
from .arguments import parse_args_decorator
@ -79,6 +80,11 @@ def exe_adaptor():
        megatron.checkpointing._load_base_checkpoint)
    megatron.training.load_checkpoint = load_checkpoint_wrapper(
        megatron.checkpointing.load_checkpoint)

    from megatron.core.datasets.blended_megatron_dataset_builder import BlendedMegatronDatasetBuilder
    from megatron.core.datasets.gpt_dataset import GPTDataset
    GPTDataset._build_document_sample_shuffle_indices = _build_document_sample_shuffle_indices
    BlendedMegatronDatasetBuilder._build_generic_dataset = _build_generic_dataset
def set_moe_attr():

View File

@ -55,7 +55,7 @@ class TestConvertCkptFromHuggingface(unittest.TestCase):
        # encoder has a common final_norm and each one has the following six layers
        weight_common_content['encoder'].pop('final_norm.weight')
        self.assertEqual(len(weight_common_content['encoder']) / 6, 32)
        self.assertEqual(len(weight_common_content['encoder']) / 10, 32)
        self.assertEqual(weight_common_content['encoder']['layers.0.self_attention.query_key_value.weight'].size(), torch.Size([1536, 4096]))
        self.assertEqual(weight_common_content['encoder']['layers.0.self_attention.dense.weight'].size(), torch.Size([4096, 512]))
        self.assertEqual(weight_common_content['encoder']['layers.0.mlp.dense_h_to_4h.weight'].size(), torch.Size([2752, 4096]))