!1362 Add data cache detection and generation on non-master nodes for pre-training without shared storage
Merge pull request !1362 from glhyy/1.0.0
parent 8f4f1079c9
commit c9fa42a3b3
@@ -34,6 +34,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -164,7 +165,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -34,6 +34,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -159,7 +160,7 @@ Start pre-training Aquila-7B model:
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

### Performance

@@ -48,6 +48,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -178,7 +179,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -276,6 +277,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -414,7 +416,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -48,6 +48,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -177,7 +178,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

@@ -278,6 +279,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -418,7 +420,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

### Performance

@@ -45,6 +45,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -178,7 +179,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -275,6 +276,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -407,7 +409,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -45,6 +45,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -177,7 +178,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

### Performance

@@ -272,6 +273,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -403,7 +405,7 @@ CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

### Performance

@@ -30,6 +30,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -152,7 +153,7 @@ CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -242,6 +243,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -368,7 +370,7 @@ DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
bash examples/bloom/pretrain_bloom_176b.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

## 性能

@@ -30,6 +30,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -151,7 +152,7 @@ CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

### Performance

@@ -241,6 +242,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -369,7 +371,7 @@ Run the examples/bloom/pretrain_bloom_176b.sh on all nodes in the cluster.
bash examples/bloom/pretrain_bloom_176b.sh
```

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

## Performance

@@ -43,6 +43,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -171,7 +172,7 @@ DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
bash examples/intern/pretrain_internlm_7b_ptd.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -235,6 +236,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -323,7 +325,7 @@ DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
bash examples/intern/pretrain_internlm_65b_ptd.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -44,6 +44,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -172,7 +173,7 @@ DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
bash examples/intern/pretrain_internlm_7b_ptd.sh
```

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

### Performance

@@ -234,6 +235,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -322,7 +324,7 @@ DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
bash examples/intern/pretrain_internlm_65b_ptd.sh
```

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

### Performance

@@ -46,6 +46,7 @@ LLaMA-7B/13B 训练的硬件配置如下:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -242,7 +243,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-13b-hf/"

5.3 启动 LLaMA-7B/13B 预训练脚本

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

LLaMA-7B

@@ -482,6 +483,7 @@ LLaMA-33B/65B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -677,7 +679,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-65b-hf/"

5.3 启动预训练脚本:

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

启动 llama-33B 预训练脚本 : ./examples/llama/pretrain_llama_33B_ptd_32p.sh

@@ -45,6 +45,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -237,7 +238,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-13b-hf/"

5.3 Launch LLaMA-7B/13B pre-training script.

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

LLaMA-7B

@@ -466,6 +467,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -659,7 +661,7 @@ SAVE_CHECKPOINT_PATH="./ckpt/llama-65b-hf/"

5.3 Launch pre-training script:

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

Launch llama-33B pre-training script : ModelLink/examples/llama/pretrain_llama_33B_ptd_32p.sh

@@ -54,6 +54,7 @@ LLAMA2-7B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -183,7 +184,7 @@ python tools/checkpoint/util.py
```shell
bash examples/llama2/pretrain_llama2_7b_ptd.sh
```
**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

6. 微调

6.1 准备微调数据集

@@ -406,6 +407,7 @@ LLaMA2-13B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -530,7 +532,7 @@ python tools/checkpoint/util.py \
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

6. 微调

6.1 准备微调数据集

@@ -696,6 +698,7 @@ LLaMA2-34B/70B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -957,7 +960,7 @@ python tools/checkpoint/util.py \
bash examples/llama2/pretrain_llama2_70b_ptd.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

6. 微调

6.1 准备微调数据集

@@ -52,6 +52,7 @@ Here's a hardware summary of pre-training LLAMA2-7B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -194,7 +195,7 @@ Here's a hardware summary of pre-training LLAMA2-7B:
bash examples/llama2/pretrain_llama2_7b_ptd.sh
```

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

6. fine-tuning

@@ -415,7 +416,8 @@ Here's a hardware summary of pre-training LLaMA2-13B:
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -559,7 +561,7 @@ Here's a hardware summary of pre-training LLaMA2-13B:
```shell
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```
**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

```shell
# download datasets

@@ -746,6 +748,7 @@ git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -990,7 +993,7 @@ pip install -r requirements.txt

Launch pre-training script

**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

LLaMA2-34B: examples/llama2/pretrain_llama2_34B_ptd_16p.sh

@@ -46,6 +46,7 @@
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -200,7 +201,7 @@
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
```

**注意**:如果使用多机训练,且没有设置数据共享,需要在各节点同步首节点数据。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

2. 微调

@@ -46,6 +46,7 @@ Recommended hardware configuration for inference:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -201,7 +202,7 @@ Recommended hardware configuration for inference:
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
```

**Note**: For multi-machine training without shared storage across the machines, the data generated on the first node needs to be copied to the other nodes.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

2. Fine-Tuning

@@ -54,6 +54,7 @@ Qwen-7B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -202,7 +203,7 @@ Qwen-7B 训练的硬件配置:
bash examples/qwen/pretrain_qwen_7b_ptd.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -289,6 +290,7 @@ Qwen-14B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -442,7 +444,7 @@ Qwen-14B 训练的硬件配置:
bash examples/qwen/pretrain_qwen_14b_ptd.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -529,6 +531,7 @@ Qwen-72B 训练的硬件配置:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -673,7 +676,7 @@ Qwen-72B 训练的硬件配置:
bash examples/qwen/pretrain_qwen_72b_ptd.sh
```

**注意**:如果使用多机训练,需要设置多机数据共享,非主节点通过数据共享读取主节点数据。或者,直接将主节点生成的数据复制到非主节点。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数,设置此参数之后将会根据分布式参数判断非主节点是否需要load数据,并检查相应缓存和生成数据。

### 性能

@@ -52,6 +52,7 @@ Here's a hardware summary of pre-training Qwen-7B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -198,7 +199,7 @@ Here's a hardware summary of pre-training Qwen-7B:
```shell
bash examples/qwen/pretrain_qwen_7b_ptd.sh
```
**Note**: For multi-machine training, shared storage needs to be set up across the machines so that non-master nodes can read the master node's data through it. Alternatively, copy the data generated by the master node to the non-master nodes directly.
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed configuration is used to decide whether a non-master node needs to load data, and the corresponding cache is checked and the data generated if missing.

### Performance

@@ -283,6 +284,7 @@ Here's a hardware summary of pre-training Qwen-14B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -523,6 +525,7 @@ Here's a hardware summary of pre-training Qwen-72B:
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0.0
mkdir logs
mkdir model_from_hf
mkdir dataset

@@ -43,6 +43,7 @@ def process_args(parser):
    parser = _add_data_args(parser)
    parser = _add_moe_args(parser)
    parser = _add_num_layer_allocation(parser)
    parser = _add_dataset_args(parser)
    return parser

@@ -122,3 +123,13 @@ def _add_network_size_args(parser):
                       help='set padded vocab size'
                       )
    return parser


def _add_dataset_args(parser):
    group = parser.add_argument_group(title='dataset_args')
    group.add_argument('--no-shared-storage',
                       action='store_true',
                       default=False,
                       help='if no shared storage, set it'
                       )
    return parser

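For reference, here is a minimal standalone sketch (not part of this PR) of how the new flag is consumed: a launch script would pass `--no-shared-storage` alongside its other training arguments, and the option is registered exactly as in `_add_dataset_args` above. The parser setup and the `share_save` variable below only illustrate the intended switch.

```python
# Minimal sketch: register and parse --no-shared-storage the way
# modellink/arguments.py does, then show the behaviour it toggles.
import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group(title='dataset_args')
group.add_argument('--no-shared-storage',
                   action='store_true',
                   default=False,
                   help='if no shared storage, set it')

# A launch script would append "--no-shared-storage" to its training arguments.
args = parser.parse_args(['--no-shared-storage'])

# Mirrors the check in need_to_build_dataset(): with shared storage only rank 0
# builds the dataset cache; without it, one rank per node has to build it.
share_save = not args.no_shared_storage
print(share_save)  # False -> non-master nodes must detect/generate their own cache
```
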
@@ -19,3 +19,5 @@ from .parallel_state import (initialize_model_parallel_decorator, destroy_model_
                             get_expert_model_parallel_world_size, get_expert_parallel_group,
                             get_expert_parallel_rank, get_expert_parallel_world_size,
                             set_expert_model_parallel_world_size, set_expert_model_parallel_rank)
from .datasets.blended_megatron_dataset_builder import _build_generic_dataset
from .datasets.gpt_dataset import _build_document_sample_shuffle_indices

@@ -0,0 +1,82 @@
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.

import logging
from typing import Any, Optional, Type, Union

import torch

from megatron import get_args
from megatron.core.datasets.blended_dataset import BlendedDataset
from megatron.core.datasets.indexed_dataset import MMapIndexedDataset
from megatron.core.datasets.megatron_dataset import MegatronDataset

logger = logging.getLogger(__name__)

DistributedDataset = Union[BlendedDataset, MegatronDataset, MMapIndexedDataset]

from ..parallel_state import get_pipeline_model_parallel_node_info


def need_to_build_dataset():
    args = get_args()
    share_save = not args.no_shared_storage
    rank = torch.distributed.get_rank()
    if share_save:
        return rank == 0
    gpus_per_node = torch.cuda.device_count()
    node_pp_group_info = get_pipeline_model_parallel_node_info()
    flag = False
    num_edge_ranks = sum([x != 1 for x in node_pp_group_info])
    if num_edge_ranks >= 1:
        first_idx = node_pp_group_info.index([x for x in node_pp_group_info if x != 1][0])
        flag = (first_idx == rank % gpus_per_node)
    return flag


def _build_generic_dataset(
    self, cls: Type[DistributedDataset], *args: Any,
) -> Optional[DistributedDataset]:
    """Build the DistributedDataset

    Return None if and only if the underlying MegatronDataset class is not built on the current
    rank and torch.distributed is initialized.

    Args:
        cls (Type[DistributedDataset]): The DistributedDataset class to be built

        args (Tuple[Any]): The positional arguments used to build the provided
            DistributedDataset class

    Raises:
        Exception: When the dataset constructor raises an OSError

    Returns:
        Optional[DistributedDataset]: The DistributedDataset instantion or None
    """
    if torch.distributed.is_initialized():

        dataset = None

        # First, build on ranks of first and last of pp group
        to_build_flag = need_to_build_dataset()
        if to_build_flag and getattr(self.config, "is_built_on_rank")():
            try:
                dataset = cls(*args)
            except OSError as err:
                log = (
                    f"Failed to write dataset materials to the data cache directory. "
                    + f"Please supply a directory to which you have write access via "
                    + f"the path_to_cache attribute in BlendedMegatronDatasetConfig and "
                    + f"retry. Refer to the preserved traceback above for more information."
                )
                raise Exception(log) from err

        torch.distributed.barrier()

        # After, build on other ranks
        if not to_build_flag and getattr(self.config, "is_built_on_rank")():
            dataset = cls(*args)

        return dataset

    return cls(*args)

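To make the rank-selection rule in `need_to_build_dataset` easier to follow, here is a self-contained sketch with the `torch.distributed` and CUDA queries replaced by plain arguments; the node-info markers follow `parallel_state.py` (0 = first pipeline stage, 2 = last, 1 = other), and the example topology is assumed, not taken from the PR.

```python
# Standalone sketch of the rank-selection rule used by need_to_build_dataset()
# when --no-shared-storage is set (torch.distributed replaced by plain inputs).

def needs_to_build(rank, gpus_per_node, node_pp_group_info):
    # node_pp_group_info: one marker per local rank, 0 = first pp stage,
    # 2 = last pp stage, 1 = any other stage (see parallel_state.py).
    num_edge_ranks = sum(x != 1 for x in node_pp_group_info)
    if num_edge_ranks >= 1:
        # the first local rank on this node that sits on a pipeline edge builds the cache
        first_idx = node_pp_group_info.index(next(x for x in node_pp_group_info if x != 1))
        return first_idx == rank % gpus_per_node
    return False

# Assumed example: 8 GPUs per node, local rank 0 of this node is a pipeline edge.
info = [0, 1, 1, 1, 1, 1, 1, 2]
print([needs_to_build(r, 8, info) for r in range(8, 16)])
# -> only the rank whose local index matches the first edge position returns True
```
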
@@ -0,0 +1,233 @@
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.

import logging
import os
import time
from typing import Tuple

import numpy

from megatron.core.datasets.utils import log_single_rank
from megatron.core.datasets.gpt_dataset import (_get_num_tokens_per_epoch,
                                                _get_num_epochs,
                                                _build_document_index,
                                                _build_shuffle_index
                                                )
from modellink.error_utils import GPTDatasetSampleIndexError
from .blended_megatron_dataset_builder import need_to_build_dataset

logger = logging.getLogger(__name__)


def _build_document_sample_shuffle_indices(
    self,
) -> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]:
    """Build the document index, the sample index, and the shuffle index

    The document index:
        -- 1-D
        -- An ordered array of document ids

    The sample index:
        -- 2-D
        -- The document indices and offsets which mark the start of every sample

    The shuffle index:
        -- 1-D
        -- A random permutation of index range of the sample index

    Returns:
        Tuple[numpy.ndarray, numpy.ndarray]: The document index, the sample index, and the
        shuffle index

    TODO: Explain the 80% threshold
    """
    path_to_cache = getattr(self.config, "path_to_cache")
    if path_to_cache is None:
        path_to_cache = os.path.join(
            self.indexed_dataset.path_prefix, "cache", f"{type(self).__name__}_indices"
        )

    get_path_to = lambda suffix: os.path.join(
        path_to_cache, f"{self.unique_description_hash}-{type(self).__name__}-{suffix}"
    )
    path_to_description = get_path_to("description.txt")
    path_to_document_index = get_path_to("document_index.npy")
    path_to_sample_index = get_path_to("sample_index.npy")
    path_to_shuffle_index = get_path_to("shuffle_index.npy")
    cache_hit = all(
        map(
            os.path.isfile,
            [
                path_to_description,
                path_to_document_index,
                path_to_sample_index,
                path_to_shuffle_index,
            ],
        )
    )

    num_tokens_per_epoch = _get_num_tokens_per_epoch(self.indexed_dataset, self.indexed_indices)

    sequence_length = getattr(self.config, "sequence_length")

    num_epochs = _get_num_epochs(num_tokens_per_epoch, sequence_length, self.num_samples)

    # When the rank on the first or last stage of the pipeline_model_parallel_group,
    # it need to build dataset
    if not cache_hit and need_to_build_dataset():
        log_single_rank(
            logger,
            logging.INFO,
            f"Build and save the {type(self).__name__} {self.index_split.name} indices",
        )

        if num_epochs == 1:
            separate_final_epoch = False
        else:
            # Get the number of samples for the last epoch
            num_samples_sans_final_epoch = (
                (num_epochs - 1) * num_tokens_per_epoch - 1
            ) // sequence_length
            num_samples_from_final_epoch = self.num_samples - num_samples_sans_final_epoch
            num_samples_per_epoch = (num_tokens_per_epoch - 1) // sequence_length

            # num_samples_from_final_epoch should be non-negative
            assert num_samples_from_final_epoch >= 0

            # num_samples_from_final_epoch should not exceed max value
            assert num_samples_from_final_epoch <= num_samples_per_epoch + 1

            # Separate the final epoch if it falls below the threshold
            threshold = 0.80
            separate_final_epoch = num_samples_from_final_epoch < int(
                threshold * num_samples_per_epoch
            )

            log_single_rank(
                logger,
                logging.DEBUG,
                f"> num_samples_from_final_epoch: {num_samples_from_final_epoch}",
            )
            log_single_rank(logger, logging.DEBUG, f"> threshold: {threshold}")
            log_single_rank(
                logger, logging.DEBUG, f"> num_samples_per_epoch: {num_samples_per_epoch}"
            )

        log_single_rank(
            logger, logging.DEBUG, f"> separate_final_epoch: {separate_final_epoch}"
        )

        numpy_random_state = numpy.random.RandomState(getattr(self.config, "random_seed"))

        os.makedirs(path_to_cache, exist_ok=True)

        # Write the description
        with open(path_to_description, "wt") as writer:
            writer.write(self.unique_description)

        # Build the document index
        log_single_rank(
            logger,
            logging.INFO,
            f"\tBuild and save the document index to {os.path.basename(path_to_document_index)}",
        )
        t_beg = time.time()
        document_index = _build_document_index(
            self.indexed_indices, num_epochs, numpy_random_state, separate_final_epoch
        )
        numpy.save(path_to_document_index, document_index, allow_pickle=True)
        t_end = time.time()
        log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

        # Build the sample index
        log_single_rank(
            logger,
            logging.INFO,
            f"\tBuild and save the sample index to {os.path.basename(path_to_sample_index)}",
        )
        t_beg = time.time()
        from megatron.core.datasets import helpers

        assert document_index.dtype == numpy.int32
        assert self.indexed_dataset.sequence_lengths.dtype == numpy.int32
        sample_index = helpers.build_sample_idx(
            self.indexed_dataset.sequence_lengths,
            document_index,
            sequence_length,
            num_epochs,
            num_tokens_per_epoch,
        )

        if any(sample_index[:, 0] < 0):
            _url = "https://gitee.com/ascend/ModelLink/wikis/megatron%20data%20helpers%E5%8F%AF%E8%83%BD%E5%BC%95%E5%85%A5%E7%9A%84%E9%97%AE%E9%A2%98"
            raise GPTDatasetSampleIndexError(f"Bad sample index. Visit {_url} for more information")

        numpy.save(path_to_sample_index, sample_index, allow_pickle=True)
        t_end = time.time()
        log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

        # Build the shuffle index
        log_single_rank(
            logger,
            logging.INFO,
            f"\tBuild and save the shuffle index to {os.path.basename(path_to_shuffle_index)}",
        )
        t_beg = time.time()
        if separate_final_epoch:
            shuffle_index = _build_shuffle_index(
                num_samples_sans_final_epoch, sample_index.shape[0] - 1, numpy_random_state
            )
        else:
            shuffle_index = _build_shuffle_index(
                sample_index.shape[0] - 1, sample_index.shape[0] - 1, numpy_random_state
            )
        numpy.save(path_to_shuffle_index, shuffle_index, allow_pickle=True)
        t_end = time.time()
        log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger, logging.INFO, f"Load the {type(self).__name__} {self.index_split.name} indices"
    )

    log_single_rank(
        logger,
        logging.INFO,
        f"\tLoad the document index from {os.path.basename(path_to_document_index)}",
    )
    t_beg = time.time()
    document_index = numpy.load(path_to_document_index, allow_pickle=True, mmap_mode='r')
    t_end = time.time()
    log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger,
        logging.INFO,
        f"\tLoad the sample index from {os.path.basename(path_to_sample_index)}",
    )
    t_beg = time.time()
    sample_index = numpy.load(path_to_sample_index, allow_pickle=True, mmap_mode='r')

    if any(sample_index[:, 0] < 0):
        _url = "https://gitee.com/ascend/ModelLink/wikis/megatron%20data%20helpers%E5%8F%AF%E8%83%BD%E5%BC%95%E5%85%A5%E7%9A%84%E9%97%AE%E9%A2%98"
        raise GPTDatasetSampleIndexError(f"Bad sample index. Visit {_url} for more information")

    t_end = time.time()
    log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger,
        logging.INFO,
        f"\tLoad the shuffle index from {os.path.basename(path_to_shuffle_index)}",
    )
    t_beg = time.time()
    shuffle_index = numpy.load(path_to_shuffle_index, allow_pickle=True, mmap_mode='r')
    t_end = time.time()
    log_single_rank(logger, logging.DEBUG, f"\t> time elapsed: {t_end - t_beg:4f} seconds")

    log_single_rank(
        logger, logging.INFO, f"> total number of samples: {sample_index.shape[0] - 1}"
    )
    log_single_rank(logger, logging.INFO, f"> total number of epochs: {num_epochs}")

    return document_index, sample_index, shuffle_index

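As a rough illustration of the cache detection performed above, the sketch below rebuilds the four cache-file paths for a hypothetical data prefix and description hash and evaluates the same `cache_hit` condition; all names and values here are assumptions for illustration only.

```python
# Hypothetical sketch of the cache-hit check in _build_document_sample_shuffle_indices:
# all four index artifacts must already exist for a rank to skip building them.
import os

path_prefix = "./dataset/alpaca_text_document"   # assumed data path prefix
unique_description_hash = "0123abcd"             # assumed config hash
cls_name = "GPTDataset"

path_to_cache = os.path.join(path_prefix, "cache", f"{cls_name}_indices")
get_path_to = lambda suffix: os.path.join(
    path_to_cache, f"{unique_description_hash}-{cls_name}-{suffix}"
)

cache_files = [
    get_path_to("description.txt"),
    get_path_to("document_index.npy"),
    get_path_to("sample_index.npy"),
    get_path_to("shuffle_index.npy"),
]
cache_hit = all(map(os.path.isfile, cache_files))
print(cache_hit)  # False on a non-master node until that node has generated the cache
```
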
@@ -23,6 +23,7 @@ import megatron
_EXPERT_PARALLEL_GROUP = None
_MPU_EXPERT_MODEL_PARALLEL_RANK = None
_MPU_EXPERT_MODEL_PARALLEL_WORLD_SIZE = None
_PIPELINE_MODEL_PARALLEL_NODE_INFO = None


def initialize_model_parallel_decorator(initialize_model_parallel):

@@ -114,6 +115,22 @@ def initialize_model_parallel_decorator(initialize_model_parallel):
        print_rank_0(f"all tp gourps {all_tp_groups}")
        print_rank_0(f"all ep groups {all_ep_groups}")
        print_rank_0(f"all dp groups {all_data_parallel_group_ranks}")

        gpus_per_node = torch.cuda.device_count()

        # 0: Start of the pipeline_model_parallel_group
        # 2: End of the pipeline_model_parallel_group
        # 1: Other
        global _PIPELINE_MODEL_PARALLEL_NODE_INFO
        _PIPELINE_MODEL_PARALLEL_NODE_INFO = [1] * gpus_per_node
        node_id = rank // gpus_per_node
        for i in range(num_pipeline_model_parallel_groups):
            ranks = range(i, world_size, num_pipeline_model_parallel_groups)
            # When on the same node
            if ranks[0] // gpus_per_node == node_id:
                _PIPELINE_MODEL_PARALLEL_NODE_INFO[ranks[0] % gpus_per_node] = 0
            if ranks[-1] // gpus_per_node == node_id:
                _PIPELINE_MODEL_PARALLEL_NODE_INFO[ranks[-1] % gpus_per_node] = 2

    return wrapper

@@ -199,3 +216,7 @@ def destroy_model_parallel_decorator(destroy_model_parallel):
        _MPU_EXPERT_MODEL_PARALLEL_WORLD_SIZE = None

    return wrapper


def get_pipeline_model_parallel_node_info():
    return _PIPELINE_MODEL_PARALLEL_NODE_INFO

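A toy reconstruction of the bookkeeping added here may help: for an assumed topology of 16 ranks, 8 GPUs per node and pipeline-parallel size 2, the loop below reproduces how `_PIPELINE_MODEL_PARALLEL_NODE_INFO` is filled on each node. The sizes are illustrative, not taken from the PR.

```python
# Toy reconstruction of the node-info bookkeeping added to
# initialize_model_parallel_decorator (sizes are assumed, not from the PR).
world_size = 16
gpus_per_node = 8
num_pipeline_model_parallel_groups = world_size // 2  # assumed pp size of 2

def node_info(rank):
    info = [1] * gpus_per_node                  # 1 = neither first nor last pp stage
    node_id = rank // gpus_per_node
    for i in range(num_pipeline_model_parallel_groups):
        ranks = range(i, world_size, num_pipeline_model_parallel_groups)
        if ranks[0] // gpus_per_node == node_id:
            info[ranks[0] % gpus_per_node] = 0  # first stage of this pp group
        if ranks[-1] // gpus_per_node == node_id:
            info[ranks[-1] % gpus_per_node] = 2  # last stage of this pp group
    return info

print(node_info(0))   # node 0: [0, 0, 0, 0, 0, 0, 0, 0] -> every local rank is a first stage
print(node_info(8))   # node 1: [2, 2, 2, 2, 2, 2, 2, 2] -> every local rank is a last stage
```
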
@@ -159,3 +159,15 @@ class IsNotValidError(Exception):
def ensure_valid(expression, error_message=None):
    if not expression:
        raise IsNotValidError(error_message)


class GPTDatasetSampleIndexError(Exception):
    def __init__(self, error_message):
        super().__init__()
        self._error_message = error_message

    def __repr__(self):
        if self._error_message:
            return self._error_message
        else:
            return "Bad sample index."

@@ -24,7 +24,8 @@ from .core import (vocab_embedding_wrapper, initialize_model_parallel_decorator,
                   destroy_model_parallel_decorator, get_expert_parallel_group,
                   get_expert_parallel_rank, get_expert_model_parallel_rank,
                   get_expert_parallel_world_size, get_expert_model_parallel_world_size,
                   set_expert_model_parallel_rank, set_expert_model_parallel_world_size)
                   set_expert_model_parallel_rank, set_expert_model_parallel_world_size,
                   _build_generic_dataset, _build_document_sample_shuffle_indices)
from .data import build_pretraining_data_loader
from .tokenizer import build_tokenizer
from .arguments import parse_args_decorator

@@ -79,6 +80,11 @@ def exe_adaptor():
        megatron.checkpointing._load_base_checkpoint)
    megatron.training.load_checkpoint = load_checkpoint_wrapper(
        megatron.checkpointing.load_checkpoint)

    from megatron.core.datasets.blended_megatron_dataset_builder import BlendedMegatronDatasetBuilder
    from megatron.core.datasets.gpt_dataset import GPTDataset
    GPTDataset._build_document_sample_shuffle_indices = _build_document_sample_shuffle_indices
    BlendedMegatronDatasetBuilder._build_generic_dataset = _build_generic_dataset


def set_moe_attr():

@@ -55,7 +55,7 @@ class TestConvertCkptFromHuggingface(unittest.TestCase):

        # encoder has a common final_norm and each one has folliowing six layers
        weight_common_content['encoder'].pop('final_norm.weight')
        self.assertEqual(len(weight_common_content['encoder']) / 6, 32)
        self.assertEqual(len(weight_common_content['encoder']) / 10, 32)
        self.assertEqual(weight_common_content['encoder']['layers.0.self_attention.query_key_value.weight'].size(), torch.Size([1536, 4096]))
        self.assertEqual(weight_common_content['encoder']['layers.0.self_attention.dense.weight'].size(), torch.Size([4096, 512]))
        self.assertEqual(weight_common_content['encoder']['layers.0.mlp.dense_h_to_4h.weight'].size(), torch.Size([2752, 4096]))