!1250 Update repository software compatibility info & update README notes & adjust llama2-70B learning rate

Merge pull request !1250 from liuyanghan/1.0.0
Authored by liuyanghan on 2024-04-29 01:34:46 +00:00; committed by i-robot
parent f33da62b90
commit 4ad901b67d
9 changed files with 46 additions and 45 deletions

View File

@@ -232,11 +232,11 @@ ModelLink旨在为华为 [昇腾芯片](https://open.codehub.huawei.com/OpenBaiz
 | 软件 | [版本](https://www.hiascend.com/zh/) |
 | :-----------------------: |:----------------------------------:|
 | Python | 3.8 |
-| driver | Ascend HDK 23.0.0 |
-| firmware | Ascend HDK 23.0.0 |
-| CANN | CANN 7.0.0 |
+| driver | Ascend HDK 24.1.RC1 |
+| firmware | Ascend HDK 24.1.RC1 |
+| CANN | CANN 8.0.RC1 |
 | torch | 2.1.0 |
-| torch_npu | release v5.0.0 |
+| torch_npu | release v6.0.RC1 |
 当前仓库使用的megatron commitId为[bcce6f54](https://github.com/NVIDIA/Megatron-LM/tree/bcce6f54e075e3c3374ea67adefe54f3f2da2b07)

View File

@@ -230,14 +230,14 @@ For the supported models listed above, we provide training scripts and readme in
 【Please note the corresponding environment versions for model usage, as follows】
-| Software | [Version](https://www.hiascend.com/zh/) |
+| 软件 | [版本](https://www.hiascend.com/zh/) |
 | :-----------------------: |:----------------------------------:|
 | Python | 3.8 |
-| driver | Ascend HDK 23.0.0 |
-| firmware | Ascend HDK 23.0.0 |
-| CANN | CANN 7.0.0 |
+| driver | Ascend HDK 24.1.RC1 |
+| firmware | Ascend HDK 24.1.RC1 |
+| CANN | CANN 8.0.RC1 |
 | torch | 2.1.0 |
-| torch_npu | release v5.0.0 |
+| torch_npu | release v6.0.RC1 |
 The current repository uses Megatron commitId [bcce6f54](https://github.com/NVIDIA/Megatron-LM/tree/bcce6f54e075e3c3374ea67adefe54f3f2da2b07)
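Both README tables above bump the required Ascend stack (driver/firmware to HDK 24.1.RC1, CANN to 8.0.RC1, torch_npu to v6.0.RC1). A minimal sanity check of a local environment against the new torch/torch_npu entries might look like the sketch below; the CANN info path is an assumption and can differ between releases:

```shell
# Print installed torch / torch_npu versions to compare against the updated table.
python -c "import torch; print('torch', torch.__version__)"
python -c "import torch, torch_npu; print('torch_npu', torch_npu.__version__)"

# CANN records its version under the toolkit install directory; the exact file
# location varies by release and platform, so this path is only an assumption.
cat /usr/local/Ascend/ascend-toolkit/latest/*/ascend_toolkit_install.info 2>/dev/null
```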

View File

@@ -161,7 +161,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
 # 修改数据集,词表,权重等路径
 CKPT_SAVE_DIR="./ckpt/internlm-7b/"
 CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
-TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
+TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
 DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
 ```
@@ -313,7 +313,7 @@ python ./tools/preprocess_data.py \
 source /usr/local/Ascend/ascend-toolkit/set_env.sh
 # 修改数据集,词表,权重等路径
 CKPT_SAVE_DIR="./ckpt/internlm-65b/"
-TOKENIZER_PATH="./model_from_hf/internlm-65b/" #词表路径
+TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #词表路径
 DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
 ```

View File

@@ -162,7 +162,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
 # modify script orign dataset path according to your own dataset path
 CKPT_SAVE_DIR="./ckpt/internlm-7b/"
 CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
-TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
+TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
 DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
 ```
@@ -312,7 +312,7 @@ python ./tools/preprocess_data.py \
 source /usr/local/Ascend/ascend-toolkit/set_env.sh
 # modify script orign dataset path according to your own dataset path
 CKPT_SAVE_DIR="./ckpt/internlm-65b/"
-TOKENIZER_PATH="./model_from_hf/internlm-65b/" #tokenizer path
+TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #tokenizer path
 DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
 ```
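All four internlm hunks rename `TOKENIZER_PATH` to `TOKENIZER_MODEL`, and the 65B examples also correct the value to point at the `tokenizer.model` file instead of its parent directory. A minimal check of the new convention (a sketch, not part of the PR; it assumes the variable is forwarded to the script's tokenizer-model argument):

```shell
# TOKENIZER_MODEL must reference the sentencepiece model file itself, not a directory.
TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model"
[ -f "${TOKENIZER_MODEL}" ] || echo "tokenizer.model not found at ${TOKENIZER_MODEL}"
```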

View File

@@ -353,12 +353,12 @@ bash tasks/finetune/tune_llama_13b_ptd.sh
 LLaMA-7B/13B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
-| 设备 | 硬件 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
-|------|-----------|-----------|------|--------------------|----------------------|-----------------|------------------|
-| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 | 159.9 |
-| 参考 | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 | 161.5 |
-| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 | 200.57 |
-| 参考 | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.65 | 213.29 |
+| 设备 | 硬件 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) |
+|------|-----------|-----------|------|--------------------|----------------------|-----------------|
+| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 |
+| 参考 | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 |
+| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 |
+| 参考 | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.65 |
@@ -705,7 +705,7 @@ NODE_RANK=0
 ```Shell
 iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | gloabl batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
-iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
+iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 |
 time (ms)
 ```

View File

@@ -347,12 +347,12 @@ bash tasks/finetune/tune_llama_13b_ptd.sh
 The performance of LLaMA-7B/13B in **Ascend NPU** and **Reference**:
-| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
-|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|-------------------------------------|
-| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 | 159.9 |
-| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 | 161.5 |
-| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 | 200.57 |
-| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.6 | 213.29 |
+| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
+|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|
+| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 |
+| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 |
+| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 |
+| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.6 |
@@ -686,7 +686,7 @@ The Training log will look like these:
 ```Shell
 iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | gloabl batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
-iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
+iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 |
 time (ms)
 ```
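These edits drop the TFLOPs column from the LLaMA performance tables and the `TFLOPs` field from the sample training log. The remaining columns stay self-consistent: token throughput is roughly sample throughput times the 2048-token sequence length, for example:

```shell
# Rough cross-check of the LLaMA-7B NPU row: samples/s/p x seqlen ~= tokens/s/p.
python -c "print(1.75 * 2048)"   # 3584.0, close to the 3600 tokens/s/p reported
```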

View File

@@ -253,10 +253,10 @@ python tools/checkpoint/util.py
 LLaMA2-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
-| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/step) | tokens吞吐 (tokens/s/p) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
-| :--: | :-------: | :----: | :---------------------: | :---------------------: | :-------------------: | :-------------------: |
-| NPUs | LLaMA2-7B | 1024 | 5.63 | 2730 | 2.84 | 131.96 |
-| 参考 | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 | 131.96 |
+| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/step) | tokens吞吐 (tokens/s/p) | 单步迭代时间 (s/step) |
+| :--: | :-------: | :----: | :---------------------: | :---------------------: | :-------------------: |
+| NPUs | LLaMA2-7B | 1024 | 5.63 | 2730 | 2.84 |
+| 参考 | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 |
 ## 推理-7B
@@ -599,10 +599,10 @@ python tools/checkpoint/util.py \
 LLaMA2-13B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
-| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | tokens吞吐 (tokens/s/p) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
-| :--: | :--------: | :----: | :--------------------: | :---------------------: | :-------------------: | :-------------------: |
-| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 | 133.77 |
-| 参考 | LLaMA2-13B | -- | -- | 1750 | -- | -- |
+| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | tokens吞吐 (tokens/s/p) | 单步迭代时间 (s/step) |
+| :--: | :--------: | :----: | :--------------------: | :---------------------: | :-------------------: |
+| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 |
+| 参考 | LLaMA2-13B | -- | -- | 1750 | -- |
 ## 推理

View File

@@ -272,10 +272,10 @@ Here's a hardware summary of pre-training LLAMA2-7B:
 The performance of LLaMA2-7B in **Ascend NPU** and **Reference**:
-| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
-| :------: | :-----------: | :----------------: | :-----------------------------: | :----------------------------: | :-------------------------: | :-----------------------------------: |
-| NPUs | LLaMA2-7B | 1024 | 5.19 | 2730 | 3.08 | 122.39 |
-| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 | 131.96 |
+| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
+| :------: | :-----------: | :----------------: | :-----------------------------: | :----------------------------: | :-------------------------: |
+| NPUs | LLaMA2-7B | 1024 | 5.19 | 2730 | 3.08 |
+| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 |
@@ -648,10 +648,10 @@ Here's a hardware summary of pre-training LLaMA2-13B:
 The performance of LLaMA2-13B in **Ascend NPU** and **Reference**:
-| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
-| :-------: | :--------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: | :---------------------------------: |
-| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 | 133.77 |
-| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
+| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
+| :-------: | :--------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: |
+| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 |
+| Reference | LLaMA2-13B | -- | -- | 1750 | -- |
 ## Inference

View File

@@ -54,11 +54,12 @@ GPT_ARGS="
 --no-masked-softmax-fusion \
 --attention-softmax-in-fp32 \
 --min-lr 1.0e-7 \
---weight-decay 1e-2 \
+--weight-decay 0.1 \
 --clip-grad 1.0 \
 --adam-beta1 0.9 \
 --initial-loss-scale 4096.0 \
---adam-beta2 0.999 \
+--adam-beta2 0.95 \
+--adam-eps 1e-5 \
 --no-gradient-accumulation-fusion \
 --load ${CKPT_LOAD_DIR} \
 --no-load-optim \
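This last hunk is presumably the llama2-70B hyperparameter adjustment referenced in the commit title: weight decay moves from 1e-2 to 0.1, Adam β2 from 0.999 to 0.95, and an explicit `--adam-eps 1e-5` is added, matching the AdamW settings reported in the Llama 2 paper. After the change, the optimizer-related portion of `GPT_ARGS` would read roughly as follows (a sketch assembled from the lines shown above; the surrounding arguments are unchanged):

```shell
# Optimizer-related subset of GPT_ARGS after this change (other arguments omitted).
GPT_ARGS="
    --min-lr 1.0e-7 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --adam-beta1 0.9 \
    --initial-loss-scale 4096.0 \
    --adam-beta2 0.95 \
    --adam-eps 1e-5 \
"
```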