!1250 Update the repository's supported software versions & revise the README & adjust the llama2-70B learning rate

Merge pull request !1250 from liuyanghan/1.0.0
liuyanghan 2024-04-29 01:34:46 +00:00 committed by i-robot
parent f33da62b90
commit 4ad901b67d
9 changed files with 46 additions and 45 deletions


@@ -232,11 +232,11 @@ ModelLink aims to provide, for Huawei [Ascend chips](https://open.codehub.huawei.com/OpenBaiz
| Software | [Version](https://www.hiascend.com/zh/) |
| :-----------------------: |:----------------------------------:|
| Python | 3.8 |
| driver | Ascend HDK 23.0.0 |
| firmware | Ascend HDK 23.0.0 |
| CANN | CANN 7.0.0 |
| driver | Ascend HDK 24.1.RC1 |
| firmware | Ascend HDK 24.1.RC1 |
| CANN | CANN 8.0.RC1 |
| torch | 2.1.0 |
| torch_npu | release v5.0.0 |
| torch_npu | release v6.0.RC1 |
The Megatron commitId used by the current repository is [bcce6f54](https://github.com/NVIDIA/Megatron-LM/tree/bcce6f54e075e3c3374ea67adefe54f3f2da2b07)
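As a side note, the driver/firmware/CANN versions listed above can be confirmed on the host before upgrading. The commands below are a minimal sketch that assumes a default installation under /usr/local/Ascend; the exact file locations may differ between installations.

```Shell
# Minimal sketch: confirm the installed Ascend driver / firmware / CANN versions.
# Assumes a default install under /usr/local/Ascend; adjust paths for your system.
npu-smi info                                                                 # NPU driver version and device status
cat /usr/local/Ascend/driver/version.info                                    # driver package version file (path may vary)
cat /usr/local/Ascend/ascend-toolkit/latest/*/ascend_toolkit_install.info    # CANN toolkit version (path may vary)
```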


@@ -230,14 +230,14 @@ For the supported models listed above, we provide training scripts and readme in
【Please note the corresponding environment versions for model usage, as follows】
| Software | [Version](https://www.hiascend.com/zh/) |
| Software | [Version](https://www.hiascend.com/zh/) |
| :-----------------------: |:----------------------------------:|
| Python | 3.8 |
| driver | Ascend HDK 23.0.0 |
| firmware | Ascend HDK 23.0.0 |
| CANN | CANN 7.0.0 |
| driver | Ascend HDK 24.1.RC1 |
| firmware | Ascend HDK 24.1.RC1 |
| CANN | CANN 8.0.RC1 |
| torch | 2.1.0 |
| torch_npu | release v5.0.0 |
| torch_npu | release v6.0.RC1 |
The current repository uses Megatron commitId [bcce6f54](https://github.com/NVIDIA/Megatron-LM/tree/bcce6f54e075e3c3374ea67adefe54f3f2da2b07)
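Similarly, the Python-side versions in the table (torch and torch_npu) can be checked directly in the target environment. This is a sketch that assumes torch_npu exposes `__version__` and patches the `torch.npu` namespace, as recent releases do.

```Shell
# Sketch: check that the Python stack matches the versions in the table above.
python -c "import torch; print('torch', torch.__version__)"
python -c "import torch_npu; print('torch_npu', torch_npu.__version__)"
python -c "import torch, torch_npu; print('NPU available:', torch.npu.is_available())"
```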


@@ -161,7 +161,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer, and weight paths as needed
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
```
@@ -313,7 +313,7 @@ python ./tools/preprocess_data.py \
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer, and weight paths as needed
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_PATH="./model_from_hf/internlm-65b/" #词表路径
TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
```
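Since this change repoints the variable from the tokenizer directory to the tokenizer.model file itself (and renames it to TOKENIZER_MODEL), a quick pre-flight check such as the sketch below can catch path mistakes before launching training. The .bin/.idx suffixes assume the standard Megatron indexed-dataset layout produced by preprocess_data.py.

```Shell
# Pre-flight sketch: make sure the configured paths exist before launching training.
test -f "${TOKENIZER_MODEL}" || echo "missing tokenizer model: ${TOKENIZER_MODEL}"
{ test -f "${DATA_PATH}.bin" && test -f "${DATA_PATH}.idx"; } || echo "missing preprocessed dataset at prefix: ${DATA_PATH}"
mkdir -p "${CKPT_SAVE_DIR}"    # create the checkpoint directory if it does not exist yet
```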


@@ -162,7 +162,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the script's original dataset path according to your own dataset path
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
```
@@ -312,7 +312,7 @@ python ./tools/preprocess_data.py \
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the script's original dataset path according to your own dataset path
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_PATH="./model_from_hf/internlm-65b/" #tokenizer path
TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
```
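The renamed TOKENIZER_MODEL variable must point at the tokenizer model file rather than its parent directory. Assuming the InternLM tokenizer.model is a SentencePiece model (as the file name suggests) and the sentencepiece package is installed, it can be sanity-checked like this:

```Shell
# Sketch: verify tokenizer.model loads as a SentencePiece model (assumes `pip install sentencepiece`).
python -c "import sentencepiece as spm; sp = spm.SentencePieceProcessor(model_file='./model_from_hf/internlm-7b/tokenizer.model'); print('vocab size:', sp.vocab_size())"
```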


@@ -353,12 +353,12 @@ bash tasks/finetune/tune_llama_13b_ptd.sh
Performance comparison of LLaMA-7B/13B on **Ascend NPU** and **Reference** chips:
| Device | Hardware | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/p/s) | Single-step time (s/step) | FLOPs (TFLOPs/s) |
|------|-----------|-----------|------|--------------------|----------------------|-----------------|------------------|
| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 | 159.9 |
| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 | 161.5 |
| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 | 200.57 |
| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.65 | 213.29 |
| Device | Hardware | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/p/s) | Single-step time (s/step) |
|------|-----------|-----------|------|--------------------|----------------------|-----------------|
| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 |
| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 |
| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 |
| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.65 |
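As a rough cross-check, the token throughput column is approximately the sample throughput multiplied by the 2048-token sequence length used in these runs; the small gap comes from rounding in the reported figures.

```Shell
# Rough cross-check: tokens/p/s ≈ samples/p/s × sequence length (2048)
python -c "print(1.75 * 2048)"   # ≈ 3584, close to the reported 3600 tokens/p/s for LLaMA-7B on NPUs
python -c "print(0.92 * 2048)"   # ≈ 1884, close to the reported 1895 tokens/p/s for LLaMA-13B on NPUs
```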
@@ -705,7 +705,7 @@ NODE_RANK=0
```Shell
iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | global batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 |
time (ms)
```


@@ -347,12 +347,12 @@ bash tasks/finetune/tune_llama_13b_ptd.sh
The performance of LLaMA-7B/13B in **Ascend NPU** and **Reference**:
| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|-------------------------------------|
| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 | 159.9 |
| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 | 161.5 |
| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 | 200.57 |
| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.6 | 213.29 |
| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|
| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 |
| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 |
| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 |
| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.6 |
@@ -686,7 +686,7 @@ The training log will look like this:
```Shell
iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | global batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 |
time (ms)
```
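The fields in the sample log line are internally consistent, which makes them useful as a quick sanity check on a new run; for the line above:

```Shell
# Quick consistency checks against the sample log line above
python -c "print(512 * 11)"        # consumed samples after 11 iterations at global batch size 512 = 5632
python -c "print(5632 * 2048)"     # consumed tokens = consumed samples × actual seqlen = 11534336
python -c "print(512 / 52.7281)"   # samples per second ≈ global batch size / iteration time ≈ 9.71
```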


@@ -253,10 +253,10 @@ python tools/checkpoint/util.py
Performance comparison of LLaMA2-7B on **Ascend NPU** and **Reference** chips:
| Device | Model | Iterations | Sample throughput (samples/step) | Token throughput (tokens/s/p) | Single-step time (s/step) | FLOPs (TFLOPs/s) |
| :--: | :-------: | :----: | :---------------------: | :---------------------: | :-------------------: | :-------------------: |
| NPUs | LLaMA2-7B | 1024 | 5.63 | 2730 | 2.84 | 131.96 |
| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 | 131.96 |
| Device | Model | Iterations | Sample throughput (samples/step) | Token throughput (tokens/s/p) | Single-step time (s/step) |
| :--: | :-------: | :----: | :---------------------: | :---------------------: | :-------------------: |
| NPUs | LLaMA2-7B | 1024 | 5.63 | 2730 | 2.84 |
| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 |
## Inference-7B
@@ -599,10 +599,10 @@ python tools/checkpoint/util.py \
Performance comparison of LLaMA2-13B on **Ascend NPU** and **Reference** chips:
| Device | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/s/p) | Single-step time (s/step) | FLOPs (TFLOPs/s) |
| :--: | :--------: | :----: | :--------------------: | :---------------------: | :-------------------: | :-------------------: |
| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 | 133.77 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
| Device | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/s/p) | Single-step time (s/step) |
| :--: | :--------: | :----: | :--------------------: | :---------------------: | :-------------------: |
| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- |
## Inference


@@ -272,10 +272,10 @@ Here's a hardware summary of pre-training LLAMA2-7B:
The performance of LLaMA2-7B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
| :------: | :-----------: | :----------------: | :-----------------------------: | :----------------------------: | :-------------------------: | :-----------------------------------: |
| NPUs | LLaMA2-7B | 1024 | 5.19 | 2730 | 3.08 | 122.39 |
| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 | 131.96 |
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
| :------: | :-----------: | :----------------: | :-----------------------------: | :----------------------------: | :-------------------------: |
| NPUs | LLaMA2-7B | 1024 | 5.19 | 2730 | 3.08 |
| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 |
@@ -648,10 +648,10 @@ Here's a hardware summary of pre-training LLaMA2-13B:
The performance of LLaMA2-13B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
| :-------: | :--------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: | :---------------------------------: |
| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 | 133.77 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
| :-------: | :--------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: |
| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- |
## Inference


@@ -54,11 +54,12 @@ GPT_ARGS="
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.0e-7 \
--weight-decay 1e-2 \
--weight-decay 0.1 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--initial-loss-scale 4096.0 \
--adam-beta2 0.999 \
--adam-beta2 0.95 \
--adam-eps 1e-5 \
--no-gradient-accumulation-fusion \
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
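Since the rendered diff interleaves old and new values, the optimizer-related portion of GPT_ARGS for llama2-70B after this change reads roughly as follows; this is an excerpt reconstructed from the hunk above, with the surrounding, unchanged arguments omitted.

```Shell
# Excerpt (sketch): optimizer-related GPT_ARGS for llama2-70B after this change
    --min-lr 1.0e-7 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --adam-beta1 0.9 \
    --initial-loss-scale 4096.0 \
    --adam-beta2 0.95 \
    --adam-eps 1e-5 \
```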