!1250 Update the repository's supported environment info & update the README documentation & adjust the llama2-70B learning rate
Merge pull request !1250 from liuyanghan/1.0.0
This commit is contained in:
parent f33da62b90
commit 4ad901b67d
@@ -232,11 +232,11 @@ ModelLink aims to provide Huawei [Ascend chips](https://open.codehub.huawei.com/OpenBaiz
| Software | [Version](https://www.hiascend.com/zh/) |
| :-----------------------: |:----------------------------------:|
| Python | 3.8 |
-| driver | Ascend HDK 23.0.0 |
-| firmware | Ascend HDK 23.0.0 |
-| CANN | CANN 7.0.0 |
+| driver | Ascend HDK 24.1.RC1 |
+| firmware | Ascend HDK 24.1.RC1 |
+| CANN | CANN 8.0.RC1 |
| torch | 2.1.0 |
-| torch_npu | release v5.0.0 |
+| torch_npu | release v6.0.RC1 |

The current repository uses Megatron commitId [bcce6f54](https://github.com/NVIDIA/Megatron-LM/tree/bcce6f54e075e3c3374ea67adefe54f3f2da2b07)
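The versions listed above can be sanity-checked on the target machine. A minimal sketch, assuming a default CANN install under /usr/local/Ascend and that the driver's `npu-smi` tool is on the PATH (neither assumption comes from this repository):

```shell
# Python / PyTorch stack
python3 --version
pip show torch torch_npu | grep -E '^(Name|Version)'

# NPU driver and firmware, as reported by the driver's management tool
npu-smi info

# Installed CANN toolkit versions (directory names under the default install path)
ls /usr/local/Ascend/ascend-toolkit/
```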
README_en.md
@@ -230,14 +230,14 @@ For the supported models listed above, we provide training scripts and readme in
【Please note the corresponding environment versions for model usage, as follows】

-| Software | [Version](https://www.hiascend.com/zh/) |
+| 软件 | [版本](https://www.hiascend.com/zh/) |
| :-----------------------: |:----------------------------------:|
| Python | 3.8 |
-| driver | Ascend HDK 23.0.0 |
-| firmware | Ascend HDK 23.0.0 |
-| CANN | CANN 7.0.0 |
+| driver | Ascend HDK 24.1.RC1 |
+| firmware | Ascend HDK 24.1.RC1 |
+| CANN | CANN 8.0.RC1 |
| torch | 2.1.0 |
-| torch_npu | release v5.0.0 |
+| torch_npu | release v6.0.RC1 |

The current repository uses Megatron commitId [bcce6f54](https://github.com/NVIDIA/Megatron-LM/tree/bcce6f54e075e3c3374ea67adefe54f3f2da2b07)
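To pin a local Megatron-LM checkout to the commit referenced above, one possible sequence (not part of the repository's own setup instructions) is:

```shell
# Clone Megatron-LM and check out the commit pinned in the README
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout bcce6f54e075e3c3374ea67adefe54f3f2da2b07
```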
@@ -161,7 +161,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the dataset, tokenizer, and weight paths
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
-TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model"  # tokenizer path
+TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model"  # tokenizer path
DATA_PATH="./dataset/internlm-7b/alpaca_text_document"  # dataset path
```
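Before launching the training script it is worth confirming that the paths above exist. A minimal sketch, assuming the directory layout used in these examples:

```shell
# Checkpoint output directory for this run
mkdir -p ./ckpt/internlm-7b/

# Converted weights, tokenizer, and preprocessed dataset must already be in place
ls ./model_weights/internlm-7b-v0.1-tp8-pp1/
ls ./model_from_hf/internlm-7b/tokenizer.model
ls ./dataset/internlm-7b/alpaca_text_document*   # typically the .bin/.idx pair produced by preprocess_data.py
```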
@@ -313,7 +313,7 @@ python ./tools/preprocess_data.py \
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the dataset, tokenizer, and weight paths
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
-TOKENIZER_PATH="./model_from_hf/internlm-65b/"  # tokenizer path
+TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model"  # tokenizer path
DATA_PATH="./dataset/internlm-65b/alpaca_text_document"  # dataset path
```
@@ -162,7 +162,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script origin dataset path according to your own dataset path
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
-TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model"  # tokenizer path
+TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model"  # tokenizer path
DATA_PATH="./dataset/internlm-7b/alpaca_text_document"  # processed dataset
```
@@ -312,7 +312,7 @@ python ./tools/preprocess_data.py \
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script origin dataset path according to your own dataset path
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
-TOKENIZER_PATH="./model_from_hf/internlm-65b/"  # tokenizer path
+TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model"  # tokenizer path
DATA_PATH="./dataset/internlm-65b/alpaca_text_document"  # processed dataset
```
@@ -353,12 +353,12 @@ bash tasks/finetune/tune_llama_13b_ptd.sh

Performance comparison of LLaMA-7B/13B on **Ascend NPU** and **Reference** devices:

-| Device | Hardware | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/p/s) | Time per step (s/step) | Compute (TFLOPs/s) |
-|------|-----------|-----------|------|--------------------|----------------------|-----------------|------------------|
-| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 | 159.9 |
-| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 | 161.5 |
-| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 | 200.57 |
-| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.65 | 213.29 |
+| Device | Hardware | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/p/s) | Time per step (s/step) |
+|------|-----------|-----------|------|--------------------|----------------------|-----------------|
+| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 |
+| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 |
+| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 |
+| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.65 |
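As a rough consistency check on the throughput columns (assuming the 2048-token sequence length shown in the example training log further below), token throughput ≈ sample throughput × sequence length: 1.75 samples/p/s × 2048 ≈ 3584, close to the reported 3600 tokens/p/s.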
@@ -705,7 +705,7 @@ NODE_RANK=0

```Shell
iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | gloabl batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
-iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
+iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 |
time (ms)
```
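To pull these throughput figures out of a captured training log rather than reading them off the console, a sketch assuming the output above was redirected to a file named `train.log` (a hypothetical name, not produced by the scripts themselves):

```shell
# Extract per-iteration throughput and step time from a captured training log
grep -oE 'samples per second: [0-9.]+' train.log
grep -oE 'elapsed time per iteration \(ms\): [0-9.]+' train.log
```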
@@ -347,12 +347,12 @@ bash tasks/finetune/tune_llama_13b_ptd.sh

The performance of LLaMA-7B/13B in **Ascend NPU** and **Reference**:

-| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
-|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|-------------------------------------|
-| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 | 159.9 |
-| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 | 161.5 |
-| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 | 200.57 |
-| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.6 | 213.29 |
+| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
+|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|
+| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 |
+| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 |
+| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 |
+| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.6 |
@@ -686,7 +686,7 @@ The Training log will look like these:

```Shell
iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | gloabl batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
-iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
+iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 |
time (ms)
```
@@ -253,10 +253,10 @@ python tools/checkpoint/util.py

Performance comparison of LLaMA2-7B on **Ascend NPU** and **Reference** devices:

-| Device | Model | Iterations | Sample throughput (samples/step) | Token throughput (tokens/s/p) | Time per step (s/step) | Compute (TFLOPs/s) |
-| :--: | :-------: | :----: | :---------------------: | :---------------------: | :-------------------: | :-------------------: |
-| NPUs | LLaMA2-7B | 1024 | 5.63 | 2730 | 2.84 | 131.96 |
-| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 | 131.96 |
+| Device | Model | Iterations | Sample throughput (samples/step) | Token throughput (tokens/s/p) | Time per step (s/step) |
+| :--: | :-------: | :----: | :---------------------: | :---------------------: | :-------------------: |
+| NPUs | LLaMA2-7B | 1024 | 5.63 | 2730 | 2.84 |
+| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 |

## Inference-7B
@@ -599,10 +599,10 @@ python tools/checkpoint/util.py \

Performance comparison of LLaMA2-13B on **Ascend NPU** and **Reference** devices:

-| Device | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/s/p) | Time per step (s/step) | Compute (TFLOPs/s) |
-| :--: | :--------: | :----: | :--------------------: | :---------------------: | :-------------------: | :-------------------: |
-| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 | 133.77 |
-| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
+| Device | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/s/p) | Time per step (s/step) |
+| :--: | :--------: | :----: | :--------------------: | :---------------------: | :-------------------: |
+| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 |
+| Reference | LLaMA2-13B | -- | -- | 1750 | -- |

## Inference
@@ -272,10 +272,10 @@ Here's a hardware summary of pre-training LLAMA2-7B:

The performance of LLaMA2-7B in **Ascend NPU** and **Reference**:

-| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
-| :------: | :-----------: | :----------------: | :-----------------------------: | :----------------------------: | :-------------------------: | :-----------------------------------: |
-| NPUs | LLaMA2-7B | 1024 | 5.19 | 2730 | 3.08 | 122.39 |
-| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 | 131.96 |
+| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
+| :------: | :-----------: | :----------------: | :-----------------------------: | :----------------------------: | :-------------------------: |
+| NPUs | LLaMA2-7B | 1024 | 5.19 | 2730 | 3.08 |
+| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 |
@@ -648,10 +648,10 @@ Here's a hardware summary of pre-training LLaMA2-13B:

The performance of LLaMA2-13B in **Ascend NPU** and **Reference**:

-| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
-| :-------: | :--------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: | :---------------------------------: |
-| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 | 133.77 |
-| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
+| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
+| :-------: | :--------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: |
+| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 |
+| Reference | LLaMA2-13B | -- | -- | 1750 | -- |

## Inference
@@ -54,11 +54,12 @@ GPT_ARGS="
    --no-masked-softmax-fusion \
    --attention-softmax-in-fp32 \
    --min-lr 1.0e-7 \
-   --weight-decay 1e-2 \
+   --weight-decay 0.1 \
    --clip-grad 1.0 \
    --adam-beta1 0.9 \
    --initial-loss-scale 4096.0 \
-   --adam-beta2 0.999 \
+   --adam-beta2 0.95 \
    --adam-eps 1e-5 \
    --no-gradient-accumulation-fusion \
    --load ${CKPT_LOAD_DIR} \
    --no-load-optim \
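Assembled in one place for readability, the optimizer-related flags after this hunk would read roughly as follows. This is a sketch built only from the lines above; the script's remaining GPT_ARGS entries are omitted, and the target file (presumably the llama2-70B pretraining script named in the commit title) is not shown in this excerpt:

```shell
# Optimizer-related GPT_ARGS after this change (other flags omitted)
GPT_ARGS="
    --min-lr 1.0e-7 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --adam-beta1 0.9 \
    --initial-loss-scale 4096.0 \
    --adam-beta2 0.95 \
    --adam-eps 1e-5
"
```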