diff --git a/README.md b/README.md
index 283c228..051319f 100644
--- a/README.md
+++ b/README.md
@@ -232,11 +232,11 @@ ModelLink旨在为华为 [昇腾芯片](https://open.codehub.huawei.com/OpenBaiz
 | 软件 | [版本](https://www.hiascend.com/zh/) |
 | :-----------------------: |:----------------------------------:|
 | Python | 3.8 |
-| driver | Ascend HDK 23.0.0 |
-| firmware | Ascend HDK 23.0.0 |
-| CANN | CANN 7.0.0 |
+| driver | Ascend HDK 24.1.RC1 |
+| firmware | Ascend HDK 24.1.RC1 |
+| CANN | CANN 8.0.RC1 |
 | torch | 2.1.0 |
-| torch_npu | release v5.0.0 |
+| torch_npu | release v6.0.RC1 |

 当前仓库使用的megatron commitId为[bcce6f54](https://github.com/NVIDIA/Megatron-LM/tree/bcce6f54e075e3c3374ea67adefe54f3f2da2b07)
diff --git a/README_en.md b/README_en.md
index f4d3363..98de898 100644
--- a/README_en.md
+++ b/README_en.md
@@ -230,14 +230,14 @@ For the supported models listed above, we provide training scripts and readme in
 【Please note the corresponding environment versions for model usage, as follows】

 | Software | [Version](https://www.hiascend.com/zh/) |
 | :-----------------------: |:----------------------------------:|
 | Python | 3.8 |
-| driver | Ascend HDK 23.0.0 |
-| firmware | Ascend HDK 23.0.0 |
-| CANN | CANN 7.0.0 |
+| driver | Ascend HDK 24.1.RC1 |
+| firmware | Ascend HDK 24.1.RC1 |
+| CANN | CANN 8.0.RC1 |
 | torch | 2.1.0 |
-| torch_npu | release v5.0.0 |
+| torch_npu | release v6.0.RC1 |

 The current repository uses Megatron commitId [bcce6f54](https://github.com/NVIDIA/Megatron-LM/tree/bcce6f54e075e3c3374ea67adefe54f3f2da2b07)
diff --git a/examples/intern/README.md b/examples/intern/README.md
index 85c5827..8a70be6 100644
--- a/examples/intern/README.md
+++ b/examples/intern/README.md
@@ -161,7 +161,7 @@
 source /usr/local/Ascend/ascend-toolkit/set_env.sh
 # 修改数据集,词表,权重等路径
 CKPT_SAVE_DIR="./ckpt/internlm-7b/"
 CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
-TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
+TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
 DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
 ```
@@ -313,7 +313,7 @@ python ./tools/preprocess_data.py \
 source /usr/local/Ascend/ascend-toolkit/set_env.sh
 # 修改数据集,词表,权重等路径
 CKPT_SAVE_DIR="./ckpt/internlm-65b/"
-TOKENIZER_PATH="./model_from_hf/internlm-65b/" #词表路径
+TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #词表路径
 DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
 ```
diff --git a/examples/intern/README_en.md b/examples/intern/README_en.md
index d2898c4..ad76b1e 100644
--- a/examples/intern/README_en.md
+++ b/examples/intern/README_en.md
@@ -162,7 +162,7 @@
 source /usr/local/Ascend/ascend-toolkit/set_env.sh
 # modify script orign dataset path according to your own dataset path
 CKPT_SAVE_DIR="./ckpt/internlm-7b/"
 CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
-TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
+TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
 DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
 ```
@@ -312,7 +312,7 @@ python ./tools/preprocess_data.py \
 source /usr/local/Ascend/ascend-toolkit/set_env.sh
 # modify script orign dataset path according to your own dataset path
 CKPT_SAVE_DIR="./ckpt/internlm-65b/"
-TOKENIZER_PATH="./model_from_hf/internlm-65b/" #tokenizer path
+TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #tokenizer path
 DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
 ```
diff --git a/examples/llama/README.md b/examples/llama/README.md
index 6c8579a..411a487 100644
--- a/examples/llama/README.md
+++ b/examples/llama/README.md
@@ -353,12 +353,12 @@ bash tasks/finetune/tune_llama_13b_ptd.sh

 LLaMA-7B/13B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:

-| 设备 | 硬件 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
-|------|-----------|-----------|------|--------------------|----------------------|-----------------|------------------|
-| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 | 159.9 |
-| 参考 | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 | 161.5 |
-| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 | 200.57 |
-| 参考 | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.65 | 213.29 |
+| 设备 | 硬件 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) |
+|------|-----------|-----------|------|--------------------|----------------------|-----------------|
+| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 |
+| 参考 | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 |
+| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 |
+| 参考 | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.65 |
@@ -705,7 +705,7 @@ NODE_RANK=0
 ```Shell
 iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | gloabl batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
-iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
+iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | time (ms)
 ```
diff --git a/examples/llama/README_en.md b/examples/llama/README_en.md
index 778f891..4454913 100644
--- a/examples/llama/README_en.md
+++ b/examples/llama/README_en.md
@@ -347,12 +347,12 @@ bash tasks/finetune/tune_llama_13b_ptd.sh

 The performance of LLaMA-7B/13B in **Ascend NPU** and **Reference**:

-| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
-|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|-------------------------------------|
-| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 | 159.9 |
-| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 | 161.5 |
-| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 | 200.57 |
-| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.6 | 213.29 |
+| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
+|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|
+| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 |
+| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 |
+| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 |
+| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.6 |
@@ -686,7 +686,7 @@ The Training log will look like these:
 ```Shell
 iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | gloabl batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
-iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
+iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | time (ms)
 ```
diff --git a/examples/llama2/README.md b/examples/llama2/README.md
index f6e6687..4bbb3ff 100755
--- a/examples/llama2/README.md
+++ b/examples/llama2/README.md
@@ -253,10 +253,10 @@ python tools/checkpoint/util.py

 LLaMA2-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:

-| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/step) | tokens吞吐 (tokens/s/p) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
-| :--: | :-------: | :----: | :---------------------: | :---------------------: | :-------------------: | :-------------------: |
-| NPUs | LLaMA2-7B | 1024 | 5.63 | 2730 | 2.84 | 131.96 |
-| 参考 | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 | 131.96 |
+| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/step) | tokens吞吐 (tokens/s/p) | 单步迭代时间 (s/step) |
+| :--: | :-------: | :----: | :---------------------: | :---------------------: | :-------------------: |
+| NPUs | LLaMA2-7B | 1024 | 5.63 | 2730 | 2.84 |
+| 参考 | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 |

 ## 推理-7B
@@ -599,10 +599,10 @@ python tools/checkpoint/util.py \

 LLaMA2-13B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:

-| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | tokens吞吐 (tokens/s/p) | 单步迭代时间 (s/step) | 浮点计算数 (TFLOPs/s) |
-| :--: | :--------: | :----: | :--------------------: | :---------------------: | :-------------------: | :-------------------: |
-| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 | 133.77 |
-| 参考 | LLaMA2-13B | -- | -- | 1750 | -- | -- |
+| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/p/s) | tokens吞吐 (tokens/s/p) | 单步迭代时间 (s/step) |
+| :--: | :--------: | :----: | :--------------------: | :---------------------: | :-------------------: |
+| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 |
+| 参考 | LLaMA2-13B | -- | -- | 1750 | -- |

 ## 推理
diff --git a/examples/llama2/README_en.md b/examples/llama2/README_en.md
index 0606c89..71911a0 100644
--- a/examples/llama2/README_en.md
+++ b/examples/llama2/README_en.md
@@ -272,10 +272,10 @@ Here's a hardware summary of pre-training LLAMA2-7B:

 The performance of LLaMA2-7B in **Ascend NPU** and **Reference**:

-| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
-| :------: | :-----------: | :----------------: | :-----------------------------: | :----------------------------: | :-------------------------: | :-----------------------------------: |
-| NPUs | LLaMA2-7B | 1024 | 5.19 | 2730 | 3.08 | 122.39 |
-| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 | 131.96 |
+| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
+| :------: | :-----------: | :----------------: | :-----------------------------: | :----------------------------: | :-------------------------: |
+| NPUs | LLaMA2-7B | 1024 | 5.19 | 2730 | 3.08 |
+| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 |
@@ -648,10 +648,10 @@ Here's a hardware summary of pre-training LLaMA2-13B:

 The performance of LLaMA2-13B in **Ascend NPU** and **Reference**:

-| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
-| :-------: | :--------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: | :---------------------------------: |
-| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 | 133.77 |
-| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
+| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
+| :-------: | :--------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: |
+| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 |
+| Reference | LLaMA2-13B | -- | -- | 1750 | -- |

 ## Inference
diff --git a/examples/llama2/pretrain_llama2_70b_ptd.sh b/examples/llama2/pretrain_llama2_70b_ptd.sh
index cec37aa..800f9b0 100644
--- a/examples/llama2/pretrain_llama2_70b_ptd.sh
+++ b/examples/llama2/pretrain_llama2_70b_ptd.sh
@@ -54,11 +54,12 @@ GPT_ARGS="
     --no-masked-softmax-fusion \
     --attention-softmax-in-fp32 \
     --min-lr 1.0e-7 \
-    --weight-decay 1e-2 \
+    --weight-decay 0.1 \
     --clip-grad 1.0 \
     --adam-beta1 0.9 \
     --initial-loss-scale 4096.0 \
-    --adam-beta2 0.999 \
+    --adam-beta2 0.95 \
+    --adam-eps 1e-5 \
     --no-gradient-accumulation-fusion \
     --load ${CKPT_LOAD_DIR} \
     --no-load-optim \
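The optimizer flags changed in the last hunk bring the 70B script in line with the AdamW hyperparameters reported for LLaMA2 pretraining (beta1 = 0.9, beta2 = 0.95, eps = 1e-5, weight decay 0.1). As a minimal sketch of what these Megatron-LM flags correspond to in plain PyTorch, the snippet below builds an equivalent standalone optimizer; the `model` and the peak learning rate are illustrative placeholders, not values taken from the patch:

```python
import torch

# Placeholder module; in the real script Megatron-LM constructs the GPT model itself.
model = torch.nn.Linear(1024, 1024)

# Equivalent of the patched GPT_ARGS optimizer flags:
#   --adam-beta1 0.9  --adam-beta2 0.95  --adam-eps 1e-5  --weight-decay 0.1
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # illustrative peak LR; the script only pins the floor via --min-lr 1.0e-7
    betas=(0.9, 0.95),  # --adam-beta1 / --adam-beta2
    eps=1e-5,           # --adam-eps
    weight_decay=0.1,   # --weight-decay
)
```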