!1250 Update the repository's supported software versions & revise the README & adjust the llama2-70B learning rate

Merge pull request !1250 from liuyanghan/1.0.0
liuyanghan 2024-04-29 01:34:46 +00:00 committed by i-robot
parent f33da62b90
commit 4ad901b67d
9 changed files with 46 additions and 45 deletions


@@ -232,11 +232,11 @@ ModelLink aims to provide, for Huawei [Ascend chips](https://open.codehub.huawei.com/OpenBaiz
| Software | [Version](https://www.hiascend.com/zh/) |
| :-----------------------: |:----------------------------------:|
| Python | 3.8 |
| driver | Ascend HDK 23.0.0 |
| firmware | Ascend HDK 23.0.0 |
| CANN | CANN 7.0.0 |
| driver | Ascend HDK 24.1.RC1 |
| firmware | Ascend HDK 24.1.RC1 |
| CANN | CANN 8.0.RC1 |
| torch | 2.1.0 |
| torch_npu | release v5.0.0 |
| torch_npu | release v6.0.RC1 |
The Megatron commitId used by the current repository is [bcce6f54](https://github.com/NVIDIA/Megatron-LM/tree/bcce6f54e075e3c3374ea67adefe54f3f2da2b07)
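As a side note, the driver/firmware/CANN versions listed above can be confirmed on the host before upgrading. The commands below are a minimal sketch that assumes a default installation under /usr/local/Ascend; the exact file locations may differ between installations.

```Shell
# Minimal sketch: confirm the installed Ascend driver / firmware / CANN versions.
# Assumes a default install under /usr/local/Ascend; adjust paths for your system.
npu-smi info                                                                 # NPU driver version and device status
cat /usr/local/Ascend/driver/version.info                                    # driver package version file (path may vary)
cat /usr/local/Ascend/ascend-toolkit/latest/*/ascend_toolkit_install.info    # CANN toolkit version (path may vary)
```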


@@ -230,14 +230,14 @@ For the supported models listed above, we provide training scripts and readme in
【Please note the corresponding environment versions for model usage, as follows】
| Software | [Version](https://www.hiascend.com/zh/) |
| Software | [Version](https://www.hiascend.com/zh/) |
| :-----------------------: |:----------------------------------:|
| Python | 3.8 |
| driver | Ascend HDK 23.0.0 |
| firmware | Ascend HDK 23.0.0 |
| CANN | CANN 7.0.0 |
| driver | Ascend HDK 24.1.RC1 |
| firmware | Ascend HDK 24.1.RC1 |
| CANN | CANN 8.0.RC1 |
| torch | 2.1.0 |
| torch_npu | release v5.0.0 |
| torch_npu | release v6.0.RC1 |
The current repository uses Megatron commitId [bcce6f54](https://github.com/NVIDIA/Megatron-LM/tree/bcce6f54e075e3c3374ea67adefe54f3f2da2b07)
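Similarly, the Python-side versions in the table (torch and torch_npu) can be checked directly in the target environment. This is a sketch that assumes torch_npu exposes `__version__` and patches the `torch.npu` namespace, as recent releases do.

```Shell
# Sketch: check that the Python stack matches the versions in the table above.
python -c "import torch; print('torch', torch.__version__)"
python -c "import torch_npu; print('torch_npu', torch_npu.__version__)"
python -c "import torch, torch_npu; print('NPU available:', torch.npu.is_available())"
```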


@@ -161,7 +161,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer, and weight paths as needed
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
```
@@ -313,7 +313,7 @@ python ./tools/preprocess_data.py \
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer, and weight paths as needed
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_PATH="./model_from_hf/internlm-65b/" #词表路径
TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
```
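Since this change repoints the variable from the tokenizer directory to the tokenizer.model file itself (and renames it to TOKENIZER_MODEL), a quick pre-flight check such as the sketch below can catch path mistakes before launching training. The .bin/.idx suffixes assume the standard Megatron indexed-dataset layout produced by preprocess_data.py.

```Shell
# Pre-flight sketch: make sure the configured paths exist before launching training.
test -f "${TOKENIZER_MODEL}" || echo "missing tokenizer model: ${TOKENIZER_MODEL}"
{ test -f "${DATA_PATH}.bin" && test -f "${DATA_PATH}.idx"; } || echo "missing preprocessed dataset at prefix: ${DATA_PATH}"
mkdir -p "${CKPT_SAVE_DIR}"    # create the checkpoint directory if it does not exist yet
```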


@@ -162,7 +162,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the script's original dataset path according to your own dataset path
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
```
@@ -312,7 +312,7 @@ python ./tools/preprocess_data.py \
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the script's original dataset path according to your own dataset path
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_PATH="./model_from_hf/internlm-65b/" #tokenizer path
TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
```
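The renamed TOKENIZER_MODEL variable must point at the tokenizer model file rather than its parent directory. Assuming the InternLM tokenizer.model is a SentencePiece model (as the file name suggests) and the sentencepiece package is installed, it can be sanity-checked like this:

```Shell
# Sketch: verify tokenizer.model loads as a SentencePiece model (assumes `pip install sentencepiece`).
python -c "import sentencepiece as spm; sp = spm.SentencePieceProcessor(model_file='./model_from_hf/internlm-7b/tokenizer.model'); print('vocab size:', sp.vocab_size())"
```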


@@ -353,12 +353,12 @@ bash tasks/finetune/tune_llama_13b_ptd.sh
Performance comparison of LLaMA-7B/13B on **Ascend NPU** and **Reference** chips:
| Device | Hardware | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/p/s) | Single-step time (s/step) | FLOPs (TFLOPs/s) |
|------|-----------|-----------|------|--------------------|----------------------|-----------------|------------------|
| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 | 159.9 |
| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 | 161.5 |
| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 | 200.57 |
| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.65 | 213.29 |
| Device | Hardware | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/p/s) | Single-step time (s/step) |
|------|-----------|-----------|------|--------------------|----------------------|-----------------|
| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 |
| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 |
| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 |
| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.65 |
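As a rough cross-check, the token throughput column is approximately the sample throughput multiplied by the 2048-token sequence length used in these runs; the small gap comes from rounding in the reported figures.

```Shell
# Rough cross-check: tokens/p/s ≈ samples/p/s × sequence length (2048)
python -c "print(1.75 * 2048)"   # ≈ 3584, close to the reported 3600 tokens/p/s for LLaMA-7B on NPUs
python -c "print(0.92 * 2048)"   # ≈ 1884, close to the reported 1895 tokens/p/s for LLaMA-13B on NPUs
```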
@@ -705,7 +705,7 @@ NODE_RANK=0
```Shell
iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | global batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 |
time (ms)
```


@@ -347,12 +347,12 @@ bash tasks/finetune/tune_llama_13b_ptd.sh
The performance of LLaMA-7B/13B in **Ascend NPU** and **Reference**:
| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|-------------------------------------|
| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 | 159.9 |
| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 | 161.5 |
| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 | 200.57 |
| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.6 | 213.29 |
| Device | Hardware | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
|-----------|-----------|-----------|------------------|-------------------------------|------------------------------|---------------------------|
| NPUs | 910 1*8p | LLaMA-7B | 2048 | 1.75 | 3600 | 18.2 |
| Reference | - | LLaMA-7B | 2048 | 1.85 | 3804 | 18.5 |
| NPUs | 910 1*8p | LLaMA-13B | 2048 | 0.92 | 1895 | 17.2 |
| Reference | - | LLaMA-13B | 2048 | 0.96 | 2012 | 16.6 |
@@ -686,7 +686,7 @@ The training log will look like this:
```Shell
iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | global batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped
iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 |
time (ms)
```
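The fields in the sample log line are internally consistent, which makes them useful as a quick sanity check on a new run; for the line above:

```Shell
# Quick consistency checks against the sample log line above
python -c "print(512 * 11)"        # consumed samples after 11 iterations at global batch size 512 = 5632
python -c "print(5632 * 2048)"     # consumed tokens = consumed samples × actual seqlen = 11534336
python -c "print(512 / 52.7281)"   # samples per second ≈ global batch size / iteration time ≈ 9.71
```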


@@ -253,10 +253,10 @@ python tools/checkpoint/util.py
Performance comparison of LLaMA2-7B on **Ascend NPU** and **Reference** chips:
| Device | Model | Iterations | Sample throughput (samples/step) | Token throughput (tokens/s/p) | Single-step time (s/step) | FLOPs (TFLOPs/s) |
| :--: | :-------: | :----: | :---------------------: | :---------------------: | :-------------------: | :-------------------: |
| NPUs | LLaMA2-7B | 1024 | 5.63 | 2730 | 2.84 | 131.96 |
| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 | 131.96 |
| Device | Model | Iterations | Sample throughput (samples/step) | Token throughput (tokens/s/p) | Single-step time (s/step) |
| :--: | :-------: | :----: | :---------------------: | :---------------------: | :-------------------: |
| NPUs | LLaMA2-7B | 1024 | 5.63 | 2730 | 2.84 |
| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 |
## Inference-7B
@@ -599,10 +599,10 @@ python tools/checkpoint/util.py \
Performance comparison of LLaMA2-13B on **Ascend NPU** and **Reference** chips:
| Device | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/s/p) | Single-step time (s/step) | FLOPs (TFLOPs/s) |
| :--: | :--------: | :----: | :--------------------: | :---------------------: | :-------------------: | :-------------------: |
| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 | 133.77 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
| Device | Model | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/s/p) | Single-step time (s/step) |
| :--: | :--------: | :----: | :--------------------: | :---------------------: | :-------------------: |
| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- |
## Inference


@@ -272,10 +272,10 @@ Here's a hardware summary of pre-training LLAMA2-7B:
The performance of LLaMA2-7B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
| :------: | :-----------: | :----------------: | :-----------------------------: | :----------------------------: | :-------------------------: | :-----------------------------------: |
| NPUs | LLaMA2-7B | 1024 | 5.19 | 2730 | 3.08 | 122.39 |
| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 | 131.96 |
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
| :------: | :-----------: | :----------------: | :-----------------------------: | :----------------------------: | :-------------------------: |
| NPUs | LLaMA2-7B | 1024 | 5.19 | 2730 | 3.08 |
| Reference | LLaMA2-7B | 1024 | 5.63 | 2884 | 2.84 |
@@ -648,10 +648,10 @@ Here's a hardware summary of pre-training LLaMA2-13B:
The performance of LLaMA2-13B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
| :-------: | :--------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: | :---------------------------------: |
| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 | 133.77 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
| :-------: | :--------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: |
| NPUs | LLaMA2-13B | 5000 | 3.027 | 1550 | 5.285 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- |
## Inference


@@ -54,11 +54,12 @@ GPT_ARGS="
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.0e-7 \
--weight-decay 1e-2 \
--weight-decay 0.1 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--initial-loss-scale 4096.0 \
--adam-beta2 0.999 \
--adam-beta2 0.95 \
--adam-eps 1e-5 \
--no-gradient-accumulation-fusion \
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
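Since the rendered diff interleaves old and new values, the optimizer-related portion of GPT_ARGS for llama2-70B after this change reads roughly as follows; this is an excerpt reconstructed from the hunk above, with the surrounding, unchanged arguments omitted.

```Shell
# Excerpt (sketch): optimizer-related GPT_ARGS for llama2-70B after this change
    --min-lr 1.0e-7 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --adam-beta1 0.9 \
    --initial-loss-scale 4096.0 \
    --adam-beta2 0.95 \
    --adam-eps 1e-5 \
```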