diff --git a/README.md b/README.md
index 6d243ea..1c9c049 100644
--- a/README.md
+++ b/README.md
@@ -114,7 +114,7 @@ ModelLink旨在为华为 [昇腾芯片](https://open.codehub.huawei.com/OpenBaiz
7B1 |
1x8 |
FP16 |
- 2611 |
+ 2034 |
2525 |
训练 |
@@ -122,7 +122,7 @@ ModelLink旨在为华为 [昇腾芯片](https://open.codehub.huawei.com/OpenBaiz
176B |
12x8 |
BF16 |
- 112 |
+ 100 |
107 |
训练 |
@@ -232,8 +232,8 @@ ModelLink旨在为华为 [昇腾芯片](https://open.codehub.huawei.com/OpenBaiz
72B |
16x8 |
BF16 |
- -- |
- -- |
+ 285 |
+ 345 |
训练 |
@@ -417,8 +417,8 @@ ModelLink旨在为华为 [昇腾芯片](https://open.codehub.huawei.com/OpenBaiz
72B |
-- |
- -- |
- -- |
+ 对话 |
+ 评估 |
alpaca_data.json |
diff --git a/README_en.md b/README_en.md
index 693d4ab..5508115 100644
--- a/README_en.md
+++ b/README_en.md
@@ -112,7 +112,7 @@ Currently, the following downstream tasks have been supported:
7B1 |
1x8 |
FP16 |
- 2611 |
+ 2034 |
2525 |
Train |
@@ -120,7 +120,7 @@ Currently, the following downstream tasks have been supported:
176B |
12x8 |
BF16 |
- 112 |
+ 100 |
107 |
Train |
@@ -230,8 +230,8 @@ Currently, the following downstream tasks have been supported:
72B |
16x8 |
BF16 |
- -- |
- -- |
+ 285 |
+ 345 |
Train |
@@ -414,8 +414,8 @@ Currently, the following downstream tasks have been supported:
72B |
-- |
- -- |
- -- |
+ inference |
+ evaluation |
alpaca_data.json |
diff --git a/examples/qwen/README.md b/examples/qwen/README.md
index f987711..c99ecb2 100644
--- a/examples/qwen/README.md
+++ b/examples/qwen/README.md
@@ -28,6 +28,8 @@
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
+ - [推理](#推理)
+ - [评估](#评估)
# Qwen-7B
@@ -140,9 +142,9 @@ Qwen-7B 训练的硬件配置:
cd ..
```
-5. 微调
+5. 预训练
- 配置Qwen-7B 微调脚本: examples/qwen/pretrain_qwen_7b_ptd.sh
+ 配置Qwen-7B 预训练脚本: examples/qwen/pretrain_qwen_7b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
@@ -155,7 +157,7 @@ Qwen-7B 训练的硬件配置:
CKPT_LOAD_DIR="your megatron ckpt save path"
```
- 启动 Qwen-7B 微调脚本: examples/qwen/pretrain_qwen_7b_ptd.sh
+ 启动 Qwen-7B 预训练脚本: examples/qwen/pretrain_qwen_7b_ptd.sh
```shell
bash examples/qwen/pretrain_qwen_7b_ptd.sh
@@ -222,10 +224,10 @@ TASK="mmlu" # ceval任务配置为 "ceval"
bash tasks/evaluation/evaluate_qwen_7b_ptd.sh
```
-| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
-|:---:|:---:|:---:|:---:|:---:|
-| CEval | 52 | 1346 | 63.5 | 62.5 |
-| MMLU | 57 | 14042 | 58.2 | 58.1 |
+| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
+|:---:|:---:|:---:|:-----------------------------------------------------------------------------:|:---:|
+| CEval | 52 | 1346 | [63.5](https://huggingface.co/Qwen/Qwen-7B) | 62.5 |
+| MMLU | 57 | 14042 | [58.2](https://huggingface.co/Qwen/Qwen-7B) | 58.1 |
# Qwen-14B
@@ -345,9 +347,9 @@ Qwen-14B 训练的硬件配置:
cd ..
```
-5. 微调
+5. 预训练
- 配置Qwen-14B 微调脚本: examples/qwen/pretrain_qwen_14b_ptd.sh
+ 配置Qwen-14B 预训练脚本: examples/qwen/pretrain_qwen_14b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
@@ -360,7 +362,7 @@ Qwen-14B 训练的硬件配置:
CKPT_LOAD_DIR="your megatron ckpt save path"
```
- 启动 Qwen-14B 微调脚本: examples/qwen/pretrain_qwen_14b_ptd.sh
+ 启动 Qwen-14B 预训练脚本: examples/qwen/pretrain_qwen_14b_ptd.sh
```shell
bash examples/qwen/pretrain_qwen_14b_ptd.sh
@@ -424,10 +426,10 @@ TASK="mmlu" # ceval任务配置为 "ceval"
bash tasks/evaluation/evaluate_qwen_14b_ptd.sh
```
-| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
-|:---:|:---:|:---:|:---:|:---:|
-| CEval | 52 | 1346 | 72.1 | 71.1 |
-| MMLU | 57 | 14042 | 66.3 | 65.3 |
+| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
+|:---:|:---:|:---:|:--------------------------------------------:|:---:|
+| CEval | 52 | 1346 | [72.1](https://huggingface.co/Qwen/Qwen-14B) | 71.1 |
+| MMLU | 57 | 14042 | [66.3](https://huggingface.co/Qwen/Qwen-14B) | 65.3 |
# Qwen-72B
@@ -436,9 +438,10 @@ bash tasks/evaluation/evaluate_qwen_14b_ptd.sh
Qwen-72B 训练的硬件配置:
-| 硬件 | 配置 |
-| :--: |:-----------------:|
-| NPU | 128 x Ascend NPUs |
+| 硬件 | 序列长度 | 配置 |
+|:---:|:----:|:-----------------:|
+| NPU | 8k | 64 x Ascend NPUs |
+| NPU | 32k | 320 x Ascend NPUs |
### 脚本
@@ -522,15 +525,15 @@ Qwen-72B 训练的硬件配置:
--tokenizer-name-or-path ../qwen-72b-hf \
--output-prefix ../dataset_qwen-72b/alpaca \
--tokenizer-type PretrainedFromHF \
- --seq-length 32768 \
+ --seq-length 8192 \
--workers 4 \
--log-interval 1000 \
cd ..
```
-5. 微调
+5. 预训练
- 配置Qwen-72B 微调脚本: examples/qwen/pretrain_qwen_72b_ptd.sh
+ 配置Qwen-72B 预训练脚本: examples/qwen/pretrain_qwen_72b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
@@ -542,8 +545,18 @@ Qwen-72B 训练的硬件配置:
DATA_PATH="./dataset_qwen-72b/alpaca_text_document" #数据集路径
CKPT_LOAD_DIR="your megatron ckpt save path"
```
+
+ 若使用32k长序列,需要开启重计算特性并修改seq-length参数值为32768,参数配置如下:
- 启动 Qwen-72B 微调脚本: examples/qwen/pretrain_qwen_72b_ptd.sh
+ ```shell
+ --seq-length 32768 \
+
+ --recompute-granularity full \
+ --recompute-method block \
+ --recompute-num-layers 2 \
+ ```
+
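+ 除上述参数外，32k 长序列还需按上文硬件配置表调整集群规模。以下为按每节点 8 卡折算的示意配置（变量名沿用 examples/qwen/pretrain_qwen_72b_ptd.sh，仅供参考）:
+
+ ```shell
+ # 仅为示意: 32k 序列共需 320 块 NPU，按每节点 8 卡即 40 个节点
+ NPUS_PER_NODE=8
+ NNODES=40              # 40 x 8 = 320
+ NODE_RANK=0            # 各节点填入自己的编号 (0~39)
+ MASTER_ADDR=localhost  # 多机训练时改为主节点 IP
+ ```
+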
+ 启动 Qwen-72B 预训练脚本: examples/qwen/pretrain_qwen_72b_ptd.sh
```shell
bash examples/qwen/pretrain_qwen_72b_ptd.sh
@@ -558,4 +571,57 @@ Qwen-72B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
| 设备 | 模型 | tokens吞吐 (tokens/s/p)(8k序列) | tokens吞吐 (tokens/s/p)(32k序列) |
|:----:|:--------:|:-----------------------:|:-----------------------:|
| NPUs | Qwen-72B | 285 | -- |
-| 参考 | Qwen-72B | 345 | -- |
\ No newline at end of file
+| 参考 | Qwen-72B | 345 | -- |
+
+
+## 推理
+
+配置 qwen-72b 推理脚本:tasks/inference/generate_qwen_72b_ptd.sh
+
+```bash
+# ascend-toolkit 路径
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+# 修改模型权重路径和词表路径
+CHECKPOINT="your model directory path"
+TOKENIZER_PATH=./qwen-72b-hf
+```
+
+启动qwen-72b推理脚本
+```bash
+bash tasks/inference/generate_qwen_72b_ptd.sh
+```
+
+推理示例如下:
+
+![Inference](../../sources/images/qwen/qwen_72b_inference.png)
+
+## 评估
+
+使用[CEval数据集](https://huggingface.co/datasets/ceval/ceval-exam)和[MMLU数据集](https://huggingface.co/datasets/cais/mmlu)评估模型。
+
+配置qwen-72b评估脚本: tasks/evaluation/evaluate_qwen_72b_ptd.sh
+
+```bash
+# ascend-toolkit 路径
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+# 修改模型参数路径和词表路径
+TOKENIZER_PATH=./qwen-72b-hf #词表路径
+CHECKPOINT="your model directory path" #模型路径
+
+# 配置任务和数据集路径
+DATA_PATH="./mmlu/data/test/" # ceval任务配置为 "./ceval/val/"
+TASK="mmlu" # ceval任务配置为 "ceval"
+```
+
+启动评估
+
+```bash
+bash tasks/evaluation/evaluate_qwen_72b_ptd.sh
+```
+
+| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
+|:---:|:---:|:---:|:--------------------------------------------:|:---:|
+| CEval | 52 | 1346 | [83.3](https://huggingface.co/Qwen/Qwen-72B) | 81.8 |
+| MMLU | 57 | 14042 | [77.4](https://huggingface.co/Qwen/Qwen-72B) | 74.6 |
\ No newline at end of file
diff --git a/examples/qwen/README_en.md b/examples/qwen/README_en.md
index 38008cc..0dcf2b6 100644
--- a/examples/qwen/README_en.md
+++ b/examples/qwen/README_en.md
@@ -27,6 +27,8 @@
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
+ - [Inference](#inference)
+ - [Evaluation](#evaluation)
# Qwen-7B
@@ -140,9 +142,9 @@ Here's a hardware summary of pre-training Qwen-7B:
cd ..
```
-5. fine-tuning
+5. pre-training
- Config Qwen-7B fine-tuning script: examples/qwen/pretrain_qwen_7b_ptd.sh
+ Config Qwen-7B pre-training script: examples/qwen/pretrain_qwen_7b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@@ -154,7 +156,7 @@ Here's a hardware summary of pre-training Qwen-7B:
CKPT_LOAD_DIR="your megatron ckpt save path"
```
- Launch Qwen-7B fine-tuning script: examples/qwen/pretrain_qwen_7b_ptd.sh
+ Launch Qwen-7B pre-training script: examples/qwen/pretrain_qwen_7b_ptd.sh
```shell
bash examples/qwen/pretrain_qwen_7b_ptd.sh
@@ -342,9 +344,9 @@ Here's a hardware summary of pre-training Qwen-14B:
cd ..
```
-5. fine-tuning
+5. pre-training
- Config Qwen-14B fine-tuning script: examples/qwen/pretrain_qwen_14b_ptd.sh
+ Config Qwen-14B pre-training script: examples/qwen/pretrain_qwen_14b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@@ -356,7 +358,7 @@ Here's a hardware summary of pre-training Qwen-14B:
CKPT_LOAD_DIR="your megatron ckpt save path"
```
- Launch Qwen-14B fine-tuning script: examples/qwen/pretrain_qwen_14b_ptd.sh
+ Launch Qwen-14B pre-training script: examples/qwen/pretrain_qwen_14b_ptd.sh
```shell
bash examples/qwen/pretrain_qwen_14b_ptd.sh
@@ -430,9 +432,10 @@ bash ./tasks/evaluation/evaluate_qwen_14b_ptd.sh
Here's a hardware summary of pre-training Qwen-72B:
-| Hardware | Value |
-| :------: |:-----------------:|
-| NPU | 128 x Ascend NPUs |
+| Hardware | Seq-length | Value |
+| :------: |:----------:|:-----------------:|
+| NPU | 8k | 64 x Ascend NPUs |
+| NPU | 32k | 320 x Ascend NPUs |
### Script
@@ -517,16 +520,16 @@ Here's a hardware summary of pre-training Qwen-72B:
--tokenizer-name-or-path ../qwen-72b-hf \
--output-prefix ../dataset_qwen-72b/alpaca \
--tokenizer-type PretrainedFromHF \
- --seq-length 32768 \
+ --seq-length 8192 \
--workers 4 \
--log-interval 1000 \
cd ..
```
-5. fine-tuning
+5. pre-training
- Config Qwen-72B fine-tuning script: examples/qwen/pretrain_qwen_72b_ptd.sh
+ Config Qwen-72B pre-training script: examples/qwen/pretrain_qwen_72b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@@ -537,8 +540,17 @@ Here's a hardware summary of pre-training Qwen-72B:
DATA_PATH="./dataset_qwen-72b/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="your megatron ckpt save path"
```
+
+ To train with a 32k sequence length, enable the recomputation feature and set seq-length to 32768. The additional parameters are as follows:
+ ```shell
+ --seq-length 32768 \
- Launch Qwen-72B fine-tuning script: examples/qwen/pretrain_qwen_72b_ptd.sh
+ --recompute-granularity full \
+ --recompute-method block \
+ --recompute-num-layers 2 \
+ ```
+
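+ Beyond these flags, the 32k sequence also requires scaling the cluster to match the hardware table above. A rough sketch, assuming 8 NPUs per node and reusing the variable names from examples/qwen/pretrain_qwen_72b_ptd.sh:
+
+ ```shell
+ # illustrative only: 320 NPUs for the 32k sequence means 40 nodes with 8 NPUs each
+ NPUS_PER_NODE=8
+ NNODES=40              # 40 x 8 = 320
+ NODE_RANK=0            # set to this node's rank (0..39) on each machine
+ MASTER_ADDR=localhost  # set to the master node IP for multi-node runs
+ ```
+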
+ Launch Qwen-72B pre-training script: examples/qwen/pretrain_qwen_72b_ptd.sh
```shell
bash examples/qwen/pretrain_qwen_72b_ptd.sh
@@ -554,3 +566,54 @@ The performance of Qwen-72B in **Ascend NPU** and **Reference**:
|:---------:|:-------:|:--------------------------------:|:---------------------------------:|
| NPUs | Qwen-72B | 285 | -- |
| Reference | Qwen-72B | 345 | -- |
+
+
+## Inference
+Config qwen-72b inference script: tasks/inference/generate_qwen_72b_ptd.sh
+
+```bash
+# ascend-toolkit path
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+# modify script model path and tokenizer path
+CHECKPOINT="your model directory path"
+TOKENIZER_PATH=./qwen-72b-hf
+```
+
+Launch qwen-72b inference script: tasks/inference/generate_qwen_72b_ptd.sh
+```bash
+bash tasks/inference/generate_qwen_72b_ptd.sh
+```
+
+Some inference samples are as follows:
+
+![Inference](../../sources/images/qwen/qwen_72b_inference.png)
+
+## Evaluation
+We use the [CEval benchmark](https://huggingface.co/datasets/ceval/ceval-exam) and [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
+
+Config qwen-72b evaluation script: tasks/evaluation/evaluate_qwen_72b_ptd.sh
+
+```bash
+# ascend-toolkit path
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+# Modify the model parameter path and vocabulary path
+TOKENIZER_PATH=./qwen-72b-hf # vocabulary path
+CHECKPOINT="your model directory path" # parameter path
+
+# Configure the task type and dataset path
+DATA_PATH="./mmlu/data/test/" # "./ceval/val/" for ceval task
+TASK="mmlu" # "ceval" for ceval task
+```
+
+Launch qwen-72b evaluation
+
+```bash
+bash ./tasks/evaluation/evaluate_qwen_72b_ptd.sh
+```
+
+| Task | Subset | Question | OpenSource | NPU |
+|:---:|:---:|:---:|:---:|:---:|
+| CEval | 52 | 1346 | 83.3 | 81.8 |
+| MMLU | 57 | 14042 | 77.4 | 74.6 |
\ No newline at end of file
diff --git a/examples/qwen/pretrain_qwen_14b_ptd.sh b/examples/qwen/pretrain_qwen_14b_ptd.sh
index e706527..a53a1fc 100644
--- a/examples/qwen/pretrain_qwen_14b_ptd.sh
+++ b/examples/qwen/pretrain_qwen_14b_ptd.sh
@@ -43,7 +43,7 @@ GPT_ARGS="
--global-batch-size 64 \
--make-vocab-size-divisible-by 32 \
--lr 1.25e-6 \
- --train-iters 2000 \
+ --train-iters 1000 \
--lr-decay-style cosine \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
diff --git a/examples/qwen/pretrain_qwen_72b_ptd.sh b/examples/qwen/pretrain_qwen_72b_ptd.sh
index dc4f7a3..8ecccb4 100644
--- a/examples/qwen/pretrain_qwen_72b_ptd.sh
+++ b/examples/qwen/pretrain_qwen_72b_ptd.sh
@@ -1,11 +1,12 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
+export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
-NNODES=1
+NNODES=8
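+# 8 nodes x 8 NPUs per node = 64 NPUs, the 8k-sequence configuration from the README hardware table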
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
@@ -15,7 +16,7 @@ TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model load ckpt path"
TP=8
-PP=1
+PP=8
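+# TP(8) x PP(8) = 64 = WORLD_SIZE for the 8k setup, i.e. data-parallel size 1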
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
@@ -36,7 +37,7 @@ GPT_ARGS="
--tokenizer-type PretrainedFromHF \
--load ${CKPT_LOAD_DIR} \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
- --seq-length 32768 \
+ --seq-length 8192 \
--max-position-embeddings 32768 \
--micro-batch-size 1 \
--global-batch-size 16 \
diff --git a/examples/qwen/pretrain_qwen_7b_ptd.sh b/examples/qwen/pretrain_qwen_7b_ptd.sh
index a6763e8..9e86897 100644
--- a/examples/qwen/pretrain_qwen_7b_ptd.sh
+++ b/examples/qwen/pretrain_qwen_7b_ptd.sh
@@ -43,7 +43,7 @@ GPT_ARGS="
--global-batch-size 64 \
--make-vocab-size-divisible-by 16 \
--lr 1.25e-6 \
- --train-iters 2000 \
+ --train-iters 1000 \
--lr-decay-style cosine \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
diff --git a/requirements.txt b/requirements.txt
index cf5f2e1..b030642 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -14,4 +14,5 @@ six
torch==2.1.0
torchvision==0.16.0
protobuf
-peft==0.7.1
\ No newline at end of file
+peft==0.7.1
+tiktoken  # required by the Qwen tokenizer
\ No newline at end of file
diff --git a/sources/images/qwen/qwen_72b_inference.png b/sources/images/qwen/qwen_72b_inference.png
new file mode 100644
index 0000000..c9e62ba
Binary files /dev/null and b/sources/images/qwen/qwen_72b_inference.png differ
diff --git a/tasks/evaluation/eval_impl/mmlu_eval.py b/tasks/evaluation/eval_impl/mmlu_eval.py
index adf8192..c27bd17 100644
--- a/tasks/evaluation/eval_impl/mmlu_eval.py
+++ b/tasks/evaluation/eval_impl/mmlu_eval.py
@@ -68,7 +68,7 @@ class MmluEval(DatasetEval):
chat_results, rank = chat.chat(instruction=instructions, history=[])
if chat_results:
for index, chat_result in enumerate(chat_results):
- answer = chat_result[0]
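+ # the generated answer may start with whitespace; strip it so the option letter is compared exactly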
+ answer = chat_result[0].lstrip()
try:
if rank == 0:
logger.info(instruction)
diff --git a/tasks/evaluation/evaluate_qwen_72b_ptd.sh b/tasks/evaluation/evaluate_qwen_72b_ptd.sh
new file mode 100644
index 0000000..d0a3128
--- /dev/null
+++ b/tasks/evaluation/evaluate_qwen_72b_ptd.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+
+# The number of parameters is not aligned
+export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
+export HCCL_CONNECT_TIMEOUT=1200
+export COMBINED_ENABLE=1
+export CUDA_DEVICE_MAX_CONNECTIONS=1
+
+# Change for multinode config
+MASTER_ADDR=localhost
+MASTER_PORT=6001
+NNODES=1
+NODE_RANK=0
+NPUS_PER_NODE=8
+
+WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
+
+DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
+
+CHECKPOINT="your model directory path"
+TOKENIZER_PATH="your tokenizer directory path"
+DATA_PATH="./mmlu/data/test"
+TASK="mmlu"
+
+# Different tasks need different max_new_tokens values; please follow the instructions in the README.
+torchrun $DISTRIBUTED_ARGS ./tasks/evaluation/evaluation_llama.py \
+ --task-data-path $DATA_PATH \
+ --task $TASK \
+ --seq-length 8192 \
+ --max-new-tokens 1 \
+ --max-position-embeddings 32768 \
+ --tensor-model-parallel-size 8 \
+ --pipeline-model-parallel-size 1 \
+ --num-layers 80 \
+ --hidden-size 8192 \
+ --ffn-hidden-size 24576 \
+ --num-attention-heads 64 \
+ --disable-bias-linear \
+ --swiglu \
+ --position-embedding-type rope \
+ --load ${CHECKPOINT} \
+ --normalization RMSNorm \
+ --tokenizer-type PretrainedFromHF \
+ --tokenizer-name-or-path ${TOKENIZER_PATH} \
+ --tokenizer-not-use-fast \
+ --bf16 \
+ --micro-batch-size 1 \
+ --exit-on-missing-checkpoint \
+ --no-load-rng \
+ --no-load-optim \
+ --untie-embeddings-and-output-weights \
+ --add-qkv-bias \
+ --tokenizer-kwargs 'eos_token' '<|endoftext|>' 'pad_token' '<|extra_0|>' \
+ --make-vocab-size-divisible-by 64 \
+ --seed 42 | tee ./eval_qwen_72b_${TASK}.log
diff --git a/tasks/inference/generate_qwen_72b_ptd.sh b/tasks/inference/generate_qwen_72b_ptd.sh
new file mode 100644
index 0000000..8a7f0a2
--- /dev/null
+++ b/tasks/inference/generate_qwen_72b_ptd.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+
+# The number of parameters is not aligned
+export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
+export HCCL_CONNECT_TIMEOUT=1200
+export COMBINED_ENABLE=1
+export CUDA_DEVICE_MAX_CONNECTIONS=1
+
+# please fill these path configurations
+CHECKPOINT="your model directory path"
+TOKENIZER_PATH="your tokenizer path"
+
+# Change for multinode config
+MASTER_ADDR=localhost
+MASTER_PORT=6001
+NNODES=1
+NODE_RANK=0
+NPUS_PER_NODE=8
+WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
+
+DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
+
+torchrun $DISTRIBUTED_ARGS inference_qwen.py \
+ --tensor-model-parallel-size 8 \
+ --pipeline-model-parallel-size 1 \
+ --num-layers 80 \
+ --hidden-size 8192 \
+ --num-attention-heads 64 \
+ --ffn-hidden-size 24576 \
+ --max-position-embeddings 32768 \
+ --seq-length 8192 \
+ --make-vocab-size-divisible-by 64 \
+ --untie-embeddings-and-output-weights \
+ --micro-batch-size 1 \
+ --swiglu \
+ --disable-bias-linear \
+ --tokenizer-type PretrainedFromHF \
+ --tokenizer-name-or-path ${TOKENIZER_PATH} \
+ --load ${CHECKPOINT} \
+ --normalization RMSNorm \
+ --position-embedding-type rope \
+ --norm-epsilon 1e-6 \
+ --hidden-dropout 0 \
+ --attention-dropout 0 \
+ --tokenizer-not-use-fast \
+ --add-qkv-bias \
+ --rotary-base 1000000 \
+ --tokenizer-kwargs 'eos_token' '<|endoftext|>' 'pad_token' '<|extra_0|>' \
+ --max-new-tokens 256 \
+ --seed 42 \
+ --bf16