Compare commits


19 Commits
master ... 1.0

Author SHA1 Message Date
WuSheYu03 65e60561fe
Test PR 2024
Signed-off-by: WuSheYu03 <2893251844@qq.com>
2024-07-04 12:13:39 +00:00
SheYuWu03 3c86007ab1 Test PR 2024-07-03 19:38:11 +08:00
fengliangjun 27485bba55 !1376 Rename the branch from 1.0.0 to 1.0
Merge pull request !1376 from fengliangjun/1.0.0
2024-06-24 13:29:15 +00:00
glhyy c9fa42a3b3 !1362 Add data-cache detection and generation on non-master nodes for pre-training without shared storage
Merge pull request !1362 from glhyy/1.0.0
2024-06-24 07:11:13 +00:00
sunjunjie 8f4f1079c9 !1272 Rename AscendSpeed to MindSpeed
Merge pull request !1272 from sunjunjie/1.0.0
2024-06-03 02:13:09 +00:00
fengliangjun 5e4267a1e6 !1301 Update the Mixtral-MoE model to 32K
Merge pull request !1301 from fengliangjun/1.0.0
2024-05-23 01:09:13 +00:00
guoxinjie 9dba65cfcb !1277 Correct performance numbers
Merge pull request !1277 from guoxinjie/1.0.0
2024-05-13 03:43:01 +00:00
wucong 331ce89ed1 !1266 Tag already-verified models with the 【Model contributed by Ascend】 label
Merge pull request !1266 from wucong/dev1v1
2024-05-07 02:21:14 +00:00
黄宇豪 0b4fb8b645 !1245 fix: correct the evaluation script path in the 1.0.0 llama2-7B README
Merge pull request !1245 from 黄宇豪/1.0.0
2024-05-07 01:44:35 +00:00
liuyanghan 4ad901b67d !1250 Update repository compatibility info, update the README, and adjust the llama2-70B learning rate
Merge pull request !1250 from liuyanghan/1.0.0
2024-04-29 01:34:46 +00:00
glhyy f33da62b90 !1234 Update known issues in the README (1.0.0)
Merge pull request !1234 from glhyy/1.0.0
2024-04-16 02:22:42 +00:00
guoxinjie 4338fa3467 !1226 Remove megatron from the Q1 branch
Merge pull request !1226 from guoxinjie/24Q1
2024-04-15 02:07:25 +00:00
guoxinjie 2cdee76040 !1221 Pin the acceleration library commit ID on the Q1 branch
Merge pull request !1221 from guoxinjie/24Q1
2024-04-07 06:47:28 +00:00
黄宇豪 3a544f320b !1216 Unify weight paths and README style Merge pull request !1186 from 黄宇豪/master
Merge pull request !1216 from 黄宇豪/24Q1
2024-04-02 07:33:15 +00:00
glhyy 8fe29aba96 !1208 Fix the baichuan2 links in the README
Merge pull request !1208 from glhyy/24Q1
2024-04-01 08:42:58 +00:00
guoxinjie da5a204b70 !1203 Fix a bug in saving weights with the distributed optimizer
Merge pull request !1203 from guoxinjie/master
2024-04-01 02:26:10 +00:00
xiongliangcheng 56a707dd3c !1156 Fix errors in the README
Merge pull request !1156 from xiongliangcheng/master
2024-03-28 01:15:04 +00:00
liuyanghan a68fac8224 !1179 Document the data-loading issue in multi-node training
Merge pull request !1179 from liuyanghan/24Q1
2024-03-28 01:06:53 +00:00
shengjy ebe2f95e27 !1165 Update the llama2 README and correct the tokenizer notes
* llama2 readme update
2024-03-27 08:25:18 +00:00
420 changed files with 8685 additions and 30050 deletions

.gitignore

@ -142,12 +142,3 @@ cython_debug/
# pycharm stuff
.idea
# megatron core
/megatron/
# User stuff
/kernel*/
/logs/
/model_from_hf/
/model_weights/

OWNERS

@ -1,23 +1,44 @@
approvers:
- fengliangjun66
- wujianping996
- gp513
- dilililiwhy
reviewers:
- wujianping996
- fengliangjun66
- guhangsong
- leizhenzhen
- liuyanghan
- jzh
- guo-xinjie-1
- wucong0509
- dhrhank
- glhyy
- RyanAlexander
- sunjunjie1587
- dingzicha
- zhangjianxiang4
- guozhihua9814
- ckirchhoff
- block0219
- wqh17101
- timeash
reviewers:
- gp513
- guhangsong
- sanjaychan
- fengliangjun66
- mhh001
- huangyiming123
- zhangshengdong
- kingsleyandher
- guo-xinjie-1
- matrixssy
- ckirchhoff
- block0219
- wqh17101
- ZeroFadeAway
- Ronnie_zheng
- sibylk
- dl9864
- demo-v
- xulianhao
- yangyifei626
- z2fan
- wenjiang2357
- leizhenzhen
- liuyanghan
- Ares_Lzk
- flying-artillery
- xiong-liangcheng_admin
- gitee-code-template
- yaojia2021
- chantcalf
- kongfuziyue
- yuhui69
- guoyiwei111

README.md

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -1,70 +0,0 @@
# Security Statement
## System security hardening
1. Users are advised to enable ASLR level 2, also known as **full address space layout randomization**, on the system. It can be configured as follows:
```
echo 2 > /proc/sys/kernel/randomize_va_space
```
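To confirm that the setting took effect and to keep it across reboots, a minimal sketch (writing to /etc/sysctl.conf is the conventional approach and may differ on your distribution):
```
# check the current ASLR level (2 means full randomization)
cat /proc/sys/kernel/randomize_va_space
# persist the setting across reboots
echo "kernel.randomize_va_space = 2" >> /etc/sysctl.conf
sysctl -p
```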
## Recommended run user
For security and least-privilege reasons, running ModelLink under root or other administrator accounts is not recommended.
## File permission control
1. Set the system umask to 0027 or stricter on the host (both the physical host and containers), so that newly created directories default to at most 750 and newly created files to at most 640.
2. Apply permission control to sensitive content such as personal data, business assets, source files, and files saved during training. Typical scenarios include the ModelLink installation directory and datasets shared by multiple users; recommended permissions are listed in Table 1.
3. ModelLink generates training data during preprocessing and weight files during training. These files default to permission 640; users can apply stricter controls to generated files as needed.
**Table 1 Recommended maximum permissions for files and directories by scenario**
| Type | Recommended maximum Linux permission |
| --------------- | --------------------|
| User home directory | 750 (rwxr-x---) |
| Program files (including scripts, libraries, etc.) | 550 (r-xr-x---) |
| Program file directories | 550 (r-xr-x---) |
| Configuration files | 640 (rw-r-----) |
| Configuration file directories | 750 (rwxr-x---) |
| Log files (finished or archived) | 440 (r--r-----) |
| Log files (actively written) | 640 (rw-r-----) |
| Log file directories | 750 (rwxr-x---) |
| Debug files | 640 (rw-r-----) |
| Debug file directories | 750 (rwxr-x---) |
| Temporary file directories | 750 (rwxr-x---) |
| Maintenance and upgrade file directories | 770 (rwxrwx---) |
| Service data files | 640 (rw-r-----) |
| Service data file directories | 750 (rwxr-x---) |
| Directories for key components, private keys, certificates, and ciphertext files | 700 (rwx------) |
| Key components, private keys, certificates, and encrypted ciphertext | 600 (rw-------) |
| Encryption/decryption interfaces and scripts | 500 (r-x------) |
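As a minimal illustration of the umask recommendation in item 1 above (values taken from that item, not from any ModelLink-specific requirement):
```
# a umask of 0027 caps new directories at 750 and new files at 640
umask 0027
mkdir demo_dir && touch demo_dir/demo_file
ls -ld demo_dir           # drwxr-x---
ls -l  demo_dir/demo_file # -rw-r-----
```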
## Data security statement
1. ModelLink saves model files through the checkpointing module in megatron; some of these files use the risky pickle module and may therefore pose a data risk.
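The pickle dependency itself cannot be removed without changing the checkpoint format. As a hedged mitigation sketch (the paths and file pattern are illustrative, not part of ModelLink), checkpoint permissions can be tightened and checksums recorded so that unexpected modification of the pickled files is detected before they are loaded again:
```
# adjust the path to your actual checkpoint save directory
chmod -R o-rwx ./ckpt/
find ./ckpt/ -type f -name "*.pt" -exec sha256sum {} \; > ckpt.sha256
# later, before loading the checkpoints again
sha256sum -c ckpt.sha256
```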
## Runtime security statement
1. Write training scripts to match the available resources. If a script does not match the resources (for example, the dataset to be loaded exceeds the available memory, or data generated locally during training exceeds the available disk space), errors may occur and the process may exit unexpectedly. A quick resource check is sketched after this list.
2. ModelLink uses PyTorch internally; a version mismatch can cause runtime errors. For details, see the PyTorch [security statement](https://gitee.com/ascend/pytorch#%E5%AE%89%E5%85%A8%E5%A3%B0%E6%98%8E).
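A simple pre-launch check of the resources mentioned in item 1, using generic Linux commands rather than any ModelLink interface (the npu-smi tool is assumed to ship with the Ascend driver):
```
free -g        # host memory available for dataset loading
df -h .        # free disk space where training output is written
npu-smi info   # device status, if the Ascend driver tools are installed
```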
## Public network address statement
| Type | Open-source code address | File name | Public IP address / URL / domain / email address | Purpose |
|--------|--------------------------|--------------------------------------|--------------------------------------------------------------------------|-----------|
| Open-source code import | Not involved | modellink/model/language_model.py:85 | https://github.com/kingoflolz/mesh-transformer-jax/ | Details link |
| Open-source code import | Involved | tests/pipeline/common.py:6 | https://github.com/microsoft/DeepSpeed/blob/master/tests/unit/common.py | Source code link |
| Open-source code import | Involved | tests/pipeline/conftest.py:6 | https://github.com/microsoft/DeepSpeed/blob/master/tests/conftest.py | Source code link |
## Public interface statement
ModelLink has not yet released a wheel package and has no formal public API; all functionality is invoked through shell scripts. The five entry scripts are [pretrain_gpt.py](https://gitee.com/ascend/ModelLink/blob/master/pretrain_gpt.py), [inference.py](https://gitee.com/ascend/ModelLink/blob/master/inference.py), [evaluation.py](https://gitee.com/ascend/ModelLink/blob/master/evaluation.py), [preprocess_data.py](https://gitee.com/ascend/ModelLink/blob/master/tools/preprocess_data.py), and [convert_ckpt.py](https://gitee.com/ascend/ModelLink/blob/master/tools/checkpoint/convert_ckpt.py).
## Communication security hardening
[Communication security hardening notes](https://gitee.com/ascend/pytorch/blob/master/SECURITYNOTE.md#%E9%80%9A%E4%BF%A1%E5%AE%89%E5%85%A8%E5%8A%A0%E5%9B%BA)
## Communication matrix
[Communication matrix description](https://gitee.com/ascend/pytorch/blob/master/SECURITYNOTE.md#%E9%80%9A%E4%BF%A1%E7%9F%A9%E9%98%B5%E4%BF%A1%E6%81%AF)


@ -1,76 +1,55 @@
import os
import stat
import sys
import unittest
from pathlib import Path
import xmlrunner
# =============================
# ST test, run with shell
# =============================
def success_check(res):
if res != 0:
sys.exit(1)
class UT_Test:
def __init__(self):
base_dir = Path(__file__).absolute().parent.parent
test_dir = os.path.join(base_dir, 'tests')
self.ut_file = os.path.join(test_dir, "ut")
def run_ut(self):
command = f"python3.8 -m pytest -k 'not allocator' {self.ut_file}"
ut_exitcode = os.system(command)
if ut_exitcode == 0:
print("UT test success")
else:
print("UT failed")
exit(1)
def success_check_ut(res):
if len(res.failures) + len(res.errors) != 0:
sys.exit(1)
class ST_Test:
def __init__(self):
base_dir = Path(__file__).absolute().parent.parent
test_dir = os.path.join(base_dir, 'tests')
BASE_DIR = Path(__file__).absolute().parent.parent
TEST_DIR = os.path.join(BASE_DIR, 'tests')
st_dir = "st"
llama_pretrain_shell_file = os.path.join(
test_dir, st_dir, "test_llama_pretrain_ptd.sh")
llama_inference_shell_file = os.path.join(
test_dir, st_dir, "test_llama_inference_ptd.sh")
gemma_pretrain_shell_file = os.path.join(
test_dir, st_dir, "test_gemma_pretrain_ptd.sh")
gemma_inference_shell_file = os.path.join(
test_dir, st_dir, "test_gemma_inference_ptd.sh")
llama_vpp_pretrain_shell_file = os.path.join(
test_dir, st_dir, "test_llama_vpp_pretrain_ptd.sh")
llama_instruction_shell_file = os.path.join(
test_dir, st_dir, "test_llama_instruction_ptd.sh")
llama_dir = "test_llama"
bloom_dir = "test_bloom"
self.st_file_list = [
llama_pretrain_shell_file,
llama_inference_shell_file,
gemma_pretrain_shell_file,
gemma_inference_shell_file,
llama_vpp_pretrain_shell_file,
llama_instruction_shell_file
bloom_shell_file = os.path.join(
TEST_DIR, st_dir, bloom_dir, "test_bloom_ptd.sh")
llama_shell_file = os.path.join(
TEST_DIR, st_dir, llama_dir, "test_llama_ptd.sh")
lora_shell_file = os.path.join(
TEST_DIR, st_dir, llama_dir, "test_lora_llama_ptd.sh")
llama_inference_shell_file = os.path.join(
TEST_DIR, st_dir, llama_dir, "test_llama_inference_ptd.sh")
# restore these entries once the new ST tests are in place
self.shell_file_list = [
# llama_inference_shell_file,
# llama_shell_file,
# bloom_shell_file,
# lora_shell_file,
]
def run_st(self):
all_success = True
for shell_file in self.st_file_list:
command = f"sh {shell_file}"
st_exitcode = os.system(command)
if st_exitcode != 0:
all_success = False
print(f"ST run {shell_file} failed")
break
if all_success:
print("ST test success")
else:
print("ST failed")
exit(1)
def run_shell(self):
for shell_file in self.shell_file_list:
success_check(os.system("sh {}".format(shell_file)))
# ===============================================
@ -79,7 +58,9 @@ class ST_Test:
if __name__ == "__main__":
ut = UT_Test()
ut.run_ut()
st = ST_Test()
st.run_st()
st_test = ST_Test()
st_test.run_shell()
test_loader = unittest.TestLoader()
discover = test_loader.discover(start_dir="../tests/ut", pattern="test*.py")
runner = unittest.TextTestRunner()
success_check_ut(runner.run(discover))


@ -1,4 +1,4 @@
# Aquila-7B
# Aquila-7B $\color{black}{\bf\tiny{【Model contributed by Ascend】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="https://gitee.com/ascend/ModelLink/blob/modellink/examples/bloom/README_en.md">English</a> </b>
@ -24,146 +24,148 @@ The hardware configuration for training Aquila-7B is as follows:
### Script
1. Clone the repository to your local server and switch to the modellink branch:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Set up the conda environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 通过互联网上提供的pip源安装 torch可能需要尝试合适的包含这个torch==2.1.0版本的pip源
pip install torch==2.1.0
# 通过PTA上提供的安装包以whl文件方式安装aarch64架构上的2.1.0版本的torch_npu
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
# 通过PTA上提供的安装包以whl文件方式安装apex
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 拉取MindSpeed源代码进入MindSpeed目录然后源码方式安装mindspeed加速包
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖包
pip install -r requirements.txt
```
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 通过互联网上提供的pip源安装 torch可能需要尝试合适的包含这个torch==2.1.0版本的pip源
pip install torch==2.1.0
# 通过PTA上提供的安装包以whl文件方式安装aarch64架构上的2.1.0版本的torch_npu
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
# 通过PTA上提供的安装包以whl文件方式安装apex
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 拉取MindSpeed源代码进入MindSpeed目录然后源码方式安装mindspeed加速包
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖包
pip install -r requirements.txt
```
3. Download the Aquila-7B [config, tokenizer, and pre-trained weights](https://huggingface.co/BAAI/Aquila-7B/tree/main) with a browser
and save them to the ModelLink/model_from_hf/Aquila-7B/ directory.
and save them to the ModelLink/model_from_hf/Aquila-7B/ directory.
4. Data preprocessing
Step 1: [download the dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet) with a browser and save it to the ModelLink/dataset/ directory
Step 1: [download the dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet) with a browser and save it to the ModelLink/dataset/ directory
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
Step 2: preprocess the dataset with the tokenizer specified for Aquila-7B
Step 2: preprocess the dataset with the tokenizer specified for Aquila-7B
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/Aquila-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
--output-prefix ./dataset/Aquila-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/Aquila-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
--output-prefix ./dataset/Aquila-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5. Weight conversion
Convert the model weights from HuggingFace format to Megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
Convert the model weights from HuggingFace format to Megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila-7B/ \
--save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila-7B/ \
--save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
```
Convert Megatron weights with any parallel slicing strategy to HuggingFace format
***(This scenario is generally used to convert a trained Megatron model back to HuggingFace format)***
Convert Megatron weights with any parallel slicing strategy to HuggingFace format
***(This scenario is generally used to convert a trained Megatron model back to HuggingFace format)***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Aquila-7B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Aquila-7B/mg2hg/
```
6. Configure the Aquila-7B pre-training script
Configure the relevant parameters in the pre-training script
Configure the relevant parameters in the pre-training script
```shell
# 根据实际情况配置词表、数据集、模型参数保存路径
TOKENIZER_PATH="./model_from_hf/Aquila-7B/" #tokenizer 路径
DATA_PATH="./dataset/Aquila-7B/alpaca_text_document" #数据集 路径
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
CKPT_SAVE_DIR="./ckpt/Aquila-7B/"
# 如果不需要保存权重就不需要设置CKPT_SAVE_DIR, 并且启动脚本里应不使用 `--save` 参数
# 如果需要保存权重则需要设置CKPT_SAVE_DIR, 并且启动脚本里应使用 `--save $CKPT_SAVE_DIR` 进行类似配置。
# 如果不需要加载权重就不需要设置CKPT_LOAD_DIR, 并且启动脚本里应不使用 `--load` 参数
# 如果需要加载权重则需要设置CKPT_LOAD_DIR, 并且启动脚本里应使用 `--load $CKPT_LOAD_DIR` 进行类似配置。
# 进行断点续训时应先按以上save的场景配置待完成ckpt保存后再修改相应参数按以上load的场景加载已保存的ckpt。
```
```shell
# 根据实际情况配置词表、数据集、模型参数保存路径
TOKENIZER_PATH="./model_from_hf/Aquila-7B/" #tokenizer 路径
DATA_PATH="./dataset/Aquila-7B/alpaca_text_document" #数据集 路径
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
CKPT_SAVE_DIR="./ckpt/Aquila-7B/"
# 如果不需要保存权重就不需要设置CKPT_SAVE_DIR, 并且启动脚本里应不使用 `--save` 参数
# 如果需要保存权重则需要设置CKPT_SAVE_DIR, 并且启动脚本里应使用 `--save $CKPT_SAVE_DIR` 进行类似配置。
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
# 如果不需要加载权重就不需要设置CKPT_LOAD_DIR, 并且启动脚本里应不使用 `--load` 参数
# 如果需要加载权重则需要设置CKPT_LOAD_DIR, 并且启动脚本里应使用 `--load $CKPT_LOAD_DIR` 进行类似配置。
# 进行断点续训时应先按以上save的场景配置待完成ckpt保存后再修改相应参数按以上load的场景加载已保存的ckpt。
```
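For orientation, the variables above feed the launch command inside `examples/aquila/pretrain_aquila_7b_ptd.sh` roughly as sketched below; `GPT_ARGS`, `DATA_ARGS`, and `OUTPUT_ARGS` are assumed placeholder names for the argument groups defined in that script, not guaranteed to match it verbatim.
```shell
# sketch of the expected launch pattern, not the verbatim script
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --load $CKPT_LOAD_DIR \
    --save $CKPT_SAVE_DIR \
    | tee logs/train_aquila_7b_ptd.log
```
Remove the `--load` or `--save` line, as described in the comments above, when the corresponding directory is not set.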
7. Launch the Aquila-7B pre-training script
Before running the pre-training script, execute the set_env.sh script to set the environment variables; alternatively, this can be done inside the pre-training script.
Before running the pre-training script, execute the set_env.sh script to set the environment variables; alternatively, this can be done inside the pre-training script.
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
Launch Aquila-7B pre-training as follows:
Launch Aquila-7B pre-training as follows:
```shell
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
```shell
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
**Note**: For multi-machine training without shared storage, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, non-master nodes decide from the distributed arguments whether they need to load data and check or generate the corresponding cache.
### Performance
@ -171,10 +173,10 @@ The hardware configuration for training Aquila-7B is as follows:
Performance comparison of Aquila-7B on **Ascend chips** and the **reference chips**:
| Device | Model | Iterations | Token throughput (tokens/p/s) | Single-step iteration time (s/step) |
|------|------------|------|------------------------|----------------------|
| NPU | Aquila-7B | 1000 | 2849 | 5.75 |
| Reference | Aquila-7B | 1000 | 2874 | 5.70 |
| Device | Hardware | Model | Iterations | Token throughput (tokens/p/s) | Single-step iteration time (s/step) |
|------|---------------|------------|------|------------------------|----------------------|
| NPU | 910b 1node*8p | Aquila-7B | 1000 | 2849 | 5.75 |
| Reference | | Aquila-7B | 1000 | 2874 | 5.70 |
@ -184,7 +186,7 @@ Performance comparison of Aquila-7B on **Ascend chips** and the **reference chips**:
Inference differs from pre-training in that the pre-trained weights must be loaded. Note that the model structure parameters used when converting the weights must match those used when running the task.
After the weight conversion is complete, configure the Aquila-7B inference script `examples/aquila/generate_aquila_7b_ptd.sh`, specifying the correct weight path, vocabulary path, and so on (the sample below is for reference only)
After the weight conversion is complete, configure the Aquila-7B inference script `tasks/inference/generate_aquila_7b_ptd.sh`, specifying the correct weight path, vocabulary path, and so on (the sample below is for reference only)
```shell
# 请按实际情况修改模型权重路径和分词器路径
@ -195,14 +197,14 @@ TOKENIZER_PATH="./model_from_hf/Aquila-7B/"
Launch Aquila-7B inference:
```shell
bash examples/aquila/generate_aquila_7b_ptd.sh
bash ./tasks/inference/generate_aquila_7b_ptd.sh
```
Some inference samples are shown below:
Some inference samples are shown below:
Aquila-7B:
![aquila-7B_generate.png](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila/aquila_7B_generate_ptd_0205.png)
![aquila-7B_generate.png](../../sources/images/aquila/aquila_7B_generate_ptd_0205.png)
## Evaluation
@ -210,7 +212,7 @@ Aquila-7B:
Evaluation, like inference, also requires loading the converted weights. Note that the model structure parameters used when converting the weights must match those used when running the evaluation task.
After the weight conversion is complete, configure the Aquila-7B evaluation script `examples/aquila/evaluate_aquila_7b_ptd.sh`, specifying the correct weight path, vocabulary path, evaluation data path, and evaluation task name (the sample below is for reference only)
After the weight conversion is complete, configure the Aquila-7B evaluation script `tasks/evaluation/evaluate_aquila_7b_ptd.sh`, specifying the correct weight path, vocabulary path, evaluation data path, and evaluation task name (the sample below is for reference only)
```shell
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
@ -222,7 +224,7 @@ TASK="boolq"
Launch Aquila-7B evaluation:
```shell
bash examples/aquila/evaluate_aquila_7b_ptd.sh
bash tasks/evaluation/evaluate_aquila_7b_ptd.sh
```
Aquila-7B evaluation results on **Ascend NPU**:


@ -1,4 +1,4 @@
# Aquila-7B
# Aquila-7B $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
@ -26,140 +26,141 @@ Here's a hardware summary of pre-training Aquila-7B:
1. Clone the repository to your local server and switch to modellink branch:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build conda environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch, torch_npu and apex
pip install torch==2.1.0
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch, torch_npu and apex
pip install torch==2.1.0
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# source the set_env.sh file based on your host settings(you may need to change the path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# use git to clone the MindSpeed source code, enter the directory, then install mindspeed package by source code
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# source the set_env.sh file based on your host settings(you may need to change the path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# git clone the MindSpeed source code, enter the directory, then install mindspeed package by source code
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Download the Aquila-7B model, config, and tokenizer from [here](https://huggingface.co/BAAI/Aquila-7B/tree/main)
save to ModelLink/model_from_hf/Aquila7B/ directory.
save to ModelLink/HF_Aquila7B_downloaded/ directory.
4. Prepare dataset.
Prepare dataset.
step1: Download the datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet), save to ModelLink/dataset/ directory.
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
step2: use Aquila-7B specified tokenizer to pre-process data:
```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/Aquila-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
--output-prefix ./dataset/Aquila-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
step1: Download the datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet), save to ModelLink/dataset/ directory.
4. Weights convert
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
step2: use Aquila-7B specified tokenizer to pre-process data:
```shell
# please modify the path to set_env.sh based on your environment.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila-7B/ \
--save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
```
```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/Aquila-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
--output-prefix ./dataset/Aquila-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
5. Weights convert
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Aquila-7B/mg2hg/
```
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
5. Config Aquila-7B pre-training script.
```shell
# please modify the path to set_env.sh based on your environment.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
Config the environment variables in aquila pretrain script
python tools/checkpoint/util.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila-7B/ \
--save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
```
```shell
# set dataset path, CKPT load path for loading weights, and the tokenizer path
TOKENIZER_PATH="./model_from_hf/Aquila-7B/" #tokenizer path
DATA_PATH="./dataset/Aquila-7B/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/" # pointing to the converted model weights
CKPT_SAVE_DIR="./ckpt/Aquila-7B/" # pointing to the path to save checkpoints
```
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
*Note that if you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR, and remove the `--load` parameter from the training script, and vice versa*
*If you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR, and remove the `--save $CKPT_SAVE_DIR` parameter from the training script, and vice versa*
*When you want to save checkpoint and load it in future pre-training, just follow the above "save" and "load" suggestions.*
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Aquila-7B/mg2hg/
```
6. Launch Aquila-7B pre-training script.
6. Config Aquila-7B pre-training script.
Before running the pre-training script, please execute the set_env.sh script first to setup environment variables. Alternatively, you can do this inside aquila pre-training script.
Config the environment variables in aquila pretrain script
```shell
# you may need to change the path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
```shell
# set dataset path, CKPT load path for loading weights, and the tokenizer path
TOKENIZER_PATH="./model_from_hf/Aquila-7B/" #tokenizer path
DATA_PATH="./dataset/Aquila-7B/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/" # pointing to the converted model weights
CKPT_SAVE_DIR="./ckpt/Aquila-7B/" # pointing to the path to save checkpoints
```
Start pre-training Aquila-7B model:
*Note that if you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR, and remove the `--load` parameter from the training script, and vice versa*
*If you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR, and remove the `--save $CKPT_SAVE_DIR` parameter from the training script, and vice versa*
*When you want to save checkpoint and load it in future pre-training, just follow the above "save" and "load" suggestions.*
```shell
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
7. Launch Aquila-7B pre-training script.
**Note**: If you use multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With this parameter set, non-master nodes decide from the distributed arguments whether they need to load data and check or generate the corresponding cache.
Before running the pre-training script, please execute the set_env.sh script first to setup environment variables. Alternatively, you can do this inside aquila pre-training script.
```shell
# you may need to change the path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
Start pre-training Aquila-7B model:
```shell
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```
**Note**: If you use multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With this parameter set, non-master nodes decide from the distributed arguments whether they need to load data and check or generate the corresponding cache.
### Performance
@ -167,16 +168,16 @@ Here's a hardware summary of pre-training Aquila-7B:
The performance of Aquila-7B in Ascend NPU and reference device:
| Device | Model | Iterations | throughput rate (tokens/p/s) | single iteration step time (s/step) |
| --------- | --------- | ---------- | ---------------------------- | ----------------------------------- |
| NPU | Aquila-7B | 1000 | 2849 | 5.75 |
| Reference | Aquila-7B | 1000 | 2874 | 5.70 |
| Device | Hardware | Model | Iterations | throughput rate (tokens/p/s) | single iteration step time (s/step) |
| --------- | ------------- | --------- | ---------- | ---------------------------- | ----------------------------------- |
| NPU | 910b 1node*8p | Aquila-7B | 1000 | 2849 | 5.75 |
| Reference | | Aquila-7B | 1000 | 2874 | 5.70 |
## Inference
We support MindSpeed Inference for text generation with Aquila 7B model.
Inference is different from pre-training because it requires loading the pre-trained model weights. Therefore, we need to complete the aforementioned model weight conversion task first, then configure the Aquila-7B Inference shell script `examples/aquila/generate_aquila_7b_ptd.sh`. "CKPT_LOAD_DIR" must point to the converted weights directory, and "TOKENIZER_PATH" must point to the directory which contains Aquila vocabulary files -- in our example, it is "./model_from_hf/Aquila-7B/". In your operation, please fill in correct value based on your actual scenario.
Inference is different from pre-training because it requires loading the pre-trained model weights. Therefore, we need to complete the aforementioned model weight conversion task first, then configure the Aquila-7B Inference shell script `tasks/inference/generate_aquila_7b_ptd.sh`. "CKPT_LOAD_DIR" must point to the converted weights directory, and "TOKENIZER_PATH" must point to the directory which contains Aquila vocabulary files -- in our example, it is "./HF_Aquila7B_downloaded". In your operation, please fill in correct value based on your actual scenario.
```shell
# please change to actual values
@ -187,12 +188,12 @@ TOKENIZER_PATH="./model_from_hf/Aquila-7B/"
Start Aquila-7B Inference:
```shell
bash ./examples/aquila/generate_aquila_7b_ptd.sh
bash ./tasks/inference/generate_aquila_7b_ptd.sh
```
Sample results of Aquila-7B Inference:
![aquila-7B_generate.png](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila/aquila_7B_generate.png)
![aquila-7B_generate.png](../../sources/images/aquila/aquila_7B_generate.png)
## Evaluation with Benchmark
@ -200,7 +201,7 @@ We use BoolQ benchmark to evaluate our model. You can [go to the BoolQ Benchmark
The evaluation task is similar to the inference task; it also requires loading the pre-trained model weights. Please note that the model structure parameters used in converting weights should be consistent with those used in running the evaluation task.
After weight conversion is complete, we configure the Aquila-7B evaluation script `examples/aquila/evaluate_aquila_7b_ptd.sh`. We need to correctly specify the path to load weights, the path to tokenizer and vocab, and so on (the following example is for reference only)
After weight conversion is complete, we configure the Aquila-7B evaluation script `tasks/evaluation/evaluate_aquila_7b_ptd.sh`. We need to correctly specify the path to load weights, the path to tokenizer and vocab, and so on (the following example is for reference only)
```shell
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
@ -212,7 +213,7 @@ TASK="boolq"
Start evaluation task
```shell
bash ./examples/aquila/evaluate_aquila_7b_ptd.sh
bash ./tasks/evaluation/evaluate_aquila_7b_ptd.sh
```
Sample Aquila-7B performance running in **Ascend NPU**:


@ -3,6 +3,7 @@
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
CKPT_SAVE_DIR="your checkpoint save dir"
DATA_PATH="your training data dir"
@ -86,6 +87,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--load $CKPT_LOAD_DIR \
| tee logs/train_aquila_7b_ptd.log


@ -1,524 +0,0 @@
# Aquila2 $\color{black}{\bf\tiny{【Model contributed by Community】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
</p>
- [Aquila2-7B](#7b)
- [训练](#7b-training)
- [脚本](#7b-script)
- [性能](#7b-performance)
- [吞吐](#7b-throughput)
- [推理](#7b-inference)
- [评估](#7b-evaluation)
- [Aquila2-34B](#34b)
- [训练](#34b-training)
- [脚本](#34b-script)
- [性能](#34b-performance)
- [吞吐](#34b-throughput)
- [推理](#34b-inference)
- [评估](#34b-evaluation)
<h1 id="7b">Aquila2-7B</h1>
<h2 id="7b-training">训练</h2>
Aquila2-7B 训练的硬件配置如下:
| 硬件 | 配置 |
|:---:|:---------------:|
| NPU | 8 x Ascend NPUs |
<h3 id="7b-script">脚本</h3>
1. 克隆仓库到本地服务器并切换到modellink分支
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建conda环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 通过 PTA 包提供的 whl 安装 torch、torch_npu 和 apex例如
pip install torch-2.2.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖包
pip install -r requirements.txt
```
3. 使用浏览器下载 [Aquila2-7B模型的配置tokenizer和预训练权重](https://huggingface.co/BAAI/Aquila2-7B/tree/main)
保存在 ModelLink/model_from_hf/Aquila2-7B/ 目录。
4. 权重转换
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila2-7B/ \
--save-dir ./model_weights/Aquila2-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--tokenizer-model ./model_from_hf/Aquila2-7B/tokenizer.json
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila2-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Aquila2-7B/mg2hg/
```
权重转换适用于预训练、微调、推理和评估,根据任务不同调整参数 `target-tensor-parallel-size``target-pipeline-parallel-size`
5. 预训练
5.1 准备数据集
下载 Aquila2-7B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Aquila2-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-7B/ \
--output-prefix ./dataset/Aquila2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 预训练
配置 Aquila2-7B 训练脚本: examples/aquila2/pretrain_aquila2_7b_ptd.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
TOKENIZER_PATH="./model_from_hf/Aquila2-7B/" #tokenizer 路径
DATA_PATH="./dataset/Aquila2-7B/alpaca_text_document" #数据集 路径
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
CKPT_SAVE_DIR="./ckpt/Aquila2-7B/"
```
- 如果不需要加载权重就不需要设置CKPT_LOAD_DIR, 并且启动脚本里应不使用 `--load` 参数
- 如果不需要保存权重就不需要设置CKPT_SAVE_DIR, 并且启动脚本里应不使用 `--save` 参数
- 进行断点续训时应先按以上save的场景配置待完成ckpt保存后再修改相应参数按以上load的场景加载已保存的ckpt。
启动 Aquila2-7B 预训练脚本: examples/aquila2/pretrain_aquila2_7b_ptd.sh
```shell
bash examples/aquila2/pretrain_aquila2_7b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/Aquila2-7B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-7B/ \
--output-prefix ./finetune_dataset/Aquila2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本 `pretrain_aquila2_7b_ptd.sh` 一致. *区别是数据集,以及增加训练参数`--is-instruction-dataset`*
增加微调参数`--finetune`,使微调从第一步开始。
```bash
DATA_PATH="./finetune_dataset/Aquila2-7B/alpaca"
CKPT_LOAD_DIR="./ckpt/Aquila2-7B/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
<h3 id="7b-performance">性能</h3>
<h4 id="7b-throughput">吞吐</h4>
Aquila2-7B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数| token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) |
|------|------------|------|------------------------|----------------------|
| NPU | Aquila2-7B | 5000 | 3323 | 4.93 |
| 参考 | Aquila2-7B | 5000 | 2673 | 6.13 |
<h2 id="7b-inference">推理</h2>
我们支持使用 Aquila2-7B进行文本生成的推理。
推理与预训练不同,我们必须加载预训练权重,请注意:在转换权重时使用的模型结构参数,和运行评估任务时使用的模型结构参数,应保持一致。
权重转换完成后我们配置Aquila2-7B推理脚本`examples/aquila2/generate_aquila2_7b_ptd.sh`,需要正确指定加载权重的路径,词表路径等(下面样例仅供参考)
```shell
# 请按实际情况修改模型权重路径和分词器路径
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-7B/"
```
启动Aquila2-7B推理:
```shell
bash examples/aquila2/generate_aquila2_7b_ptd.sh
```
部分推理样例如下:
Aquila2-7B:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila2/aquila2-7b-generate.png)
<h2 id="7b-evaluation">评估</h2>
我们使用 BoolQ benchmark 来评估我们的模型。在[Benchmark下载页面](https://github.com/google-research-datasets/boolean-questions)找到[数据集](https://storage.cloud.google.com/boolq/dev.jsonl)下载后保存。例如保存在ModelLink/boolq/test目录下。
评估与推理类似,也必须加载转换后的权重,请注意:在转换权重时使用的模型结构参数,和运行评估任务时使用的模型结构参数,应保持一致。
权重转换完成后我们配置Aquila2-7B评估脚本 `examples/aquila2/evaluate_aquila2_7b_ptd.sh`,需要正确指定加载权重的路径,词表路径,评估数据的路径,以及评估任务的名字等(下面样例仅供参考)
```shell
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-7B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"
```
启动Aquila2-7B评估
```shell
bash examples/aquila2/evaluate_aquila2_7b_ptd.sh
```
Aquila2-7B在**Ascend NPU**中的评测表现:
| 任务 | 模型 | 昇腾值|社区值|
|------------------------------------------------------------------------|------------|--------|------|
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Aquila2-7B | 77.8% | 77.6% |
<h1 id="34b">Aquila2-34B</h1>
<h2 id="34b-training">训练</h2>
Aquila2-34B 训练的硬件配置如下:
| 硬件 | 配置 |
|:---:|:---------------:|
| NPU | 16 x Ascend NPUs |
<h3 id="34b-script">脚本</h3>
1. 克隆仓库到本地服务器并切换到modellink分支
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建conda环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 通过 PTA 包提供的 whl 安装 torch、torch_npu 和 apex例如
pip install torch-2.2.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖包
pip install -r requirements.txt
```
3. 使用浏览器下载 [Aquila2-34B模型的配置tokenizer和预训练权重](https://huggingface.co/BAAI/Aquila2-34B/tree/main)
保存在 ModelLink/model_from_hf/Aquila2-34B/ 目录。
4. 权重转换
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila2-34B/ \
--save-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp2/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 2 \
--tokenizer-model ./model_from_hf/Aquila2-34B/tokenizer.json \
--params-dtype bf16
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila2-34B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Aquila2-34B/mg2hg/
```
权重转换适用于预训练、微调、推理和评估,根据任务不同调整参数 `target-tensor-parallel-size``target-pipeline-parallel-size`
5. 预训练
5.1 准备数据集
下载 Aquila2-34B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Aquila2-34B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-34B/ \
--output-prefix ./dataset/Aquila2-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 预训练
配置 Aquila2-34B 训练脚本: examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
TOKENIZER_PATH="./model_from_hf/Aquila2-34B/" #tokenizer 路径
DATA_PATH="./dataset/Aquila2-34B/alpaca_text_document" #数据集 路径
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp2/"
CKPT_SAVE_DIR="./ckpt/Aquila2-34B/"
```
- 如果不需要加载权重就不需要设置CKPT_LOAD_DIR, 并且启动脚本里应不使用 `--load` 参数
- 如果不需要保存权重就不需要设置CKPT_SAVE_DIR, 并且启动脚本里应不使用 `--save` 参数
- 进行断点续训时应先按以上save的场景配置待完成ckpt保存后再修改相应参数按以上load的场景加载已保存的ckpt。
启动 Aquila2-34B 预训练脚本: examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```shell
bash examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/Aquila2-34B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-34B/ \
--output-prefix ./finetune_dataset/Aquila2-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本 `pretrain_aquila2_34b_ptd_16p.sh` 一致. *区别是数据集,以及增加训练参数`--is-instruction-dataset`*
增加微调参数`--finetune`,使微调从第一步开始。
```bash
DATA_PATH="./finetune_dataset/Aquila2-34B/alpaca"
CKPT_LOAD_DIR="./ckpt/Aquila2-34B/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
<h3 id="34b-performance">性能</h3>
<h4 id="34b-throughput">吞吐</h4>
Aquila2-34B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数| token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) |
|------|------------|------|------------------------|----------------------|
| NPU | Aquila2-34B | 5000 | 854 | 307 |
| 参考 | Aquila2-34B | 5000 | 732 | 358 |
<h2 id="34b-inference">推理</h2>
我们支持使用 Aquila2-34B进行文本生成的推理。
推理与预训练不同,我们必须加载预训练权重,请注意:在转换权重时使用的模型结构参数,和运行评估任务时使用的模型结构参数,应保持一致。
权重转换完成后我们配置Aquila2-34B推理脚本`examples/aquila2/generate_aquila2_34b_ptd.sh`,需要正确指定加载权重的路径,词表路径等(下面样例仅供参考)
```shell
# 请按实际情况修改模型权重路径和分词器路径
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-34B/"
```
启动Aquila2-34B推理:
```shell
bash examples/aquila2/generate_aquila2_34b_ptd.sh
```
部分推理样例如下:
Aquila2-34B:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila2/aquila2-34b-generate.png)
<h2 id="34b-evaluation">评估</h2>
我们使用 BoolQ benchmark 来评估我们的模型。在[Benchmark下载页面](https://github.com/google-research-datasets/boolean-questions)找到[数据集](https://storage.cloud.google.com/boolq/dev.jsonl)下载后保存。例如保存在ModelLink/boolq/test目录下。
评估与推理类似,也必须加载转换后的权重,请注意:在转换权重时使用的模型结构参数,和运行评估任务时使用的模型结构参数,应保持一致。
权重转换完成后我们配置Aquila2-34B评估脚本 `examples/aquila2/evaluate_aquila2_34b_ptd.sh`,需要正确指定加载权重的路径,词表路径,评估数据的路径,以及评估任务的名字等(下面样例仅供参考)
```shell
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-34B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"
```
启动Aquila2-34B评估
```shell
bash examples/aquila2/evaluate_aquila2_34b_ptd.sh
```
Aquila2-34B在**Ascend NPU**中的评测表现:
| 任务 | 模型 | 昇腾值|社区值|
|------------------------------------------------------------------------|------------|--------|------|
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Aquila2-34B | 88.0% | 87.0% |


@ -1,514 +0,0 @@
# Aquila2 $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Community】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
</p>
- [Aquila2-7B](#7b)
- [Training](#7b-training)
- [Script](#7b-script)
- [Performance](#7b-performance)
- [Machine performance](#7b-throughput)
- [Inference](#7b-inference)
- [Evaluation](#7b-evaluation)
- [Aquila2-34B](#34b)
- [Training](#34b-training)
- [Script](#34b-script)
- [Performance](#34b-performance)
- [Machine performance](#34b-throughput)
- [Inference](#34b-inference)
- [Evaluation](#34b-evaluation)
<h1 id="7b">Aquila2-7B</h1>
<h2 id="7b-training">Training</h2>
Here's a hardware summary of pre-training Aquila2-7B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
<h3 id="7b-script">Script</h3>
1. Clone the repository to your local server and switch to modellink branch:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build conda environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch, torch_npu and apex
pip install torch-2.2.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# source the set_env.sh file based on your host settings(you may need to change the path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# use git to clone the MindSpeed source code, enter the directory, then install mindspeed package by source code
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Download the Aquila2-7B model, config, and tokenizer from [here](https://huggingface.co/BAAI/Aquila2-7B/tree/main)
save to ModelLink/model_from_hf/Aquila2-7B/ directory.
4. Weights convert
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# please modify the path to set_env.sh based on your environment.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila2-7B/ \
--save-dir ./model_weights/Aquila2-7B-v0.1-tp8-pp1/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--tokenizer-model ./model_from_hf/Aquila2-7B/tokenizer.json
```
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila2-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Aquila2-7B/mg2hg/
```
Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters `target-tensor-parallel-size` and `target-pipeline-parallel-size` according to different tasks.
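For example, assuming a later task needs a different slicing layout, only the two target arguments (and the save-directory name, which is purely illustrative here) change relative to the conversion command above:
```shell
# illustrative: re-slice to 4-way tensor parallel, 2-way pipeline parallel
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --load-dir ./model_from_hf/Aquila2-7B/ \
    --save-dir ./model_weights/Aquila2-7B-v0.1-tp4-pp2/ \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 4 \
    --target-pipeline-parallel-size 2 \
    --tokenizer-model ./model_from_hf/Aquila2-7B/tokenizer.json
```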
5. Pre-training
5.1 Prepare dataset
Download the Aquila2-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Aquila2-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-7B/ \
--output-prefix ./dataset/Aquila2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 Pre-training
Config Aquila2-7B pre-training script: examples/aquila2/pretrain_aquila2_7b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Aquila2-7B/"
DATA_PATH="./dataset/Aquila2-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Aquila2-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
```
- *If you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR, and remove the `--load` parameter from the training script, and vice versa*
- *If you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR, and remove the `--save` parameter from the training script, and vice versa*
- *When you want to save checkpoint and load it in future pre-training, just follow the above "save" and "load" suggestions.*
Launch Aquila2-7B pre-training script: examples/aquila2/pretrain_aquila2_7b_ptd.sh
```shell
bash examples/aquila2/pretrain_aquila2_7b_ptd.sh
```
**Note**: If you use multi-machine training, set up data sharing across the machines so that non-primary nodes can read the primary node's data; alternatively, copy the data generated by the master node directly to the non-master nodes.
6. Fine-tuning
6.1 Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/Aquila2-7B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-7B/ \
--output-prefix ./finetune_dataset/Aquila2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full parameters fine-tuning is basically the same as that for `pretrain_aquila2_7b_ptd.sh`.*The difference is that the dataset and the training parameter `--is-instruction-dataset` are added.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
DATA_PATH="./finetune_dataset/Aquila2-7B/alpaca"
CKPT_LOAD_DIR="./ckpt/Aquila2-7B/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
<h3 id="7b-performance">Performance</h3>
<h4 id="7b-throughput">Machine performance</h4>
The performance of Aquila2-7B in Ascend NPU and reference device:
| Device | Model | Iterations | throughput rate (tokens/p/s) | single iteration step time (s/step) |
| --------- | ---------- | ---------- | ---------------------------- | ----------------------------------- |
| NPU | Aquila2-7B | 5000 | 3323 | 4.93 |
| Reference | Aquila2-7B | 5000 | 2673 | 6.13 |
<h2 id="7b-inference">Inference</h2>
We support MindSpeed Inference for text generation with Aquila 7B model.
Inference is different from pre-training because it requires loading the pre-trained model weights. Therefore, we need to complete the aforementioned model weight conversion task first, then configure the Aquila2-7B Inference shell script `examples/aquila2/generate_aquila2_7b_ptd.sh`. "CKPT_LOAD_DIR" must point to the converted weights directory, and "TOKENIZER_PATH" must point to the directory which contains Aquila vocabulary files -- in our example, it is "./model_from_hf/Aquila2-7B/". In your operation, please fill in correct value based on your actual scenario.
```shell
# please change to actual values
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-7B/"
```
Start Aquila2-7B Inference:
```shell
bash ./examples/aquila2/generate_aquila2_7b_ptd.sh
```
Sample results of Aquila2-7B Inference:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila2/aquila2-7b-generate.png)
<h2 id="7b-evaluation">Evaluation</h2>
We use BoolQ benchmark to evaluate our model. You can [go to the BoolQ Benchmark page](https://github.com/google-research-datasets/boolean-questions) and find the [dataset](https://storage.cloud.google.com/boolq/dev.jsonl), download it and save it. For example, save to "ModelLink/boolq/test" directory
The evaluation task is similar to the inference task; it also requires loading the pre-trained model weights. Please note that the model structure parameters used in converting weights should be consistent with those used in running the evaluation task.
After weight conversion is complete, we configure the Aquila2-7B evaluation script `examples/aquila2/evaluate_aquila2_7b_ptd.sh`. We need to correctly specify the path to load weights, the path to tokenizer and vocab, and so on (the following example is for reference only)
```shell
CKPT_LOAD_DIR="./model_weights/Aquila2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-7B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"
```
Start evaluation task
```shell
bash ./examples/aquila2/evaluate_aquila2_7b_ptd.sh
```
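The evaluation script writes its output through `tee`, so the progress and the final score can be followed from the log file while the task runs. The file name below assumes the `tee` target used by the evaluation scripts in this repository and TASK set to "boolq".
```shell
# follow the evaluation log (file name follows the script's tee target)
tail -f logs/eval_aquila2_7b_boolq_ptd.log
```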
Sample Aquila2-7B results running on **Ascend NPU**:
| Task | Model | NPU | Benchmark |
| ---------------------------------------------------------------------- | --------- | ----- | --------- |
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Aquila2-7B | 77.8% | 77.6% |
<h1 id="34b">Aquila2-34B</h1>
<h2 id="34b-training">Training</h2>
Here's a hardware summary of pre-training Aquila2-34B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 16 x Ascend NPUs |
<h3 id="34b-script">Script</h3>
1. Clone the repository to your local server and switch to the ModelLink branch:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build conda environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch, torch_npu and apex
pip install torch-2.2.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# source the set_env.sh file based on your host settings (you may need to change the path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# clone the MindSpeed source code, enter the directory, then install the mindspeed package from source
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
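Before converting weights it is worth confirming that the NPU stack is actually visible from Python; a mismatch between torch and torch_npu versions is a common source of later failures. A minimal check, assuming the packages above installed cleanly:
```shell
# verify that torch_npu is importable and NPU devices are visible
python -c "
import torch
import torch_npu
print(torch.__version__)
print(torch.npu.is_available(), torch.npu.device_count())
"
```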
3. Download the Aquila2-34B model, config, and tokenizer from [here](https://huggingface.co/BAAI/Aquila2-34B/tree/main)
and save them to the ModelLink/model_from_hf/Aquila2-34B/ directory.
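As an alternative to downloading the files manually, the whole repository can be fetched in a single command. This is a sketch that assumes a recent `huggingface_hub` is installed and the host has network access to Hugging Face:
```shell
# optional: fetch the full Aquila2-34B repository in one step (assumes huggingface_hub is installed)
pip install -U huggingface_hub
huggingface-cli download BAAI/Aquila2-34B --local-dir ./model_from_hf/Aquila2-34B/
```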
4. Weights convert
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# please modify the path to set_env.sh based on your environment.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--load-dir ./model_from_hf/Aquila2-34B/ \
--save-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp2/ \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 2 \
--tokenizer-model ./model_from_hf/Aquila2-34B/tokenizer.json \
--params-dtype bf16
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Aquila2-34B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Aquila2-34B/mg2hg/
```
Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters `target-tensor-parallel-size` and `target-pipeline-parallel-size` according to different tasks.
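When only the parallel layout needs to change (for example, going from the tp8-pp2 slicing used for training to the tp8-pp1 slicing used below for inference and evaluation), the same converter can be run Megatron-to-Megatron. The following is only a sketch, under the assumption that `--loader megatron --saver megatron` without `--save-model-type` re-partitions the checkpoint in Megatron format; verify against your tools version.
```shell
# sketch: re-slice an existing Megatron checkpoint from tp8/pp2 to tp8/pp1 (assumption, not a verified recipe)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader megatron \
    --saver megatron \
    --load-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp2/ \
    --save-dir ./model_weights/Aquila2-34B-v0.1-tp8-pp1/ \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 1
```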
5. Pre-training
5.1 Prepare dataset
Download the Aquila2-34B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Aquila2-34B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-34B/ \
--output-prefix ./dataset/Aquila2-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
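The preprocessing run above should leave a pair of Megatron-style indexed dataset files under the output prefix; the `DATA_PATH` used in the next step points at them without the extension. A quick check (the file names assume the `--output-prefix` shown above):
```shell
# expect alpaca_text_document.bin and alpaca_text_document.idx
ls -lh ./dataset/Aquila2-34B/
```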
5.2 Pre-training
Config Aquila2-34B pre-training script: examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Aquila2-34B/"
DATA_PATH="./dataset/Aquila2-34B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Aquila2-34B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp2/"
```
- *If you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR and remove the `--load` parameter from the training script, and vice versa.*
- *If you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR and remove the `--save` parameter from the training script, and vice versa.*
- *When you want to save a checkpoint and load it in a future pre-training run, follow the "save" and "load" suggestions above.*
Launch Aquila2-34B pre-training script: examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```shell
bash examples/aquila2/pretrain_aquila2_34b_ptd_16p.sh
```
**Note**: For multi-machine training, it is necessary to set up shared storage for the data so that non-primary nodes can read the primary node's data. Alternatively, copy the data generated by the primary node to the non-primary nodes directly, as shown below.
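A straightforward way to do that copy is to mirror the processed dataset directory from the primary node to each worker before launching; the host name and destination path below are placeholders:
```shell
# copy the preprocessed dataset from the primary node to a worker node (host/path are placeholders)
rsync -av ./dataset/Aquila2-34B/ worker-node:/path/to/ModelLink/dataset/Aquila2-34B/
```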
6. Fine-tuning
6.1 Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/Aquila2-34B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Aquila2-34B/ \
--output-prefix ./finetune_dataset/Aquila2-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full parameters fine-tuning is basically the same as that for `pretrain_aquila2_34b_ptd_16p.sh`. *The difference is that the dataset and the training parameter `--is-instruction-dataset` are added.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
DATA_PATH="./finetune_dataset/Aquila2-34B/alpaca"
CKPT_LOAD_DIR="./ckpt/Aquila2-34B/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
<h3 id="34b-performance">Performance</h3>
<h4 id="34b-throughput">Machine performance</h4>
The performance of Aquila2-34B on Ascend NPUs and the reference device:
| Device | Model | Iterations | throughput rate (tokens/p/s) | single iteration step time (s/step) |
| --------- | ----------- | ---------- | ---------------------------- | ----------------------------------- |
| NPU | Aquila2-34B | 5000 | 854 | 307 |
| Reference | Aquila2-34B | 5000 | 732 | 358 |
<h2 id="34b-inference">Inference</h2>
We support MindSpeed Inference for text generation with the Aquila2-34B model.
Inference is different from pre-training in that it requires loading the pre-trained model weights. Therefore, complete the model weight conversion task described above first, then configure the Aquila2-34B inference shell script `examples/aquila2/generate_aquila2_34b_ptd.sh`. "CKPT_LOAD_DIR" must point to the converted weights directory, and "TOKENIZER_PATH" must point to the directory containing the Aquila vocabulary files -- in our example, "./model_from_hf/Aquila2-34B/". Please fill in the correct values for your actual scenario.
```shell
# please change to actual values
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-34B/"
```
Start Aquila2-34B Inference:
```shell
bash ./examples/aquila2/generate_aquila2_34b_ptd.sh
```
Sample results of Aquila2-34B Inference:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/aquila2/aquila2-34b-generate.png)
<h2 id="34b-evaluation">Evaluation</h2>
We use the BoolQ benchmark to evaluate the model. Go to the [BoolQ Benchmark page](https://github.com/google-research-datasets/boolean-questions), find the [dataset](https://storage.cloud.google.com/boolq/dev.jsonl), download it and save it, for example to the "ModelLink/boolq/test" directory.
The evaluation task is similar to the inference task in that it also requires loading the pre-trained model weights. Please note that the model structure parameters used when converting the weights should be consistent with those used when running the evaluation task.
After weight conversion is complete, configure the Aquila2-34B evaluation script `examples/aquila2/evaluate_aquila2_34b_ptd.sh`. We need to correctly specify the path to the weights, the path to the tokenizer and vocab, and so on (the following example is for reference only):
```shell
CKPT_LOAD_DIR="./model_weights/Aquila2-34B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila2-34B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"
```
Start the evaluation task:
```shell
bash ./examples/aquila2/evaluate_aquila2_34b_ptd.sh
```
Sample Aquila2-34B results running on **Ascend NPU**:
| Task | Model | NPU | Benchmark |
| ---------------------------------------------------------------------- | --------- | ----- | --------- |
| [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Aquila2-34B | 88.0% | 87.0% |

View File

@ -1,66 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
EVAL_DATA_PATH="your eval data dir"
TASK="your task name"
# Change for multinode config
TP=8
PP=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--attention-softmax-in-fp32 \
--bf16 \
--disable-bias-linear \
--exit-on-missing-checkpoint \
--ffn-hidden-size 24576 \
--group-query-attention \
--hidden-size 6144 \
--load $CKPT_LOAD_DIR \
--make-vocab-size-divisible-by 1 \
--max-new-tokens 1 \
--max-position-embeddings 4096 \
--micro-batch-size 1 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 48 \
--num-layers 60 \
--num-query-groups 8 \
--pipeline-model-parallel-size $PP \
--position-embedding-type rope \
--seq-length 4096 \
--swiglu \
--task $TASK \
--task-data-path $EVAL_DATA_PATH \
--tensor-model-parallel-size $TP \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--tokenizer-type PretrainedFromHF \
--untie-embeddings-and-output-weights \
--use-fused-rmsnorm \
| tee logs/eval_aquila2_34b_${TASK}_ptd.log

View File

@ -1,64 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
EVAL_DATA_PATH="your eval data dir"
TASK="your task name"
# Change for multinode config
TP=8
PP=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--attention-softmax-in-fp32 \
--disable-bias-linear \
--exit-on-missing-checkpoint \
--ffn-hidden-size 11008 \
--fp16 \
--hidden-size 4096 \
--load $CKPT_LOAD_DIR \
--make-vocab-size-divisible-by 1 \
--max-new-tokens 1 \
--max-position-embeddings 2048 \
--micro-batch-size 1 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 32 \
--num-layers 32 \
--pipeline-model-parallel-size ${PP} \
--position-embedding-type rope \
--seq-length 2048 \
--swiglu \
--task $TASK \
--task-data-path $EVAL_DATA_PATH \
--tensor-model-parallel-size ${TP} \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--tokenizer-type PretrainedFromHF \
--untie-embeddings-and-output-weights \
--use-fused-rmsnorm \
| tee logs/eval_aquila2_7b_${TASK}_ptd.log

View File

@ -1,61 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
# Change for multinode config
TP=8
PP=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS inference.py \
--attention-softmax-in-fp32 \
--bf16 \
--disable-bias-linear \
--exit-on-missing-checkpoint \
--ffn-hidden-size 24576 \
--group-query-attention \
--hidden-size 6144 \
--load $CKPT_LOAD_DIR \
--make-vocab-size-divisible-by 1 \
--max-new-tokens 512 \
--max-position-embeddings 4096 \
--micro-batch-size 1 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 48 \
--num-layers 60 \
--num-query-groups 8 \
--pipeline-model-parallel-size $PP \
--position-embedding-type rope \
--seq-length 4096 \
--swiglu \
--tensor-model-parallel-size $TP \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--tokenizer-type PretrainedFromHF \
--untie-embeddings-and-output-weights \
--use-fused-rmsnorm \
| tee logs/generate_aquila2_34b_ptd.log

View File

@ -1,58 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
# Change for multinode config
TP=8
PP=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS inference.py \
--attention-softmax-in-fp32 \
--disable-bias-linear \
--exit-on-missing-checkpoint \
--ffn-hidden-size 11008 \
--hidden-size 4096 \
--load $CKPT_LOAD_DIR \
--make-vocab-size-divisible-by 1 \
--max-new-tokens 512 \
--max-position-embeddings 2048 \
--micro-batch-size 1 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 32 \
--num-layers 32 \
--pipeline-model-parallel-size ${PP} \
--position-embedding-type rope \
--seq-length 2048 \
--swiglu \
--tensor-model-parallel-size ${TP} \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--tokenizer-type PretrainedFromHF \
--untie-embeddings-and-output-weights \
--use-fused-rmsnorm \
| tee logs/generate_aquila2_7b_ptd.log

View File

@ -1,96 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_SAVE_DIR="your checkpoint save dir"
DATA_PATH="your training data dir"
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
# Change for multinode config
TP=8
PP=2
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=2
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--attention-dropout 0.0 \
--attention-softmax-in-fp32 \
--bf16 \
--clip-grad 1.0 \
--disable-bias-linear \
--ffn-hidden-size 24576 \
--global-batch-size 1024 \
--group-query-attention \
--hidden-dropout 0.0 \
--hidden-size 6144 \
--init-method-std 0.01 \
--initial-loss-scale 524288.0 \
--lr 8.0e-8 \
--lr-decay-style cosine \
--lr-warmup-fraction 0.01 \
--make-vocab-size-divisible-by 1 \
--max-position-embeddings 4096 \
--micro-batch-size 2 \
--min-lr 1.0e-8 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 48 \
--num-layers 60 \
--num-query-groups 8 \
--pipeline-model-parallel-size ${PP} \
--position-embedding-type rope \
--seq-length 4096 \
--sequence-parallel \
--swiglu \
--tensor-model-parallel-size ${TP} \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-type PretrainedFromHF \
--train-iters 2000 \
--untie-embeddings-and-output-weights \
--use-flash-attn \
--use-fused-rmsnorm \
--use-mc2 \
--weight-decay 1e-2 \
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 1000 \
--eval-interval 1000 \
--eval-iters 0
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load $CKPT_LOAD_DIR \
| tee logs/train_aquila2_34b_ptd.log

View File

@ -1,94 +0,0 @@
#!/bin/bash
# See README, please remember to source the set_env.sh file in CLI, or here
# source /path/to/your/ascend-toolkit/set_env.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CKPT_SAVE_DIR="your checkpoint save dir"
DATA_PATH="your training data dir"
CKPT_LOAD_DIR="your checkpoint load dir"
TOKENIZER_PATH="your tokenizer path"
# Change for multinode config
TP=8
PP=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--attention-dropout 0.0 \
--attention-softmax-in-fp32 \
--clip-grad 1.0 \
--disable-bias-linear \
--ffn-hidden-size 11008 \
--fp16 \
--global-batch-size 64 \
--hidden-dropout 0.0 \
--hidden-size 4096 \
--init-method-std 0.01 \
--initial-loss-scale 65536 \
--lr 1.0e-7 \
--lr-decay-style cosine \
--lr-warmup-fraction 0.01 \
--make-vocab-size-divisible-by 1 \
--max-position-embeddings 2048 \
--micro-batch-size 8 \
--min-lr 1.0e-8 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--norm-epsilon 1e-5 \
--normalization RMSNorm \
--num-attention-heads 32 \
--num-layers 32 \
--pipeline-model-parallel-size ${PP} \
--position-embedding-type rope \
--seq-length 2048 \
--sequence-parallel \
--swiglu \
--tensor-model-parallel-size ${TP} \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-type PretrainedFromHF \
--train-iters 2000 \
--untie-embeddings-and-output-weights \
--use-flash-attn \
--use-fused-rmsnorm \
--use-mc2 \
--weight-decay 1e-1 \
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 1000 \
--eval-interval 1000 \
--eval-iters 0
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load $CKPT_LOAD_DIR \
| tee logs/train_aquila2_7b_ptd.log

View File

@ -1,4 +1,4 @@
# BaiChuan
# BaiChuan $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
@ -22,6 +22,7 @@
- [Script](#脚本)
- [Performance](#性能)
- [Throughput](#吞吐)
- [LoRA fine-tuning](#Lora微调)
- [Inference](#推理)
- [Evaluation](#评估)
@ -37,146 +38,148 @@ Baichuan-7B 训练的硬件配置如下:
### Script
1. Clone the repository to your local server
1. Clone the repository to your local server
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build the environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
# 安装其余依赖库
pip install -r requirements.txt
```
3. (Optional) Prepare the pretrained weights
Download the pretrained weights from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main):
Download the pretrained weights from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main):
```shell
mkdir ./model_from_hf/Baichuan-7B/
cd ./model_from_hf/Baichuan-7B/
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/Baichuan-7B/
cd ./model_from_hf/Baichuan-7B/
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ../../
```
4. Weight conversion
Convert the model weights from HuggingFace format to Megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
Convert the model weights from HuggingFace format to Megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Baichuan-7B/ \
--save-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-7B/tokenizer.model \
--w-pack True
```
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Baichuan-7B/ \
--save-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-7B/tokenizer.model \
--w-pack True
```
Convert Megatron weights with any parallel slicing strategy to HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Convert Megatron weights with any parallel slicing strategy to HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan-7B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan-7B/mg2hg/
```
5. Prepare the dataset
Download the BaiChuan-7B dataset from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):
Download the BaiChuan-7B dataset from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):
```shell
# 下载数据集
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# 下载数据集
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Baichuan-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-7B/ \
--output-prefix ./dataset/Baichuan-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# 处理数据
mkdir ./dataset/Baichuan-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-7B/ \
--output-prefix ./dataset/Baichuan-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Configure the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Baichuan-7B/"
DATA_PATH="./dataset/Baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
```
CKPT_SAVE_DIR="./ckpt/Baichuan-7B/"
DATA_PATH="./dataset/Baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
```
7. Launch the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh
```shell
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: If you use multi-machine training without shared data storage, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed parameters are used to decide whether non-primary nodes need to load data, and the corresponding cache is checked and data generated.
```shell
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: If you use multi-machine training without shared data storage, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed parameters are used to decide whether non-primary nodes need to load data, and the corresponding cache is checked and data generated.
### Performance
@ -193,7 +196,7 @@ Baichuan-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
## Inference
First, configure the Baichuan-7B inference script: examples/baichuan/generate_baichuan_7b_ptd.sh
First, configure the Baichuan-7B inference script: tasks/inference/generate_baichuan_7b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
@ -207,12 +210,12 @@ TOKENIZER_PATH="./model_from_hf/Baichuan-7B/"
Then launch generate_baichuan_7b_ptd.sh directly
```bash
bash examples/baichuan/generate_baichuan_7b_ptd.sh
bash tasks/inference/generate_baichuan_7b_ptd.sh
```
An inference example is shown below:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan/baichuan_7B_inference.png)
![Inference](../../sources/images/baichuan/baichuan_7B_inference.png)
## Evaluation
@ -228,7 +231,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan/evaluate_baichuan_7B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan_7B_ptd.sh
```
<table>
@ -266,152 +269,154 @@ Baichuan-13B 训练的硬件配置如下:
1. Clone the repository to your local server
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build the environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
# 安装其余依赖库
pip install -r requirements.txt
```
```
**Note:** If the error `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'` appears while running the later tasks, run the following command to fix it:
**Note:** If the error `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'` appears while running the later tasks, run the following command to fix it:
```shell
pip install transformers==4.32.0 --force
```
```shell
pip install transformers==4.32.0 --force
```
3. (Optional) Prepare the pretrained weights
Download the pretrained weights from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main)
Download the pretrained weights from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main)
```shell
mkdir ./model_from_hf/Baichuan-13B/
cd ./model_from_hf/Baichuan-13B/
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ../../
```
```shell
mkdir ./model_from_hf/Baichuan-13B/
cd ./model_from_hf/Baichuan-13B/
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ../../
```
4. Weight conversion
Convert the BaiChuan-13B model weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
Convert the BaiChuan-13B model weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan-13B/ \
--save-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan-13B/ \
--save-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
Convert Megatron weights with any parallel slicing strategy to HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Convert Megatron weights with any parallel slicing strategy to HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-13B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan-13B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-13B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan-13B/mg2hg/
```
5. Prepare the dataset
Download the Baichuan-13B [dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
Download the Baichuan-13B [dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
mkdir ./dataset/Baichuan-13B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-13B/ \
--output-prefix ./dataset/Baichuan-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
mkdir ./dataset/Baichuan-13B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-13B/ \
--output-prefix ./dataset/Baichuan-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Configure the Baichuan-13B training script (Baichuan-13B does not yet support Flash Attention): examples/baichuan/pretrain_baichuan_ptd_13B.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Baichuan-13B/"
DATA_PATH="./dataset/Baichuan-13B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-13B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
```
CKPT_SAVE_DIR="./ckpt/Baichuan-13B/"
DATA_PATH="./dataset/Baichuan-13B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-13B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
```
7. Launch the Baichuan-13B training script: examples/baichuan/pretrain_baichuan_ptd_13B.sh
```bash
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
**Note**: If you use multi-machine training without shared data storage, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed parameters are used to decide whether non-primary nodes need to load data, and the corresponding cache is checked and data generated.
```bash
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
**Note**: If you use multi-machine training without shared data storage, add the `--no-shared-storage` parameter to the training launch script. With this parameter set, the distributed parameters are used to decide whether non-primary nodes need to load data, and the corresponding cache is checked and data generated.
### Performance
@ -429,7 +434,7 @@ Baichuan-13B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
## Inference
Configure the baichuan-13B inference script: examples/baichuan/generate_baichuan_13b_ptd.sh
Configure the baichuan-13B inference script: tasks/inference/generate_baichuan_13b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
@ -443,11 +448,11 @@ TOKENIZER_PATH="./model_from_hf/Baichuan-13B/"
Then launch generate_baichuan_13b_ptd.sh directly
```bash
bash examples/baichuan/generate_baichuan_13b_ptd.sh
bash tasks/inference/generate_baichuan_13b_ptd.sh
```
An inference example is shown below:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan/baichuan_13B_inference.png)
![Inference](../../sources/images/baichuan/baichuan_13B_inference.png)
## Evaluation
@ -463,7 +468,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan/evaluate_baichuan_13B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan_13B_ptd.sh
```
<table>

View File

@ -1,4 +1,4 @@
# BaiChuan
# BaiChuan $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
@ -40,143 +40,146 @@ Here's a hardware summary of pre-training Baichuan-7B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the Baichuan-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main)
Download the Baichuan-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main)
```shell
mkdir ./model_from_hf/Baichuan-7B/
cd ./model_from_hf/Baichuan-7B/
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/Baichuan-7B/
cd ./model_from_hf/Baichuan-7B/
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ../../
```
4. Weights convert
In order to adapt to the Baichuan-7B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
In order to adapt to the Baichuan-7B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Baichuan-7B/ \
--save-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-7B/tokenizer.model \
--w-pack True
```
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Baichuan-7B/ \
--save-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan-7B/tokenizer.model \
--w-pack True
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan-7B/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan-7B/mg2hg/
```
5. Prepare dataset
Download the Baichuan-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
Download the Baichuan-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Baichuan-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-7B/ \
--output-prefix ./dataset/Baichuan-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# process datasets
mkdir ./dataset/Baichuan-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-7B/ \
--output-prefix ./dataset/Baichuan-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Config Baichuan-7B pre-training script : examples/baichuan/pretrain_baichuan_ptd_7B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Baichuan-7B/"
DATA_PATH="./dataset/Baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
```
CKPT_SAVE_DIR="./ckpt/Baichuan-7B/"
DATA_PATH="./dataset/Baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
```
7. Launch Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh
7. Launch Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh
```shell
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: If using multi-machine training with no data sharing configuration on the machines, it is necessary to add the parameter `--no-shared-storage`. This parameter determines, based on the distributed parameters, whether non-master nodes need to load data, and checks the corresponding cache and generated data.
```shell
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
**Note**: If using multi-machine training with no data sharing configuration on the machines, it is necessary to add the parameter `--no-shared-storage`. This parameter determines, based on the distributed parameters, whether non-master nodes need to load data, and checks the corresponding cache and generated data.
### Performance
@ -193,7 +196,7 @@ The performance of Baichuan-7B in **Ascend NPU** and **Reference**:
## Inference
Config Baichuan-7B inference script: examples/baichuan/generate_baichuan_7b_ptd.sh
Config Baichuan-7B inference script: tasks/inference/generate_baichuan_7b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
@ -204,15 +207,15 @@ CHECKPOINT="./model_weights/Baichuan-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan-7B/"
```
Launch Baichuan-7B inference script: examples/baichuan/generate_baichuan_7b_ptd.sh
Launch Baichuan-7B inference script: tasks/inference/generate_baichuan_7b_ptd.sh
```bash
bash examples/baichuan/generate_baichuan_7b_ptd.sh
bash tasks/inference/generate_baichuan_7b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan/baichuan_7B_inference.png)
![Inference](../../sources/images/baichuan/baichuan_7B_inference.png)
## Evaluation
@ -228,7 +231,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan/evaluate_baichuan_7B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan_7B_ptd.sh
```
<table>
@ -268,154 +271,156 @@ Here's a hardware summary of pre-training Baichuan-13B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
#install Mindspeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
#install Mindspeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
**Note:** If the error message `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'` is displayed during script execution, run the following command to rectify the error:
**Note:** If the error message `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'` is displayed during script execution, run the following command to rectify the error:
```shell
pip install transformers==4.32.0 --force
```
```shell
pip install transformers==4.32.0 --force
```
3. Prepare pretrained weights
Download the Baichuan-13B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main)
```shell
mkdir ./model_from_hf/Baichuan-13B/
cd ./model_from_hf/Baichuan-13B/
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ../../
```
Download the Baichuan-13B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main)
```shell
mkdir ./model_from_hf/Baichuan-13B/
cd ./model_from_hf/Baichuan-13B/
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ../../
```
4. Weights convert
In order to adapt to the baichuan-13B model, the following script is used to convert the model pre-training weights.
In order to adapt to the baichuan-13B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
mkdir baichuan-13B-mt

# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./model_from_hf/Baichuan-13B/ \
    --save-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/Baichuan-13B/tokenizer.model \
    --params-dtype bf16 \
    --w-pack True
```
```shell
mkdir baichuan-13B-mt

# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./model_from_hf/Baichuan-13B/ \
    --save-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/Baichuan-13B/tokenizer.model \
    --params-dtype bf16 \
    --w-pack True
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-13B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan-13B/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan-13B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan-13B/mg2hg/
```
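To confirm that the converted checkpoint is readable as a HuggingFace model, a lightweight config load such as the sketch below can be used; the output path is an assumption based on the comment in the conversion command above.
```shell
# minimal sanity check of the converted HuggingFace-format weights
# (assumes the conversion above wrote to ./model_from_hf/Baichuan-13B/mg2hg/)
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model_from_hf/Baichuan-13B/mg2hg/', trust_remote_code=True))"
```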
5. Prepare dataset
Download the Baichuan-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd ./dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
mkdir ./dataset/Baichuan-13B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan-13B/ \
--output-prefix ./dataset/Baichuan-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
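If preprocessing succeeds, the prefix passed to `--output-prefix` should yield the `.bin`/`.idx` pair that the later `DATA_PATH` setting points to; the listing below is a sketch of the expected files, assuming the dataset's default `text` field is used.
```shell
# expected outputs, matching DATA_PATH="./dataset/Baichuan-13B/alpaca_text_document" below
ls ./dataset/Baichuan-13B/
# alpaca_text_document.bin  alpaca_text_document.idx
```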
6. Config Baichuan-13B pre-training script (Baichuan-13B does not support Flash Attention): examples/baichuan/pretrain_baichuan_ptd_13B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Baichuan-13B/"
DATA_PATH="./dataset/Baichuan-13B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan-13B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
```
7. Launch Baichuan-13B pre-training script: examples/baichuan/pretrain_baichuan_ptd_13B.sh
```bash
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
**Note**: If using multi-machine training without shared storage configured across the machines, add the parameter `--no-shared-storage`. With this parameter set, non-master nodes decide from the distributed parameters whether they need to load data, and check the corresponding cache and generate the data.
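For example, the flag can be appended to the training arguments in the pre-training script; the `GPT_ARGS` variable name below is taken from the shipped scripts, and the exact insertion point is only a sketch.
```shell
# sketch: extend the argument list in examples/baichuan/pretrain_baichuan_ptd_13B.sh
GPT_ARGS="$GPT_ARGS --no-shared-storage"
```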
### Performance
@ -431,9 +436,7 @@ The performance of the Baichuan-13B in **Ascend NPU** and **Reference**:
## Inference
Config baichuan-13B inference script: examples/baichuan/generate_baichuan_13b_ptd.sh
Config baichuan-13B inference script: tasks/inference/generate_baichuan_13b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -442,15 +445,13 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
CHECKPOINT="./model_weights/Baichuan-13B-Base-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan-13B/"
```
Launch baichuan-13B inference script: examples/baichuan/generate_baichuan_13b_ptd.sh
Launch baichuan-13B inference script: tasks/inference/generate_baichuan_13b_ptd.sh
```bash
bash examples/baichuan/generate_baichuan_13b_ptd.sh
bash tasks/inference/generate_baichuan_13b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan/baichuan_13B_inference.png)
![Inference](../../sources/images/baichuan/baichuan_13B_inference.png)
## Evaluation
@ -466,7 +467,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan/evaluate_baichuan_13B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan_13B_ptd.sh
```
<table>
View File
@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -83,6 +84,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_baichuan_13b.log
View File
@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -86,6 +87,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_baichuan_7b.log
View File
@ -1,4 +1,4 @@
# BaiChuan2
# BaiChuan2 $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
@ -35,149 +35,151 @@ Baichuan2-7B 训练的硬件配置如下:
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. (可选)准备预训练权重
从 [huggingface](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/tree/main) 下载预训练权重:
```shell
mkdir ./model_from_hf/Baichuan2-7B/
cd ./model_from_hf/Baichuan2-7B/
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/generation_utils.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00001-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00002-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer_config.json
cd ../../
```
4. 权重转换
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-7B/ \
--save-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-7B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-7B/ \
--save-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-7B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan2-7B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan2-7B/mg2hg/
```
5. 准备数据集
从 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet) 下载 Baichuan2-7B-Base 的数据集:
```shell
# 下载数据集
cd ./dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 准备数据集
mkdir ./dataset/Baichuan2-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan2-7B/ \
--output-prefix ./dataset/Baichuan2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. 配置 Baichuan2-7B 预训练脚本: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改数据集,权重,词表等路径
CKPT_SAVE_DIR="./ckpt/Baichuan2-7B/"
DATA_PATH="./dataset/Baichuan2-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan2-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
```
7. 启动 Baichuan2-7B 预训练脚本: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```shell
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
### 性能
@ -194,7 +196,7 @@ Baichuan2-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
## 推理
首先需要配置baichuan2-7B的推理脚本: examples/baichuan2/generate_baichuan2_7b_ptd.sh
首先需要配置baichuan2-7B的推理脚本: tasks/inference/generate_baichuan2_7b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
@ -208,12 +210,11 @@ TOKENIZER_PATH="./model_from_hf/Baichuan2-7B/"
然后可直接启动generate_baichuan2_7b_ptd.sh
```bash
bash examples/baichuan2/generate_baichuan2_7b_ptd.sh
bash tasks/inference/generate_baichuan2_7b_ptd.sh
```
推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan2/baichuan2_7B_inference.png)
![Inference](../../sources/images/baichuan2/baichuan2_7B_inference.png)
## 评估
@ -229,7 +230,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan2/evaluate_baichuan2_7B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan2_7B_ptd.sh
```
<table>
@ -265,148 +266,150 @@ Baichuan2-13B 训练的硬件配置如下:
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. (可选的)准备预训练权重
从 [huggingface](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/tree/main) 下载预训练权重
```shell
mkdir ./model_from_hf/Baichuan2-13B/
cd ./model_from_hf/Baichuan2-13B/
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/generation_utils.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer.model
cd ../../
```
4. 权重转换
将 BaiChuan2-13B 模型权重从 huggingface 格式转换为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-13B/ \
--save-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-13B/ \
--save-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-13B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan2-13B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-13B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Baichuan2-13B/mg2hg/
```
5. 准备数据集
下载 Baichuan2-13B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
mkdir ./dataset/Baichuan2-13B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan2-13B/ \
--output-prefix ./dataset/Baichuan2-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. 配置 Baichuan2-13B 训练脚本: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改词表,数据集, 权重等路径
CKPT_SAVE_DIR="./ckpt/Baichuan2-13B/"
DATA_PATH="./dataset/Baichuan2-13B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan2-13B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
```
7. 启动 Baichuan2-13B 训练脚本: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```bash
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
### 性能
@ -415,15 +418,15 @@ Baichuan2-13B 训练的硬件配置如下:
Baichuan2-13B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) |
|:----:|:-------------------------:|:----:|:----------------:|:--------------------:|:---------------:|
| NPUs | Baichuan2-13B | 1000 | - | 1668 | - |
| 参考 | Baichuan2-13B | - | - | 2062 | - |
|:----:|:-------------------------:|:----:|:------------------:|:--------------------:|:---------------:|
| NPUs | Baichuan2-13B | 1000 |1.83| 1310 | 4.35 |
| 参考 | Baichuan2-13B | - | - | 872 |- |
## 推理
首先需要配置baichuan2-13B的推理脚本: examples/baichuan2/generate_baichuan2_13b_ptd.sh
首先需要配置baichuan2-13B的推理脚本: tasks/inference/generate_baichuan2_13b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
@ -437,12 +440,11 @@ TOKENIZER_PATH="./model_from_hf/Baichuan2-13B/"
然后可直接启动generate_baichuan2_13b_ptd.sh
```bash
bash examples/baichuan2/generate_baichuan2_13b_ptd.sh
bash tasks/inference/generate_baichuan2_13b_ptd.sh
```
推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan2/baichuan2_13B_inference.png)
![Inference](../../sources/images/baichuan2/baichuan2_13B_inference.png)
## 评估
@ -458,7 +460,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan2/evaluate_baichuan2_13B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan2_13B_ptd.sh
```
<table>
View File
@ -1,4 +1,4 @@
# BaiChuan2
# BaiChuan2 $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
@ -37,146 +37,148 @@ Here's a hardware summary of pre-training Baichuan2-7B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the Baichuan2-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/tree/main)
```shell
mkdir ./model_from_hf/Baichuan2-7B/
cd ./model_from_hf/Baichuan2-7B/
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/generation_utils.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00001-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00002-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer_config.json
cd ../../
```
4. Weight conversion
To adapt the weights to the Baichuan2-7B model, use the following script to convert the pretrained model weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./model_from_hf/Baichuan2-7B/ \
    --save-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/Baichuan2-7B/tokenizer.model \
    --params-dtype bf16 \
    --w-pack True
```
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./model_from_hf/Baichuan2-7B/ \
    --save-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/Baichuan2-7B/tokenizer.model \
    --params-dtype bf16 \
    --w-pack True
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan2-7B/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan2-7B/mg2hg/
```
5. Prepare dataset
Download the Baichuan2-7B-Base datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Baichuan2-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan2-7B/ \
--output-prefix ./dataset/Baichuan2-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Config Baichuan2-7B pre-training script : examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the script's dataset path according to your own dataset path
CKPT_SAVE_DIR="./ckpt/Baichuan2-7B/"
DATA_PATH="./dataset/Baichuan2-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan2-7B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
```
7. Launch Baichuan2-7B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```shell
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```
**Note**: If using multi-machine training without shared storage configured across the machines, add the parameter `--no-shared-storage`. With this parameter set, non-master nodes decide from the distributed parameters whether they need to load data, and check the corresponding cache and generate the data.
### Performance
@ -192,9 +194,7 @@ The performance of Baichuan2-7B in **Ascend NPU** and **Reference**:
## Inference
Config baichuan2-7B inference script: examples/baichuan2/generate_baichuan2_7b_ptd.sh
Config baichuan2-7B inference script: tasks/inference/generate_baichuan2_7b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -203,15 +203,13 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
CHECKPOINT="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan2-7B/"
```
Launch baichuan2-7B inference script: examples/baichuan2/generate_baichuan2_7b_ptd.sh
Launch baichuan2-7B inference script: tasks/inference/generate_baichuan2_7b_ptd.sh
```bash
bash examples/baichuan2/generate_baichuan2_7b_ptd.sh
bash tasks/inference/generate_baichuan2_7b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan2/baichuan2_7B_inference.png)
![Inference](../../sources/images/baichuan2/baichuan2_7B_inference.png)
## Evaluation
@ -227,7 +225,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan2/evaluate_baichuan2_13B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan2_13B_ptd.sh
```
<table>
@ -267,145 +265,147 @@ Here's a hardware summary of pre-training Baichuan2-13B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the Baichuan2-13B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/tree/main)
```shell
mkdir ./model_from_hf/Baichuan2-13B/
cd ./model_from_hf/Baichuan2-13B/
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/generation_utils.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer.model
cd ../../
```
4. Weight conversion
To adapt the weights to the Baichuan2-13B model, use the following script to convert the pretrained model weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-13B/ \
--save-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./model_from_hf/Baichuan2-13B/ \
--save-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Baichuan2-13B/tokenizer.model \
--params-dtype bf16 \
--w-pack True
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-13B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan2-13B/mg2hg/
```
```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ./model_from_hf/Baichuan2-13B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan2-13B/mg2hg/
```
5. Prepare dataset
Download the Baichuan2-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
mkdir ./dataset/Baichuan2-13B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Baichuan2-13B/ \
--output-prefix ./dataset/Baichuan2-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Config Baichuan2-13B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the script's dataset path according to your own dataset path
CKPT_SAVE_DIR="./ckpt/Baichuan2-13B/"
DATA_PATH="./dataset/Baichuan2-13B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Baichuan2-13B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
```
7. Launch Baichuan2-13B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```bash
bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
```
**Note**: If using multi-machine training without shared storage configured across the machines, add the parameter `--no-shared-storage`. With this parameter set, non-master nodes decide from the distributed parameters whether they need to load data, and check the corresponding cache and generate the data.
### Performance
@ -414,16 +414,14 @@ Here's a hardware summary of pre-training Baichuan2-13B:
The performance of the Baichuan2-13B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
|:----:|:-------------------------:|:----:|:-----------------------------:|:----------------------------:|:-------------------------:|
| NPUs | Baichuan2-13B |1000 | - | 1668 | - |
| Reference | Baichuan2-13B |-| - | 2062 | - |
|:----:|:-------------------------:|:----:|:------------------:|:----------------------------:|:---------------:|
| NPUs | Baichuan2-13B |1000 |1.83| 1310 | 4.35 |
| Reference | Baichuan2-13B |-|-| 872 |- |
## Inference
Config baichuan2-13B inference script: examples/baichuan2/generate_baichuan2_13b_ptd.sh
Config baichuan2-13B inference script: tasks/inference/generate_baichuan2_13b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -432,15 +430,13 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
CHECKPOINT="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan2-13B/"
```
Launch baichuan2-13B inference script: examples/baichuan2/generate_baichuan2_13b_ptd.sh
Launch baichuan2-13B inference script: tasks/inference/generate_baichuan2_13b_ptd.sh
```bash
bash examples/baichuan2/generate_baichuan2_13b_ptd.sh
bash tasks/inference/generate_baichuan2_13b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/baichuan2/baichuan2_13B_inference.png)
![Inference](../../sources/images/baichuan2/baichuan2_13B_inference.png)
## Evaluation
@ -456,7 +452,7 @@ TASK="boolq"
```
```shell
bash ./examples/baichuan2/evaluate_baichuan2_13B_ptd.sh
bash ./tasks/evaluation/evaluate_baichuan2_13B_ptd.sh
```
<table>
View File
@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -39,8 +40,8 @@ GPT_ARGS="
--seq-length 4096 \
--disable-bias-linear \
--max-position-embeddings 4096 \
--micro-batch-size 2 \
--global-batch-size 128 \
--micro-batch-size 1 \
--global-batch-size 8 \
--untie-embeddings-and-output-weights \
--no-gradient-accumulation-fusion \
--make-vocab-size-divisible-by 32 \
@ -56,8 +57,6 @@ GPT_ARGS="
--normalization RMSNorm \
--use-fused-rmsnorm \
--use-flash-attn \
--use-fused-swiglu \
--use-mc2 \
--swiglu \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
@ -73,7 +72,7 @@ GPT_ARGS="
--adam-eps 1.0e-8 \
--no-load-optim \
--no-load-rng \
--bf16
--fp16
"
DATA_ARGS="
@ -93,6 +92,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_baichuan2_13b.log
View File
@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -85,6 +86,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_baichuan2_7b.log
View File
@ -1,4 +1,4 @@
# Bloom
# Bloom $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
@ -6,6 +6,8 @@
</p>
</p>
[toc]
# Bloom-7B
## 训练
@ -18,138 +20,140 @@ Bloom-7B 训练的硬件配置如下:
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp37-cp37m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 准备预训练权重
首先下载 Bloom-7B 的 [权重](https://huggingface.co/bigscience/bloom-7b1/tree/main)
```shell
mkdir ./model_from_hf/Bloom-7B/
cd ./model_from_hf/Bloom-7B/
cd tokenizer
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
4. 权重转换
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Bloom-7B/ \
--save-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--tokenizer-model None
```
```shell
python tools/checkpoint/util.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Bloom-7B/ \
--save-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--tokenizer-model None
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--save-dir ./model_from_hf/Bloom-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Bloom-7B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--save-dir ./model_from_hf/Bloom-7B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Bloom-7B/mg2hg/
```
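转换完成后,可以用下面的示意脚本快速验证 `./model_from_hf/Bloom-7B/mg2hg/` 中导出的 HuggingFace 权重能否正常加载假设环境中已安装 transformers分词器仍从原始 HF 目录读取;权重较大,加载可能耗时较长):

```bash
# 示意:验证 mg2hg 目录下导出的 HuggingFace 权重可以被 transformers 加载
python3 - <<'EOF'
from transformers import AutoModelForCausalLM, AutoTokenizer

# 分词器使用原始 HF 目录,模型权重来自转换生成的 mg2hg 子目录
tokenizer = AutoTokenizer.from_pretrained("./model_from_hf/Bloom-7B/")
model = AutoModelForCausalLM.from_pretrained("./model_from_hf/Bloom-7B/mg2hg/", torch_dtype="auto")
print("loaded:", model.config.model_type, "| vocab_size =", model.config.vocab_size)
EOF
```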
5. 准备数据集
下载 Bloom 7B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
下载 Bloom 7B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# 下载数据
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Bloom-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-7B/ \
--output-prefix ./dataset/Bloom-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# 处理数据
mkdir ./dataset/Bloom-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-7B/ \
--output-prefix ./dataset/Bloom-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
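预处理完成后,`--output-prefix` 对应的前缀下会生成二进制数据文件与索引文件(通常为 `*_text_document.bin` 和 `*_text_document.idx`,具体名称以实际输出为准),训练脚本中的 `DATA_PATH` 填写的是去掉扩展名的前缀,示意如下:

```bash
# 示意:检查预处理产物,并与训练脚本中的 DATA_PATH 对应起来
ls -lh ./dataset/Bloom-7B/
# 预期输出(假设使用默认的 text 字段)类似:
#   alpaca_text_document.bin
#   alpaca_text_document.idx

# 训练脚本中引用时不带 .bin/.idx 扩展名
DATA_PATH="./dataset/Bloom-7B/alpaca_text_document"
```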
6. 配置 Bloom-7B 预训练脚本(Bloom-7B暂不支持Flash Attention): examples/bloom/pretrain_bloom_ptd_7B.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Bloom-7B/"
DATA_PATH="./dataset/Bloom-7B/alpaca_text_document"
TOKENIZER_PATH="./model_from_hf/Bloom-7B/"
CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
```
CKPT_SAVE_DIR="./ckpt/Bloom-7B/"
DATA_PATH="./dataset/Bloom-7B/alpaca_text_document"
TOKENIZER_PATH="./model_from_hf/Bloom-7B/"
CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
```
7. 启动 Bloom-7B 预训练脚本: examples/bloom/pretrain_bloom_ptd_7B.sh
```shell
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
```shell
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
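以 7B 训练脚本为例,追加该参数后的启动命令大致如下(示意,其余参数保持脚本原样即可):

```bash
# 示意:多机且无共享存储时,在启动命令中追加 --no-shared-storage
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --no-shared-storage \
    --save $CKPT_SAVE_DIR \
    | tee logs/train_bloom_7b.log
```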
### 性能
@ -165,9 +169,7 @@ Bloom-7B
## Bloom-7B推理
首先配置Bloom-7B 推理脚本: examples/bloom/generate_bloom_ptd_7B.sh
首先配置Bloom-7B 推理脚本: tasks/inference/generate_bloom_ptd_7B.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -180,16 +182,16 @@ TOKENIZER_PATH="./model_from_hf/Bloom-7B-Base/"
然后可直接启动generate_bloom_7b_ptd.sh
```bash
bash examples/bloom/generate_bloom_7b_ptd.sh
bash tasks/inference/generate_bloom_7b_ptd.sh
```
推理示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/bloom/bloom7b-generate.png)
![Inference](../../sources/images/bloom/bloom7b-generate.png)
## Bloom-7B评测
配置Bloom-7B 评估脚本: examples/bloom/evaluate_bloom_7b_ptd.sh
配置Bloom-7B 评估脚本: tasks/evaluation/evaluate_bloom_7b_ptd.sh
```bash
# ascend-toolkit 路径
@ -206,7 +208,7 @@ TASK="your task"
启动评估
```bash
bash examples/bloom/evaluate_bloom_7B_ptd.sh
bash tasks/evaluation/evaluate_bloom_7B_ptd.sh
```
MMLU评测得分
@ -233,141 +235,142 @@ Bloom-176B 训练的硬件配置:
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp37-cp37m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp37-cp37m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
# 安装其余依赖库
pip install -r requirements.txt
```
3. 准备预训练权重
下载 Bloom-176B [权重](https://huggingface.co/bigscience/bloom/tree/main)
下载 Bloom-176B [权重](https://huggingface.co/bigscience/bloom/tree/main)
```shell
mkdir ./model_from_hf/Bloom-176B/
cd ./model_from_hf/Bloom-176B/
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
```shell
mkdir ./model_from_hf/Bloom-176B/
cd ./model_from_hf/Bloom-176B/
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
4. 权重转换
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
将模型权重文件从 HuggingFace权重 格式转化为 Megatron 权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 5 \
--load-dir ./model_from_hf/Bloom-176B/ \
--save-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--tokenizer-model None \
--params-dtype bf16
# config.json中同字段对应的key值与其他模型不一致将文件中的n_embed改为hidden_size 将num_attention_heads修改为n_head。
```
```shell
python tools/checkpoint/util.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 5 \
--load-dir ./model_from_hf/Bloom-176B/ \
--save-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--tokenizer-model None \
--params-dtype bf16
# config.json中同字段对应的key值与其他模型不一致将文件中的n_embed改为hidden_size 将num_attention_heads修改为n_head。
```
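如需按上面注释调整 config.json 中的键名,可参考下面的示意脚本(假设性示例:运行前请先备份原文件,且仅在对应键存在时才改写):

```bash
# 示意:备份并改写 Bloom-176B config.json 中与转换脚本不一致的键名
cd ./model_from_hf/Bloom-176B/
cp config.json config.json.bak
python3 - <<'EOF'
import json

with open("config.json") as f:
    cfg = json.load(f)

# n_embed -> hidden_sizenum_attention_heads -> n_head
if "n_embed" in cfg:
    cfg["hidden_size"] = cfg.pop("n_embed")
if "num_attention_heads" in cfg:
    cfg["n_head"] = cfg.pop("num_attention_heads")

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)
EOF
cd ../../
```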
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--params-dtype bf16 \
--save-dir ./model_from_hf/Bloom-176B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Bloom-176B/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--params-dtype bf16 \
--save-dir ./model_from_hf/Bloom-176B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Bloom-176B/mg2hg/
```
5. 准备数据集
下载 Bloom 176B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
下载 Bloom 176B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# 下载数据
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Bloom-176B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-176B/ \
--output-prefix ./dataset/Bloom-176B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# 处理数据
mkdir ./dataset/Bloom-176B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-176B/ \
--output-prefix ./dataset/Bloom-176B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. 配置 Bloom-176B 预训练脚本(Bloom-176B暂不支持Flash Attention): examples/bloom/pretrain_bloom_176b.sh
```shell
# 修改 MASTER_ADDR 为主节点
MASTER_ADDR=localhost
```shell
# 修改 MASTER_ADDR 为主节点
MASTER_ADDR=localhost
# 修改每个节点的节点序号,主节点序号为 0, 其余节点的序号依次增长到集群节点数量-1
NODE_RANK=0
# 修改每个节点的节点序号,主节点序号为 0, 其余节点的序号依次增长到集群节点数量-1
NODE_RANK=0
# 修改数据集路径和词表路径
TOKENIZER_NAME_OR_PATH=./model_from_hf/Bloom-176B/
DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
```
# 修改数据集路径和词表路径
TOKENIZER_NAME_OR_PATH=./model_from_hf/Bloom-176B/
DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
```
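各节点上只需差异化配置节点序号,其余保持一致。下面是一个示意(节点数与 IP 均为假设值,按 tp=8、pp=5 估算通常需要 5 台 8 卡节点,实际以集群规划为准):

```bash
# 主节点node 0上的示意配置IP 为占位值,请替换为主节点实际 IP
MASTER_ADDR=192.0.2.1
NNODES=5
NODE_RANK=0

# 其余节点node 1 ~ node 4MASTER_ADDR、NNODES 保持一致,仅 NODE_RANK 不同
MASTER_ADDR=192.0.2.1
NNODES=5
NODE_RANK=1   # 依次为 1、2、3、4
```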
7. 启动 Bloom-176B 预训练脚本: examples/bloom/pretrain_bloom_176b.sh
在集群中的每个节点上启动 examples/bloom/pretrain_bloom_176b.sh 脚本
在集群中的每个节点上启动 examples/bloom/pretrain_bloom_176b.sh 脚本
```shell
bash examples/bloom/pretrain_bloom_176b.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
```shell
bash examples/bloom/pretrain_bloom_176b.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
## 性能
@ -382,8 +385,7 @@ Bloom-176B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
## 推理
首先配置Bloom-176B 推理脚本: examples/bloom/generate_bloom_ptd_176B.sh
首先配置Bloom-176B 推理脚本: tasks/inference/generate_bloom_ptd_176B.sh
bloom 176b的推理需要5机因此要用上面的 权重转换脚本重新切分tp=8pp=5
```bash
@ -398,16 +400,16 @@ TOKENIZER_PATH="./model_from_hf/Bloom-176B/"
然后可直接启动generate_bloom_176b_ptd.sh
```bash
bash examples/bloom/generate_bloom_176b_ptd.sh
bash tasks/inference/generate_bloom_176b_ptd.sh
```
推理示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/bloom/bloom176b-generate.png)
![Inference](../../sources/images/bloom/bloom176b-generate.png)
## 评估
配置Bloom-176B 评估脚本: examples/bloom/evaluate_bloom_176b_ptd.sh
配置Bloom-176B 评估脚本: tasks/evaluation/evaluate_bloom_176b_ptd.sh
```bash
# ascend-toolkit 路径
@ -424,7 +426,7 @@ TASK="your task"
启动评估
```bash
bash examples/bloom/evaluate_bloom_176B_ptd.sh
bash tasks/evaluation/evaluate_bloom_176B_ptd.sh
```
评测得分

View File

@ -1,4 +1,4 @@
# Bloom
# Bloom $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
@ -6,6 +6,7 @@
</p>
</p>
[toc]
# Bloom-7B
@ -21,135 +22,137 @@ Here's a hardware summary of pre-training Bloom-7B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the Bloom-7B checkpoint from [here](https://huggingface.co/bigscience/bloom-7b1/tree/main)
Download the Bloom-7B checkpoint from [here](https://huggingface.co/bigscience/bloom-7b1/tree/main)
```shell
mkdir ./model_from_hf/Bloom-7B/
cd ./model_from_hf/Bloom-7B/
cd tokenizer
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
```shell
mkdir ./model_from_hf/Bloom-7B/
cd ./model_from_hf/Bloom-7B/
cd tokenizer
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
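If you prefer to fetch the complete set of files in one step instead of wget-ing them one by one, cloning the repository with Git LFS is an alternative (a sketch: it assumes git-lfs is installed, the target directory does not exist yet, and the full download is large):

```bash
# Optional alternative: clone the whole bigscience/bloom-7b1 repo with Git LFS
git lfs install
git clone https://huggingface.co/bigscience/bloom-7b1 ./model_from_hf/Bloom-7B
ls ./model_from_hf/Bloom-7B   # expect config.json, tokenizer files and the weight shards
```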
4. Weights convert
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Bloom-7B/ \
--save-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--tokenizer-model None
```
```shell
python tools/checkpoint/util.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Bloom-7B/ \
--save-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--tokenizer-model None
```
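After the conversion finishes, a quick sanity check of the save directory can catch problems before training. The sketch below assumes the usual Megatron checkpoint layout (a `latest_checkpointed_iteration.txt` file plus one `mp_rank_*` folder per tensor-parallel rank); exact names may differ with the converter version:

```bash
# Sanity-check the converted Megatron checkpoint (layout may vary by version)
CKPT_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1"
cat "$CKPT_DIR/latest_checkpointed_iteration.txt"   # typically "release" for converted weights
ls "$CKPT_DIR"/release/                             # expect mp_rank_00 ... mp_rank_07 for tp8/pp1
```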
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--save-dir ./model_from_hf/Bloom-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Bloom-7B/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--save-dir ./model_from_hf/Bloom-7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Bloom-7B/mg2hg/
```
5. Prepare dataset
Download the Bloom-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
Download the Bloom-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# download datasets
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# prepare datasets
mkdir ./dataset/Bloom-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-7B/ \
--output-prefix ./dataset/Bloom-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# prepare datasets
mkdir ./dataset/Bloom-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-7B/ \
--output-prefix ./dataset/Bloom-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Config Bloom-7B pre-training script(Bloom-7B does not support Flash Attention) : examples/bloom/pretrain_bloom_ptd_7B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/Bloom-7B/"
DATA_PATH="./dataset/Bloom-7B/alpaca_text_document"
TOKENIZER_PATH="./model_from_hf/Bloom-7B/"
CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
```
CKPT_SAVE_DIR="./ckpt/Bloom-7B/"
DATA_PATH="./dataset/Bloom-7B/alpaca_text_document"
TOKENIZER_PATH="./model_from_hf/Bloom-7B/"
CKPT_LOAD_DIR="./model_weights/Bloom-7B-v0.1-tp8-pp1/"
```
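As a hypothetical follow-up, once a run has written checkpoints into `CKPT_SAVE_DIR`, you can resume from them instead of the converted HuggingFace weights by pointing the load path at the previous save path (and, depending on the flags in your script, removing options such as `--no-load-optim` / `--no-load-rng` if you also want the optimizer state restored):

```bash
# Hypothetical resume setup: load from the directory a previous run saved into
CKPT_SAVE_DIR="./ckpt/Bloom-7B/"
CKPT_LOAD_DIR="./ckpt/Bloom-7B/"   # first run used ./model_weights/Bloom-7B-v0.1-tp8-pp1/
```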
7. Launch Bloom-7B pre-training script: examples/bloom/pretrain_bloom_ptd_7B.sh
```shell
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes decide from the distributed parameters whether they need to load data, and check and generate the corresponding cached data.
```shell
bash examples/bloom/pretrain_bloom_ptd_7B.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes decide from the distributed parameters whether they need to load data, and check and generate the corresponding cached data.
### Performance
@ -166,9 +169,7 @@ The performance of Bloom-7B in **Ascend NPU** and **Reference**:
## Inference Bloom-7B
Config Bloom-7B inference script: examples/bloom/generate_bloom_7b_ptd.sh
Config Bloom-7B inference script: tasks/inference/generate_bloom_7b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -177,20 +178,17 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
CHECKPOINT="./model_weights/Bloom-7B-Base-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Bloom-7B-Base/"
```
Launch Bloom-7B inference script: examples/bloom/generate_bloom_7b_ptd.sh
Launch Bloom-7B inference script: tasks/inference/generate_bloom_7b_ptd.sh
```bash
bash examples/bloom/generate_bloom_7b_ptd.sh
bash tasks/inference/generate_bloom_7b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/bloom/bloom7b-generate.png)
![Inference](../../sources/images/bloom/bloom7b-generate.png)
## Evaluation Bloom-7B
Config Bloom-7B evaluation script: examples/bloom/evaluate_bloom_7B_ptd.sh
Config Bloom-7B evaluation script: tasks/evaluation/evaluate_bloom_7B_ptd.sh
```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -206,7 +204,7 @@ TASK="your task"
Launch Bloom-7B evaluation script:
```bash
bash examples/bloom/evaluate_bloom_7B_ptd.sh
bash tasks/evaluation/evaluate_bloom_7B_ptd.sh
```
Evaluation results
@ -236,142 +234,144 @@ Here's a hardware summary of pre-training Bloom-176B:
1. Clone the repository to your local server
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the Bloom-176B tokenizer from [here](https://huggingface.co/bigscience/bloom/tree/main).
Download the Bloom-176B tokenizer from [here](https://huggingface.co/bigscience/bloom/tree/main).
```shell
mkdir ./model_from_hf/Bloom-176B/
cd ./model_from_hf/Bloom-176B/
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
```shell
mkdir ./model_from_hf/Bloom-176B/
cd ./model_from_hf/Bloom-176B/
wget https://huggingface.co/bigscience/bloom/resolve/main/special_tokens_map.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer.json
wget https://huggingface.co/bigscience/bloom/resolve/main/tokenizer_config.json
...
cd ../../
```
4. Weights convert
5. Weights convert
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
HuggingFace weights --> Megatron weights
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 5 \
--load-dir ./model_from_hf/Bloom-176B/ \
--save-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--tokenizer-model None \
--params-dtype bf16
```
```shell
python tools/checkpoint/util.py \
--model-type GPT \
--loader loader_bloom_hf \
--saver saver_megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 5 \
--load-dir ./model_from_hf/Bloom-176B/ \
--save-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--tokenizer-model None \
--params-dtype bf16
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--params-dtype bf16 \
--save-dir ./model_from_hf/Bloom-176B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Bloom-176B/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Bloom-176B-v0.1-pt8-pp5/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--embed-layernorm \
--params-dtype bf16 \
--save-dir ./model_from_hf/Bloom-176B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Bloom-176B/mg2hg/
```
5. Prepare dataset
Download the bloom-176b datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
Download the bloom-176b datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```shell
# download datasets
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Bloom-176B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-176B/ \
--output-prefix ./dataset/Bloom-176B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
# process datasets
mkdir ./dataset/Bloom-176B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Bloom-176B/ \
--output-prefix ./dataset/Bloom-176B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
6. Config Bloom-176B pre-training script(Bloom-176B does not support Flash Attention): examples/bloom/pretrain_bloom_176b.sh
```shell
# modify MASTER_ADDR to the IP address of the master node in the cluster.
# the master node is localhost, and the other nodes are the IP address of the master node
MASTER_ADDR=localhost
```shell
# modify MASTER_ADDR to the IP address of the master node in the cluster.
# the master node is localhost, and the other nodes are the IP address of the master node
MASTER_ADDR=localhost
# modify the rank number of a node. The rank number of the master node is 0, and the rank number of other nodes increases in ascending order.
NODE_RANK=0
# modify the rank number of a node. The rank number of the master node is 0, and the rank number of other nodes increases in ascending order.
NODE_RANK=0
# modify the datasets path and tokenizer path
TOKENIZER_NAME_OR_PATH=./model_from_hf/Bloom-176B/
DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
```
# modify the datasets path and tokenizer path
TOKENIZER_NAME_OR_PATH=./model_from_hf/Bloom-176B/
DATA_PATH=./dataset/Bloom-176B/alpaca_text_document
```
7. Launch Bloom-176B pre-training script: examples/bloom/pretrain_bloom_176b.sh
Run the examples/bloom/pretrain_bloom_176b.sh on all nodes in the cluster.
Run the examples/bloom/pretrain_bloom_176b.sh on all nodes in the cluster.
```shell
bash examples/bloom/pretrain_bloom_176b.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes decide from the distributed parameters whether they need to load data, and check and generate the corresponding cached data.
```shell
bash examples/bloom/pretrain_bloom_176b.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes decide from the distributed parameters whether they need to load data, and check and generate the corresponding cached data.
## Performance
@ -386,9 +386,7 @@ The performance of Bloom-176B in **Ascend NPU** and **Reference**:
## Inference Bloom 176B
Config Bloom-176B inference script: examples/bloom/generate_bloom_176b_ptd.sh
Config Bloom-176B inference script: tasks/inference/generate_bloom_176b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -397,23 +395,20 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
CHECKPOINT="./model_weights/Bloom-176B-v0.1-tp8-pp5/"
TOKENIZER_PATH="./model_from_hf/Bloom-176B/"
```
Launch Bloom-176B inference script: examples/bloom/generate_bloom_176b_ptd.sh
Launch Bloom-176B inference script: tasks/inference/generate_bloom_176b_ptd.sh
Bloom-176B needs 5 machines for inference, so you need to convert the weights again with the conversion script above, setting tp=8 and pp=5.
```bash
bash examples/bloom/generate_bloom_176b_ptd.sh
bash tasks/inference/generate_bloom_176b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/bloom/bloom176b-generate.png)
![Inference](../../sources/images/bloom/bloom176b-generate.png)
## Evaluation Bloom 176B
Config Bloom-176B evaluation script: examples/bloom/evaluate_bloom_176B_ptd.sh
Config Bloom-176B evaluation script: tasks/evaluation/evaluate_bloom_176B_ptd.sh
```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
@ -429,7 +424,7 @@ TASK="your task"
Launch Bloom-176B evaluation script:
```bash
bash examples/bloom/evaluate_bloom_176B_ptd.sh
bash tasks/evaluation/evaluate_bloom_176B_ptd.sh
```
Evaluation results

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export HCCL_CONNECT_TIMEOUT=1200
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -49,6 +50,7 @@ GPT_ARGS="
--lr 1.2e-4 \
--train-iters 5000 \
--init-method-std 0.0048 \
--optimize-recomp-communication-level 2 \
--hidden-dropout 0.0 \
--position-embedding-type alibi \
--normalization LayerNorm \
@ -87,6 +89,5 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_bloom_176b.log

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export HCCL_CONNECT_TIMEOUT=1200
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -44,16 +45,14 @@ GPT_ARGS="
--make-vocab-size-divisible-by 1 \
--attention-softmax-in-fp32 \
--apply-query-key-layer-scaling \
--lr 1.2e-6 \
--train-iters 2000 \
--lr 1.2e-4 \
--train-iters 200 \
--init-method-std 0.0048 \
--hidden-dropout 0.0 \
--attention-dropout 0.0 \
--position-embedding-type alibi \
--normalization LayerNorm \
--min-lr 1e-8 \
--lr-decay-iters 430000 \
--lr-decay-style cosine \
--min-lr 6e-6 \
--lr-decay-iters 200 \
--weight-decay 1e-1 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
@ -83,6 +82,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_bloom_7b.log

View File

@ -1,296 +0,0 @@
# ChatGLM3
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
# 目录
- [ChatGLM3](#ChatGLM3)
- [目录](#目录)
- [ChatGLM3-6B](#ChatGLM3-6B)
- [训练-6B](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理-6B](#推理-6B)
- [评估-6B](#评估-6B)
# ChatGLM3-6B
## 训练
ChatGLM3-6B 训练的硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 8 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 ChatGLM3-6B 的 [预训练权重和词表](https://huggingface.co/THUDM/chatglm3-6b/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/chatglm3_6b_hf/
cd ./model_from_hf/chatglm3_6b_hf/
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/config.json
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/configuration_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/modeling_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00001-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00002-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00003-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00004-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00005-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00006-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00007-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/quantization.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenization_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenizer.model
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenizer_config.json
cd ../../
```
4. 权重转换
4.1 将权重从 huggingface 格式转化为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```bash
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 权重格式转换
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader chatglm3_hf \
--saver megatron \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/chatglm3_6b_hf/ \
--save-dir ./model_weights/chatglm3_6b_tp1pp2/ \
--tokenizer-model ./model_from_hf/chatglm3_6b_hf/tokenizer.model \
--add-qkv-bias
```
注意chatglm3的--target-tensor-parallel-size跟config.json中的multi_query_attention配置有关这里multi_query_attention设置的是2。
4.2 任意并行切分策略的 Megatron 权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_chatglm3 \
--load-dir ./model_weights/chatglm3_6b_tp1pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--save-dir ./model_from_hf/chatglm3_6b_hf/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/chatglm3_6b_hf/mg2hg/
```
5. 预训练
5.1 准备数据集
下载 ChatGLM3-6B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/chatglm3_6b_hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/chatglm3_6b_hf/ \
--output-prefix ./dataset/chatglm3_6b_hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 用ptd模式预训练
配置ChatGLM3-6B PTD 预训练脚本: examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数加载和保存路径
LOAD_CHECKPOINT_PATH="./model_weights/chatglm3_6b_tp1pp2/"
SAVE_CHECKPOINT_PATH="./ckpt/chatglm3_6b_hf/"
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/" #词表路径
DATA_PATH="./dataset/chatglm3_6b_hf/alpaca_text_document" #数据集路径
```
多机运行增加参数--overlap-grad-reduce
启动 ChatGLM3-6B PTD预训练脚本: examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```shell
bash examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/chatglm3-6b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/chatglm3_6b_hf/ \
--output-prefix ./finetune_dataset/chatglm3-6b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数--is-instruction-dataset*
增加微调参数--finetune增加权重加载参数--load使微调从第一步开始。使用--tokenizer-padding-side left。修改tokenizer参数更改为以下参数
```bash
DATA_PATH="./finetune_dataset/chatglm3-6b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/chatglm3-6b-hf/"
CKPT_LOAD_DIR="./model_weights/chatglm3_6b_tp1pp2/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-padding-side left \
--tokenizer-type PretrainedFromHF \
--tokenizer-not-use-fast \
```
启动 ChatGLM3-6B 全参微调脚本: examples/chatglm3/tune_chatglm3_6B_8K.sh
```shell
bash examples/chatglm3/tune_chatglm3_6B_8K.sh
```
### 性能
#### 吞吐
ChatGLM3-6B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 序列长度 | tokens吞吐 (tokens/s/p) |
| :--: | :---------: | :------: | :---------------------: |
| NPUs | ChatGLM3-6B | 8192 | 4297 |
| 参考 | ChatGLM3-6B | 8192 | 4269 |
## 推理
我们在ChatGLM3_6B中支持推理来生成文本。
推理不同于预训练,比如我们需要加载预训练检查点和输出样本的长度:
配置 ChatGLM3-6B 推理脚本: examples/chatglm3/generate_chatglm3_6B.sh
```shell
# 修改模型权重路径以及词表路径
CHECKPOINT="./model_weights/chatglm3_6b_tp1pp2/"
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/"
```
启动推理脚本
```shell
bash ./examples/chatglm3/generate_chatglm3_6B.sh
```
推理结果示例如下:
![ChatGLM3-6B-generate.png](https://gitee.com/ascend/ModelLink/raw/master/sources/images/chatglm3/ChatGLM3-6B-generate.png)
## 评估
使用mmlu基准来评估模型。mmlu基准[下载](https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU/data/test).
因评估代码限制,参考 4.1权重转换 设置--target-tensor-parallel-size 2 --target-pipeline-parallel-size 4做权重转换保存新权重到chatglm3_6b_tp2pp4目录。
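参考 4.1 的命令改为 tp=2、pp=4 并保存到新目录的转换命令大致如下(示意,除并行度与保存路径外,其余参数与 4.1 保持一致):

```bash
# 示意:按 4.1 的权重转换命令改为 tp=2、pp=4保存到 chatglm3_6b_tp2pp4
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader chatglm3_hf \
    --saver megatron \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 4 \
    --load-dir ./model_from_hf/chatglm3_6b_hf/ \
    --save-dir ./model_weights/chatglm3_6b_tp2pp4/ \
    --tokenizer-model ./model_from_hf/chatglm3_6b_hf/tokenizer.model \
    --add-qkv-bias
```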
配置chatglm3-6b评估脚本: examples/chatglm3/evaluate_chatglm3_6B.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/" #词表路径
CHECKPOINT="./model_weights/chatglm3_6b_tp2pp4/" #模型路径
# 配置任务和数据集路径
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
启动评估
```bash
bash examples/chatglm3/evaluate_chatglm3_6B.sh
```
| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
|:---:|:---:|:---:|:-----------------------------------------:|:------:|
| MMLU | 57 | 14042 | [61.4](https://github.com/THUDM/ChatGLM3) | 61.5 |

View File

@ -1,291 +0,0 @@
# ChatGLM
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
# Contents
- [ChatGLM3](#ChatGLM3)
- [Contents](#contents)
- [ChatGLM3-6B](#ChatGLM3-6b)
- [Training-6B](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference-6B](#inference-6b)
- [Evaluation-6B](#evaluation-6b)
# ChatGLM3-6B
## Training
Here's a hardware summary of pre-training ChatGLM3-6B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights and tokenizer
Download the ChatGLM3-6B checkpoint from [here](https://huggingface.co/THUDM/chatglm3-6b/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/chatglm3_6b_hf/
cd ./model_from_hf/chatglm3_6b_hf/
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/config.json
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/configuration_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/modeling_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00001-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00002-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00003-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00004-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00005-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00006-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model-00007-of-00007.bin
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/quantization.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenization_chatglm.py
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenizer.model
wget https://huggingface.co/THUDM/chatglm3-6b/resolve/main/tokenizer_config.json
cd ../../
```
4. weight conversion in ptd mode
4.1 Convert weights from HuggingFace format to Megatron format
***This scenario is generally used to enable the open-source HuggingFace model to be trained on Megatron***
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert to ptd weights
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader chatglm3_hf \
--saver megatron \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/chatglm3_6b_hf/ \
--save-dir ./model_weights/chatglm3_6b_tp1pp2/ \
--tokenizer-model ./model_from_hf/chatglm3_6b_hf/tokenizer.model \
--add-qkv-bias
```
Note: The --target-tensor-parallel-size for chatglm3 is constrained by the multi-query attention configuration in config.json; the number of query groups configured here is 2.
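A quick way to check the relevant fields before choosing the conversion parallelism is to print the multi-query attention settings from the downloaded config (a sketch; the key names below are taken from the chatglm3-6b HuggingFace config and should be treated as an assumption if your config differs):

```bash
# Print the multi-query attention settings from the downloaded config.json
python3 - <<'EOF'
import json

with open("./model_from_hf/chatglm3_6b_hf/config.json") as f:
    cfg = json.load(f)

# Key names assumed from the chatglm3-6b HuggingFace config
print("multi_query_attention :", cfg.get("multi_query_attention"))
print("multi_query_group_num :", cfg.get("multi_query_group_num"))
EOF
```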
4.2 Any Megatron weights with parallel slicing strategy --> HuggingFace weights
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_chatglm3 \
--load-dir ./model_weights/chatglm3_6b_tp1pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--save-dir ./model_from_hf/chatglm3_6b_hf/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/chatglm3_6b_hf/mg2hg/
```
5. pre-training
5.1 Prepare dataset
Download the ChatGLM3-6B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/chatglm3_6b_hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/chatglm3_6b_hf/ \
--output-prefix ./dataset/chatglm3_6b_hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 pre-training using ptd mode
Config ChatGLM3-6B pre-training script: examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
LOAD_CHECKPOINT_PATH="./model_weights/chatglm3_6b_tp1pp2/"
SAVE_CHECKPOINT_PATH="./ckpt/chatglm3_6b_hf/"
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/" #tokenizer path
DATA_PATH="./dataset/chatglm3_6b_hf/alpaca_text_document" #processed dataset
```
Multi-machine training requires adding the parameter --overlap-grad-reduce, as sketched below.
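The launch command in the script would gain the flag roughly as follows (a sketch; keep the remaining arguments exactly as they are in examples/chatglm3/pretrain_chatglm3_6B_8K.sh):

```bash
# Sketch: add --overlap-grad-reduce for multi-machine runs
python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --overlap-grad-reduce \
    --save $CKPT_SAVE_DIR \
    | tee logs/train_chatglm3_6B_8K.log
```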
Launch ChatGLM3-6B pre-training script: examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```shell
bash examples/chatglm3/pretrain_chatglm3_6B_8K.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes decide from the distributed parameters whether they need to load data, and check and generate the corresponding cached data.
6. fine-tuning
6.1 Prepare fine-tuning dataset
Download the alpaca datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/chatglm3-6b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/chatglm3_6b_hf/ \
--output-prefix ./finetune_dataset/chatglm3-6b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_chatglm3_6B_8K.sh. *The difference is that the fine-tuning dataset is used and the training parameter `--is-instruction-dataset` is added.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step, and use `--tokenizer-padding-side left`.
```bash
DATA_PATH="./finetune_dataset/chatglm3-6b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/chatglm3-6b-hf/"
CKPT_LOAD_DIR="./model_weights/chatglm3_6b_tp1pp2/"
--load ${CKPT_LOAD_DIR} \
--finetune \
--is-instruction-dataset \
--tokenizer-padding-side left \
--tokenizer-type PretrainedFromHF \
--tokenizer-not-use-fast \
```
Launch ChatGLM3-6B finetune script: examples/chatglm3/tune_chatglm3_6B_8K.sh
```shell
bash examples/chatglm3/tune_chatglm3_6B_8K.sh
```
### Performance
#### Machine performance
The performance of ChatGLM3-6B in **Ascend NPU** and **Reference**:
| Device | Model | sequence length | throughput rate (tokens/s/p) |
| :--: | :--------: | :--------:|:---------------------:|
| NPUs | ChatGLM3-6B | 8192 | 4297 |
| Reference | ChatGLM3-6B | 8192 | 4269 |
## Inference
We support inference for text generation with ChatGLM3-6B.
Inference differs from pre-training; for example, it requires loading the pre-trained checkpoint and setting the length of the output samples:
Config ChatGLM3-6B inference script: examples/chatglm3/generate_chatglm3_6B.sh
```shell
# modify the model weight path and tokenizer path
CHECKPOINT="./model_weights/chatglm3_6b_tp1pp2/"
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/"
```
Launch ChatGLM3-6B inference script.
```shell
bash ./examples/chatglm3/generate_chatglm3_6B.sh
```
Some inference samples are as follows:
![ChatGLM3-6B-generate.png](https://gitee.com/ascend/ModelLink/raw/master/sources/images/chatglm3/ChatGLM3-6B-generate.png)
## Evaluation
Use mmlu benchmark to evaluate our model. MMLU benchmark Download [here](https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU/data/test).
Config chatglm3-6b evaluation script: examples/chatglm3/evaluate_chatglm3_6B.sh
```bash
# ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model parameter path and vocabulary path
TOKENIZER_PATH="./model_from_hf/chatglm3_6b_hf/" # vocabulary path
CHECKPOINT="./model_weights/chatglm3_6b_tp2pp4/" # parameter path
# Configure the task type and dataset path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
Launch chatglm3-6b evaluation
```bash
bash examples/chatglm3/evaluate_chatglm3_6B.sh
```
| Task | Subset | Question | OpenSource | NPU |
|:---:|:---:|:---:|:-----------------------------------------:|:------:|
| MMLU | 57 | 14042 | [61.4](https://github.com/THUDM/ChatGLM3) | 61.5 |

View File

@ -1,59 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
DATA_PATH="./mmlu/data/test"
TASK="mmlu"
# Different tasks need different max_new_tokens values; please follow the instructions in the README.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--seq-length 8192 \
--max-new-tokens 1 \
--max-position-embeddings 32768 \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 4 \
--num-layers 28 \
--hidden-size 4096 \
--ffn-hidden-size 13696 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 2 \
--disable-bias-linear \
--add-qkv-bias \
--swiglu \
--padded-vocab-size 65024 \
--make-vocab-size-divisible-by 1 \
--position-embedding-type rope \
--use-partial-rope \
--load $CHECKPOINT \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--untie-embeddings-and-output-weights \
--seed 42 \
| tee logs/eval_chatglm3_6B_${TASK}.log

View File

@ -1,62 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
TOKENIZER_MODEL="your tokenizer.model file path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=2
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--num-layers 28 \
--hidden-size 4096 \
--ffn-hidden-size 13696 \
--seq-length 8192 \
--group-query-attention \
--num-query-groups 2 \
--num-attention-heads 32 \
--padded-vocab-size 65024 \
--make-vocab-size-divisible-by 1 \
--max-position-embeddings 32768 \
--position-embedding-type rope \
--use-partial-rope \
--disable-bias-linear \
--add-qkv-bias \
--swiglu \
--normalization RMSNorm \
--max-new-tokens 256 \
--micro-batch-size 1 \
--global-batch-size 16 \
--load "${CHECKPOINT}" \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path "${TOKENIZER_PATH}" \
--tokenizer-model "${TOKENIZER_MODEL}" \
--tokenizer-not-use-fast \
--untie-embeddings-and-output-weights \
--attention-softmax-in-fp32 \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--no-gradient-accumulation-fusion \
--exit-on-missing-checkpoint \
--seed 42 \
--fp16 \
| tee logs/generate_chatglm3_6B.log

View File

@ -1,96 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$((NPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_PATH="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=1
PP=2
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--num-layers 28 \
--hidden-size 4096 \
--ffn-hidden-size 13696 \
--num-attention-heads 32 \
--seq-length 8192 \
--micro-batch-size 1 \
--global-batch-size 128 \
--max-position-embeddings 32768 \
--padded-vocab-size 65024 \
--make-vocab-size-divisible-by 1 \
--group-query-attention \
--num-query-groups 2 \
--disable-bias-linear \
--add-qkv-bias \
--position-embedding-type rope \
--use-partial-rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--swiglu \
--use-fused-swiglu \
--use-flash-attn \
--use-distributed-optimizer \
--use-mc2 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--lr 1e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--untie-embeddings-and-output-weights \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1e-8 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--initial-loss-scale 4096 \
--adam-beta2 0.95 \
--no-gradient-accumulation-fusion \
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
--no-load-rng \
--fp16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 1000 \
--eval-iters 10 \
"
python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--save $CKPT_SAVE_DIR \
| tee logs/train_chatglm3_6B_8K.log

View File

@ -1,99 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$((NPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_PATH="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=1
PP=2
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--num-layers 28 \
--hidden-size 4096 \
--ffn-hidden-size 13696 \
--num-attention-heads 32 \
--seq-length 8192 \
--micro-batch-size 1 \
--global-batch-size 128 \
--max-position-embeddings 32768 \
--padded-vocab-size 65024 \
--make-vocab-size-divisible-by 1 \
--group-query-attention \
--num-query-groups 2 \
--disable-bias-linear \
--add-qkv-bias \
--position-embedding-type rope \
--use-partial-rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--swiglu \
--use-fused-swiglu \
--use-distributed-optimizer \
--use-mc2 \
--finetune \
--is-instruction-dataset \
--tokenizer-padding-side left \
--tokenizer-type PretrainedFromHF \
--tokenizer-not-use-fast \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--lr 1e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--untie-embeddings-and-output-weights \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1e-8 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--use-flash-attn \
--adam-beta1 0.9 \
--initial-loss-scale 4096 \
--adam-beta2 0.95 \
--no-gradient-accumulation-fusion \
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
--no-load-rng \
--fp16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 1000 \
--eval-iters 10 \
"
python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--save $CKPT_SAVE_DIR \
| tee logs/tune_chatglm3_6B_8K.log

View File

@ -1,295 +0,0 @@
# CodeLlama $\color{black}{\bf\tiny{【社区贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
# 目录
- [CodeLlama-34B](#codellama-34b)
- [训练](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理](#推理)
- [评估](#评估)
# CodeLlama-34B
## 训练
CodeLlama-34B 训练的硬件配置如下:
| 硬件 | 配置 |
|:---:|:---------------:|
| NPU | 16 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.2.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.2.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. (可选的)准备预训练权重
从 [huggingface](https://huggingface.co/codellama/CodeLlama-34b-hf/tree/main) 下载预训练权重
```shell
mkdir ./model_from_hf/CodeLlama-34B/
cd ./model_from_hf/CodeLlama-34B/
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/config.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/generation_config.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00001-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00002-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00003-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00004-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00005-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00006-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00007-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/special_tokens_map.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer.model
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer_config.json
cd ../../
```
4. 权重转换
4.1 将 CodeLlama-34B 模型权重从 huggingface 格式转换为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/CodeLlama-34B/ \
--save-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/ \
--tokenizer-model ./model_from_hf/CodeLlama-34B/tokenizer.model \
--params-dtype bf16
```
如果为单机8卡推理或者评估任务将`--target-pipeline-parallel-size`值设为`1`,将`--save-dir`值中的`pp2`改为`pp1`.
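For reference, a conversion command adjusted for single-node (8-NPU) inference or evaluation might look like the following sketch; only the two values described above change:
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/CodeLlama-34B/ \
--save-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/CodeLlama-34B/tokenizer.model \
--params-dtype bf16
```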
4.2 任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/CodeLlama-34B/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/CodeLlama-34B/mg2hg/
```
5. 预训练
5.1 准备数据集
下载 CodeLlama-34B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
mkdir ./dataset/CodeLlama-34B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/CodeLlama-34B/ \
--output-prefix ./dataset/CodeLlama-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 预训练
配置 CodeLlama-34B 训练脚本: examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/CodeLlama-34B/"
DATA_PATH="./dataset/CodeLlama-34B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/CodeLlama-34B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/"
```
启动 CodeLlama-34B 训练脚本: examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```bash
bash examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
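A minimal sketch of where the flag goes: append it to GPT_ARGS inside the training script (variable name as used in the script above), for example:
```shell
# append after GPT_ARGS is defined in examples/codellama/pretrain_codellama_34b_ptd_16p.sh
GPT_ARGS="${GPT_ARGS} --no-shared-storage"
```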
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/CodeLlama-34B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/CodeLlama-34B/ \
--output-prefix ./finetune_dataset/CodeLlama-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数`--is-instruction-dataset`和`--padded-vocab-size 32000`*
增加微调参数`--finetune`使微调从第一步开始。修改tokenizer参数去掉`--tokenizer-type Llama2Tokenizer` 和`--tokenizer-model ${TOKENIZER_MODEL}`,更改为以下参数:
```bash
DATA_PATH="./finetune_dataset/CodeLlama-34B/alpaca"
TOKENIZER_PATH="./model_from_hf/CodeLlama-34B/"
CKPT_SAVE_DIR="./ckpt/CodeLlama-34B/"
CKPT_LOAD_DIR="./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/"
--finetune \
--is-instruction-dataset \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--padded-vocab-size 32000 \
```
### 性能
#### 吞吐
CodeLlama-34B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/s) | token吞吐 (tokens/p/s) | 单步迭代时间 (s/step) |
|:----:|:------------:|:----:|:------------------:|:--------------------:|:---------------:|
| NPUs | CodeLlama-34B | - | 3.27 | 837 | 313 |
| 参考 | CodeLlama-34B | - | 2.97 | 762 | 344 |
## 推理
配置CodeLlama-34B的推理脚本: examples/codellama/generate_codellama_34b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/CodeLlama-34B/"
```
然后可直接启动generate_codellama_34b_ptd.sh
```bash
bash examples/codellama/generate_codellama_34b_ptd.sh
```
推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/codellama/codellama-34b-generate.png)
## 评估
我们使用human_eval基准来评估我们的模型。基准[下载](https://github.com/openai/human-eval/blob/master/data/HumanEval.jsonl.gz).
```shell
# 配置原始权重与词表的路径
CHECKPOINT=<origin-ckpt-path>
TOKENIZER_PATH=<tokenizer-path>
# 配置任务以及数据路径
DATA_PATH="./human_eval/"
TASK="human_eval"
```
```shell
bash ./examples/codellama/evaluate_codellama_34b_ptd.sh
```
<table>
<thead>
<tr>
<th>任务</th>
<th>模型</th>
<th>昇腾值</th>
<th>社区值</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://huggingface.co/datasets/openai_humaneval">human_eval</a></td>
<th>CodeLlama 34B</th>
<td>0.4878</td>
<td><a href="https://paperswithcode.com/sota/code-generation-on-humaneval">0.488</a></td>
</tr>
</tbody>
</table>

View File

@ -1,297 +0,0 @@
# CodeLlama $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Community】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
# Contents
- [CodeLlama-34B](#codellama-34b)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference](#inference)
- [Evaluation](#evaluation)
# CodeLlama-34B
## Training
Here's a hardware summary of pre-training CodeLlama-34B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 16 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.2.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.2.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights
Download the CodeLlama-34B checkpoint from [here](https://huggingface.co/codellama/CodeLlama-34b-hf/tree/main)
```shell
mkdir ./model_from_hf/CodeLlama-34B/
cd ./model_from_hf/CodeLlama-34B/
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/config.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/generation_config.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00001-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00002-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00003-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00004-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00005-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00006-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model-00007-of-00007.bin
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/special_tokens_map.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer.json
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer.model
wget https://huggingface.co/codellama/CodeLlama-34b-hf/resolve/main/tokenizer_config.json
cd ../../
```
4. Weights convert
4.1 In order to adapt to the CodeLlama-34B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/CodeLlama-34B/ \
--save-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/ \
--tokenizer-model ./model_from_hf/CodeLlama-34B/tokenizer.model \
--params-dtype bf16
```
For inference or evaluation tasks, set the `--target-pipeline-parallel-size` value to `1` and change the `pp2` value to `pp1` in the `--save-dir` value.
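For reference, a conversion command adjusted in this way might look like the following sketch; only the two values mentioned above change:
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/CodeLlama-34B/ \
--save-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/CodeLlama-34B/tokenizer.model \
--params-dtype bf16
```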
4.2 Convert Megatron weights of any parallel slicing strategy into HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/CodeLlama-34B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/CodeLlama-34B/mg2hg/
```
5. Pre-training
5.1 Prepare dataset
Download the CodeLlama-34B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/CodeLlama-34B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/CodeLlama-34B/ \
--output-prefix ./dataset/CodeLlama-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 Pre-training
Config CodeLlama-34B pre-training script : examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt/CodeLlama-34B/"
DATA_PATH="./dataset/CodeLlama-34B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/CodeLlama-34B/tokenizer.model"
CKPT_LOAD_DIR="./model_weights/CodeLlama-34B-v0.1-tp8-pp2/"
```
Launch CodeLlama-34B pre-training script: examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```shell
bash examples/codellama/pretrain_codellama_34b_ptd_16p.sh
```
**Note**: For multi-machine training, either set up shared storage so that non-master nodes can read the data prepared on the master node, or copy the processed data from the master node to every non-master node, as sketched below.
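For example, copying the processed dataset from the master node to a worker node could look like the following sketch, in which the hostname and destination path are placeholders:
```shell
# run on the master node; worker-node-1 and the destination path are placeholders
rsync -av ./dataset/CodeLlama-34B/ worker-node-1:/path/to/ModelLink/dataset/CodeLlama-34B/
```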
6. Fine-tuning
6.1 Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/CodeLlama-34B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/CodeLlama-34B/ \
--output-prefix ./finetune_dataset/CodeLlama-34B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_codellama_34b_ptd_16p.sh. *The differences are the dataset and the additional training parameters `--is-instruction-dataset` and `--padded-vocab-size 32000`.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step. Remove the tokenizer parameters `--tokenizer-type Llama2Tokenizer` and `--tokenizer-model ${TOKENIZER_MODEL}`, and change them to the following parameters (a usage sketch follows the block below):
```bash
DATA_PATH="./finetune_dataset/CodeLlama-34B/alpaca"
TOKENIZER_PATH="./model_from_hf/CodeLlama-34B/"
CKPT_SAVE_DIR="./ckpt/CodeLlama-34B/"
CKPT_LOAD_DIR="./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp2/"
--finetune \
--is-instruction-dataset \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--padded-vocab-size 32000 \
```
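A usage sketch follows; the tune script name below is an assumption, since this section only describes modifying the pre-training script:
```bash
# copy the pre-training script and apply the path and parameter changes listed above
cp examples/codellama/pretrain_codellama_34b_ptd_16p.sh examples/codellama/tune_codellama_34b_ptd_16p.sh
# after editing the copy, launch it
bash examples/codellama/tune_codellama_34b_ptd_16p.sh
```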
### Performance
#### Machine performance
The performance of CodeLlama-34B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (samples/s) | throughput rate (tokens/s/p) | single-step time (s/step) |
|:----:|:---------:|:----:|:---------------------:|:---------------:|:----------------:|
| NPUs | CodeLlama-34B | - | 3.27 | 837 | 313 |
| Reference | CodeLlama-34B | - | 2.97 | 762 | 344 |
## Inference
Config CodeLlama-34B inference script: examples/codellama/generate_codellama_34b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/CodeLlama-34B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/CodeLlama-34B/"
```
Launch CodeLlama-34B inference script: examples/codellama/generate_codellama_34b_ptd.sh
```bash
bash examples/codellama/generate_codellama_34b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/codellama/codellama-34b-generate.png)
## Evaluation
We use the HumanEval benchmark to evaluate our model. Benchmark [Download](https://github.com/openai/human-eval/blob/master/data/HumanEval.jsonl.gz). A filled-in configuration example is shown after the block below.
```shell
# config origin weight and vocab file path
CHECKPOINT=<origin-ckpt-path>
TOKENIZER_PATH=<tokenizer-path>
# config tasks and dataset path
DATA_PATH="./human_eval/"
TASK="human_eval"
```
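For reference, a filled-in configuration might look like the following; the paths assume the directories created in the download and conversion steps above, with `pp1` weights for single-node evaluation:
```shell
# example values only, adjust to your environment
CHECKPOINT="./model_weights/CodeLlama-34B-Base-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/CodeLlama-34B/"
DATA_PATH="./human_eval/"
TASK="human_eval"
```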
```shell
bash ./examples/codellama/evaluate_codellama_34b_ptd.sh
```
<table>
<thead>
<tr>
<th>Task</th>
<th>Model</th>
<th>NPU</th>
<th>OpenSource</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://huggingface.co/datasets/openai_humaneval">human_eval</a></td>
<th>CodeLlama 34B</th>
<td>0.4878</td>
<td><a href="https://paperswithcode.com/sota/code-generation-on-humaneval">0.488</a></td>
</tr>
</tbody>
</table>

View File

@ -1,59 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT="Your ckpt file path"
TOKENIZER_PATH="Your tokenizer path"
DATA_PATH="./human_eval/"
TASK="human_eval"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task $TASK\
--seq-length 4096 \
--max-new-tokens 1024 \
--max-position-embeddings 16384 \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 48 \
--hidden-size 8192 \
--ffn-hidden-size 22016 \
--num-attention-heads 64 \
--disable-bias-linear \
--swiglu \
--position-embedding-type rope \
--load ${CHECKPOINT} \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--use-fused-rmsnorm \
--exit-on-missing-checkpoint \
--padded-vocab-size 32000 \
--no-load-rng \
--no-load-optim \
--untie-embeddings-and-output-weights \
--no-masked-softmax-fusion \
--make-vocab-size-divisible-by 1 \
--group-query-attention \
--num-query-groups 8 \
--rotary-base 1000000 \
--instruction-template "{prompt}" \
--seed 42 | tee logs/evaluation_codellama_34b_${TASK}.log

View File

@ -1,59 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 48 \
--hidden-size 8192 \
--ffn-hidden-size 22016 \
--position-embedding-type rope \
--seq-length 4096 \
--max-new-tokens 256 \
--micro-batch-size 1 \
--global-batch-size 8 \
--num-attention-heads 64 \
--max-position-embeddings 16384 \
--swiglu \
--load "${CHECKPOINT}" \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path "${TOKENIZER_PATH}" \
--tokenizer-not-use-fast \
--fp16 \
--normalization RMSNorm \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-softmax-in-fp32 \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--no-gradient-accumulation-fusion \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 32 \
--vocab-size 32000 \
--padded-vocab-size 32000 \
--rotary-base 1000000 \
--group-query-attention \
--num-query-groups 8 \
| tee logs/generate_codellama_34b.log

View File

@ -1,98 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=2
NODE_RANK=0
WORLD_SIZE=$((NPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=2
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--num-layers 48 \
--hidden-size 8192 \
--ffn-hidden-size 22016 \
--num-attention-heads 64 \
--tokenizer-type Llama2Tokenizer \
--tokenizer-model ${TOKENIZER_MODEL} \
--seq-length 4096 \
--max-position-embeddings 16384 \
--micro-batch-size 2 \
--global-batch-size 1024 \
--make-vocab-size-divisible-by 1 \
--lr 1.0e-7 \
--train-iters 2000 \
--lr-decay-style cosine \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--use-fused-rotary-pos-emb \
--use-rotary-position-embeddings \
--use-fused-swiglu \
--use-mc2 \
--swiglu \
--use-flash-attn \
--group-query-attention \
--num-query-groups 8 \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.0e-8 \
--weight-decay 1e-2 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.999 \
--initial-loss-scale 8188.0 \
--no-gradient-accumulation-fusion \
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
--no-load-rng \
--rotary-base 1000000 \
--vocab-size 32000 \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 2000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--save $CKPT_SAVE_DIR \
| tee logs/train_codellama_34b.log

View File

@ -1,564 +0,0 @@
# Gemma $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
# 目录
- [Gemma-2B](#Gemma-2B)
- [训练](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理](#推理)
- [评估](#评估)
- [Gemma-7B](#Gemma-7B)
- [训练](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理](#推理)
- [评估](#评估)
# Gemma-2B
## 训练
Gemma-2B 训练的硬件配置:
| 硬件 | 配置 |
|:---:| :-------------: |
| NPU | 8 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.2.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 Gemma-2B 的 [预训练权重和词表](https://huggingface.co/google/gemma-2b/tree/main)
```bash
mkdir ./model_from_hf/Gemma-2B/
cd ./model_from_hf/Gemma-2B/
wget https://huggingface.co/google/gemma-2b/resolve/main/config.json
wget https://huggingface.co/google/gemma-2b/resolve/main/generation_config.json
wget https://huggingface.co/google/gemma-2b/resolve/main/model-00001-of-00002.safetensors
wget https://huggingface.co/google/gemma-2b/resolve/main/model-00002-of-00002.safetensors
wget https://huggingface.co/google/gemma-2b/resolve/main/model.safetensors.index.json
wget https://huggingface.co/google/gemma-2b/resolve/main/special_tokens_map.json
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer.json
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer.model
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer_config.json
cd ../../
```
4. 权重转换
将权重从 huggingface 格式转化为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader gemma_hf \
--saver megatron \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/Gemma-2B/ \
--save-dir ./model_weights/Gemma-2B-v0.1-tp1-pp2/ \
--tokenizer-model ./model_from_hf/Gemma-2B/tokenizer.model
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```bash
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_gemma \
--load-dir ./model_weights/Gemma-2B-v0.1-tp1-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Gemma-2B/ # 需要填入原始HF模型路径新权重会存于./model_from_hf/Gemma-2B/mg2hg/
```
5. 准备数据集
下载 Gemma-2B [数据集](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json
cd ..
# 处理数据
mkdir ./dataset/Gemma-2B/
python ./tools/preprocess_data.py \
--input ./dataset/wikipedia-cn-20230720-filtered.json \
--output-prefix ./dataset/Gemma-2B/wikipedia_cn \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-2B/ \
--json-key completion \
--workers 16 \
--log-interval 1000
```
6. 预训练
配置Gemma-2B 预训练脚本: examples/gemma/pretrain_gemma_2b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
CKPT_SAVE_DIR="./ckpt/Gemma-2B/"
TOKENIZER_MODEL="./model_from_hf/Gemma-2B/" #词表路径
DATA_PATH="./dataset/Gemma-2B/wikipedia_cn_completion_document" #数据集路径
CKPT_LOAD_DIR="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
```
启动 Gemma-2B 预训练脚本: examples/gemma/pretrain_gemma_2b_ptd.sh
```shell
bash examples/gemma/pretrain_gemma_2b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
7. 微调
7.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/fnlp/moss-003-sft-data/tree/main)
```bash
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/resolve/main/moss-003-sft-no-tools.jsonl.zip --no-check-certificate
unzip moss-003-sft-no-tools.jsonl.zip
cd ..
# 处理数据集
mkdir ./finetune_dataset/Gemma-2B/
python tools/preprocess_data.py \
--input ./finetune_dataset/moss-003-sft-no-tools.jsonl \
--output-prefix ./finetune_dataset/Gemma-2B/moss \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-2B/ \
--tokenizer-not-use-fast \
--handler-name MOSSInstructionHandler
```
7.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数--is-instruction-dataset*
增加微调参数--finetune使微调从第一步开始。
```bash
CKPT_SAVE_DIR="./ckpt/Gemma-2B/"
DATA_PATH="./finetune_dataset/Gemma-2B/moss"
TOKENIZER_PATH="./model_from_hf/Gemma-2B/"
CKPT_LOAD_DIR="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
### 性能
#### 吞吐
Gemma-2B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | tokens吞吐 (tokens/s/p) |
|:----:|:--------:|:---------------------:|
| NPUs | Gemma-2B | 6821 |
| 参考 | Gemma-2B | 7602 |
## 推理
配置 Gemma-2B 推理脚本examples/gemma/generate_gemma_2b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
TOKENIZER_PATH="./model_from_hf/Gemma-2B/"
```
启动Gemma-2B推理脚本
```bash
bash examples/gemma/generate_gemma_2b_ptd.sh
```
## 评估
使用[MMLU数据集](https://huggingface.co/datasets/cais/mmlu)评估模型.
配置Gemma-2b评估脚本: examples/gemma/evaluate_gemma_2b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
TOKENIZER_PATH="./model_from_hf/Gemma-2B/" #词表路径
CHECKPOINT="./model_weights/Gemma-2B-v0.1-tp1-pp2/" #模型路径
# 配置任务和数据集路径
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
启动评估
```bash
bash examples/gemma/evaluate_gemma_2b_ptd.sh
```
| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
|:---:|:---:|:---:|:-----:|:------:|
| MMLU | 57 | 14042 | 39.7 | 39.4 |
# Gemma-7B
## 训练
Gemma-7B 训练的硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 8 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.2.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 Gemma-7B 的 [预训练权重和词表](https://huggingface.co/google/gemma-7b/tree/main)
```bash
mkdir ./model_from_hf/Gemma-7B/
cd ./model_from_hf/Gemma-7B/
wget https://huggingface.co/google/gemma-7b/resolve/main/config.json
wget https://huggingface.co/google/gemma-7b/resolve/main/generation_config.json
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00001-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00002-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00003-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00004-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model.safetensors.index.json
wget https://huggingface.co/google/gemma-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer.json
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer.model
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. 权重转换
将权重从 huggingface 格式转化为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader gemma_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Gemma-7B/ \
--save-dir ./model_weights/Gemma-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Gemma-7B/tokenizer.model
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```bash
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_gemma \
--load-dir ./model_weights/Gemma-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Gemma-7B/ # 需要填入原始HF模型路径新权重会存于./model_from_hf/Gemma-7B/mg2hg/
```
5. 准备数据集
下载 Gemma-7B [数据集](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json
cd ..
# 处理数据
mkdir ./dataset/Gemma-7B/
python ./tools/preprocess_data.py \
--input ./dataset/wikipedia-cn-20230720-filtered.json \
--output-prefix ./dataset/Gemma-7B/wikipedia_cn \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-7B/ \
--json-key completion \
--workers 16 \
--log-interval 1000
```
6. 预训练
配置Gemma-7B 预训练脚本: examples/gemma/pretrain_gemma_7b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
CKPT_SAVE_DIR="./ckpt/Gemma-7B/"
TOKENIZER_MODEL="./model_from_hf/Gemma-7B/" #词表路径
DATA_PATH="./dataset/Gemma-7B/wikipedia_cn_completion_document" #数据集路径
CKPT_LOAD_DIR="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
```
启动 Gemma-7B 预训练脚本: examples/gemma/pretrain_gemma_7b_ptd.sh
```shell
bash examples/gemma/pretrain_gemma_7b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
7. 微调
7.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/fnlp/moss-003-sft-data/tree/main)
```bash
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/resolve/main/moss-003-sft-no-tools.jsonl.zip --no-check-certificate
unzip moss-003-sft-no-tools.jsonl.zip
cd ..
# 处理数据集
mkdir ./finetune_dataset/Gemma-7B/
python tools/preprocess_data.py \
--input ./finetune_dataset/moss-003-sft-no-tools.jsonl \
--output-prefix ./finetune_dataset/Gemma-7B/moss \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-7B/ \
--tokenizer-not-use-fast \
--handler-name MOSSInstructionHandler
```
7.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数--is-instruction-dataset*
增加微调参数--finetune使微调从第一步开始。
```bash
CKPT_SAVE_DIR="./ckpt/Gemma-7B/"
DATA_PATH="./finetune_dataset/Gemma-7B/moss"
TOKENIZER_PATH="./model_from_hf/Gemma-7B/"
CKPT_LOAD_DIR="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
7.3 Lora微调
Lora微调的脚本配置是在全参微调脚本基础上加上lora参数如下所示:
```bash
--lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \
--lora-r 16 \
--lora-alpha 32 \
```
如果模型的词表变化了,可以加上以下参数(词表不变不建议添加)
```bash
--lora-modules-to-save word_embeddings output_layer \
```
添加下列参数用于从上一个检查点恢复Lora模型继续训练:
```bash
--load ${ORIGIN_CHECKPOINT} \
--lora-load ${LORA_CHECKPOINT} \
```
启动Lora微调脚本: examples/gemma/tune_gemma_7b_ptd.sh
```shell
bash examples/gemma/tune_gemma_7b_ptd.sh
```
### 性能
#### 吞吐
Gemma-7B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | tokens吞吐 (tokens/s/p) |
|:------:|:-------:|:---------------------:|
| NPUs | Gemma-7B | 2938 |
| 参考 | Gemma-7B | 2607 |
## 推理
配置 Gemma-7B 推理脚本examples/gemma/generate_gemma_7b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Gemma-7B/"
```
配置 Gemma-7B lora推理脚本: examples/gemma/generate_gemma_7b_lora_ptd.sh
```bash
# 修改lora权重路径
CHECKPOINT_LORA="your lora model directory path"
```
启动Gemma-7B推理脚本
```bash
bash examples/gemma/generate_gemma_7b_ptd.sh
```
启动Gemma-7B lora推理脚本
```bash
bash examples/gemma/generate_gemma_7b_lora_ptd.sh
```
Lora推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/gemma/gemma-7b-lora-inference.jpg)
## 评估
使用[MMLU数据集](https://huggingface.co/datasets/cais/mmlu)评估模型.
配置Gemma-7B评估脚本: examples/gemma/evaluate_gemma_7b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
TOKENIZER_PATH="./model_from_hf/Gemma-7B/" #词表路径
CHECKPOINT="./model_weights/Gemma-7B-v0.1-tp8-pp1/" #模型路径
# 配置任务和数据集路径
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
启动评估
```bash
bash examples/gemma/evaluate_gemma_7b_ptd.sh
```
| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
|:---:|:---:|:---:|:-----:|:------:|
| MMLU | 57 | 14042 | 52.2 | 52.2 |

View File

@ -1,564 +0,0 @@
# Gemma $\color{black}{\rm\tiny{【model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
# Contents
- [Gemma-2B](#Gemma-2B)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference](#Inference)
- [Evaluation](#Evaluation)
- [Gemma-7B](#Gemma-7B)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference](#Inference)
- [Evaluation](#Evaluation)
# Gemma-2B
## Training
Here's a hardware summary of pre-training Gemma-2B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.2.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights and tokenizer
Download the Gemma-2B checkpoint from [here](https://huggingface.co/google/gemma-2b/tree/main)
```bash
mkdir ./model_from_hf/Gemma-2B/
cd ./model_from_hf/Gemma-2B/
wget https://huggingface.co/google/gemma-2b/resolve/main/config.json
wget https://huggingface.co/google/gemma-2b/resolve/main/generation_config.json
wget https://huggingface.co/google/gemma-2b/resolve/main/model-00001-of-00002.safetensors
wget https://huggingface.co/google/gemma-2b/resolve/main/model-00002-of-00002.safetensors
wget https://huggingface.co/google/gemma-2b/resolve/main/model.safetensors.index.json
wget https://huggingface.co/google/gemma-2b/resolve/main/special_tokens_map.json
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer.json
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer.model
wget https://huggingface.co/google/gemma-2b/resolve/main/tokenizer_config.json
cd ../../
```
4. Weights convert
Convert weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader gemma_hf \
--saver megatron \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 2 \
--load-dir ./model_from_hf/Gemma-2B/ \
--save-dir ./model_weights/Gemma-2B-v0.1-tp1-pp2/ \
--tokenizer-model ./model_from_hf/Gemma-2B/tokenizer.model
```
Convert Megatron weights of any parallel slicing strategy into HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_gemma \
--load-dir ./model_weights/Gemma-2B-v0.1-tp1-pp2/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Gemma-2B/ # Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Gemma-2B/mg2hg/
```
5. Prepare dataset
Download the Gemma-2B datasets from [here](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json
cd ..
# process datasets
mkdir ./dataset/Gemma-2B/
python ./tools/preprocess_data.py \
--input ./dataset/wikipedia-cn-20230720-filtered.json \
--output-prefix ./dataset/Gemma-2B/wikipedia_cn \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-2B/ \
--json-key completion \
--workers 16 \
--log-interval 1000
```
6. pre-training
Config Gemma-2B pre-training script: examples/gemma/pretrain_gemma_2b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Gemma-2B/"
TOKENIZER_MODEL="./model_from_hf/Gemma-2B/" #tokenizer path
DATA_PATH="./dataset/Gemma-2B/wikipedia_cn_completion_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
```
Launch Gemma-2B pre-training script: examples/gemma/pretrain_gemma_2b_ptd.sh
```shell
bash examples/gemma/pretrain_gemma_2b_ptd.sh
```
**Note**: If using multi-machine training without shared storage across the machines, add the parameter `--no-shared-storage`. With this parameter set, each non-master node determines from the distributed parameters whether it needs to load the data, and checks the corresponding cache or generates the data itself.
7. fine-tuning
7.1 Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/fnlp/moss-003-sft-data/tree/main)
```bash
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/resolve/main/moss-003-sft-no-tools.jsonl.zip --no-check-certificate
unzip moss-003-sft-no-tools.jsonl.zip
cd ..
# process datasets
mkdir ./finetune_dataset/Gemma-2B/
python tools/preprocess_data.py \
--input ./finetune_dataset/moss-003-sft-no-tools.jsonl \
--output-prefix ./finetune_dataset/Gemma-2B/moss \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-2B/ \
--tokenizer-not-use-fast \
--handler-name MOSSInstructionHandler
```
7.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_gemma_2b_ptd.sh. *The differences are the dataset and the additional training parameter `--is-instruction-dataset`.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
CKPT_SAVE_DIR="./ckpt/Gemma-2B/"
DATA_PATH="./finetune_dataset/Gemma-2B/moss"
TOKENIZER_PATH="./model_from_hf/Gemma-2B/"
CKPT_LOAD_DIR="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
### Performance
#### Machine performance
The performance of Gemma-2B in **Ascend NPU** and **Reference**:
| Device | Model | throughput rate (tokens/s/p) |
|:---------:|:--------:|:----------------------------:|
| NPUs | Gemma-2B | 6821 |
| Reference | Gemma-2B | 7602 |
## Inference
Config Gemma-2B inference script: examples/gemma/generate_gemma_2b_ptd.sh
```bash
# ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/Gemma-2B-v0.1-tp1-pp2/"
TOKENIZER_PATH="./model_from_hf/Gemma-2B/"
```
Launch Gemma-2B inference script: examples/gemma/generate_gemma_2b_ptd.sh
```bash
bash examples/gemma/generate_gemma_2b_ptd.sh
```
## Evaluation
We use the [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
Config Gemma-2b evaluation script: examples/gemma/evaluate_gemma_2b_ptd.sh
```bash
# ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model parameter path and vocabulary path
TOKENIZER_PATH="./model_from_hf/Gemma-2B/" # vocabulary path
CHECKPOINT="./model_weights/Gemma-2B-v0.1-tp1-pp2/" # parameter path
# Configure the task type and dataset path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
Launch Gemma-2B evaluation
```bash
bash examples/gemma/evaluate_gemma_2b_ptd.sh
```
| Task | Subset | Question | OpenSource | NPU |
|:---:|:---:|:---:|:----------:|:----:|
| MMLU | 57 | 14042 | 39.7 | 39.4 |
# Gemma-7B
## Training
Here's a hardware summary of pre-training Gemma-7B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.2.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.2.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights and tokenizer
Download the Gemma-7B checkpoint from [here](https://huggingface.co/google/gemma-7b/tree/main)
```bash
mkdir ./model_from_hf/Gemma-7B/
cd ./model_from_hf/Gemma-7B/
wget https://huggingface.co/google/gemma-7b/resolve/main/config.json
wget https://huggingface.co/google/gemma-7b/resolve/main/generation_config.json
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00001-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00002-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00003-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model-00004-of-00004.safetensors
wget https://huggingface.co/google/gemma-7b/resolve/main/model.safetensors.index.json
wget https://huggingface.co/google/gemma-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer.json
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer.model
wget https://huggingface.co/google/gemma-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. Weights convert
Convert weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader gemma_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/Gemma-7B/ \
--save-dir ./model_weights/Gemma-7B-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Gemma-7B/tokenizer.model
```
Convert Megatron weights of any parallel slicing strategy into HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_gemma \
--load-dir ./model_weights/Gemma-7B-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/Gemma-7B/ # Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Gemma-7B/mg2hg/
```
5. Prepare dataset
Download the Gemma-7B datasets from [here](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json
cd ..
# process datasets
mkdir ./dataset/Gemma-7B/
python ./tools/preprocess_data.py \
--input ./dataset/wikipedia-cn-20230720-filtered.json \
--output-prefix ./dataset/Gemma-7B/wikipedia_cn \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-7B/ \
--json-key completion \
--workers 16 \
--log-interval 1000
```
6. pre-training
Config Gemma-7B pre-training script: examples/gemma/pretrain_gemma_7b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Gemma-7B/"
TOKENIZER_MODEL="./model_from_hf/Gemma-7B/" #tokenizer path
DATA_PATH="./dataset/Gemma-7B/wikipedia_cn_completion_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
```
Launch Gemma-7B pre-training script: examples/gemma/pretrain_gemma_7b_ptd.sh
```shell
bash examples/gemma/pretrain_gemma_7b_ptd.sh
```
**Note**: If using multi-machine training without shared storage across the machines, add the parameter `--no-shared-storage`. With this parameter set, each non-master node determines from the distributed parameters whether it needs to load the data, and checks the corresponding cache or generates the data itself.
7. fine-tuning
7.1 Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/fnlp/moss-003-sft-data/tree/main)
```bash
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/resolve/main/moss-003-sft-no-tools.jsonl.zip --no-check-certificate
unzip moss-003-sft-no-tools.jsonl.zip
cd ..
# process datasets
mkdir ./finetune_dataset/Gemma-7B/
python tools/preprocess_data.py \
--input ./finetune_dataset/moss-003-sft-no-tools.jsonl \
--output-prefix ./finetune_dataset/Gemma-7B/moss \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Gemma-7B/ \
--tokenizer-not-use-fast \
--handler-name MOSSInstructionHandler
```
7.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_gemma_7b_ptd.sh. *The differences are the dataset and the additional training parameter `--is-instruction-dataset`.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
CKPT_SAVE_DIR="./ckpt/Gemma-7B/"
DATA_PATH="./finetune_dataset/Gemma-7B/moss"
TOKENIZER_PATH="./model_from_hf/Gemma-7B/"
CKPT_LOAD_DIR="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
7.3 Lora Fine-Tuning
The Lora fine-tuning script is configured by adding the following lora parameters to the pretrain_gemma_7b_ptd.sh script:
```bash
--lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \
--lora-r 16 \
--lora-alpha 32 \
```
If the model's vocabulary has been changed, add the following parameters (not recommended if the vocabulary is unchanged):
```bash
--lora-modules-to-save word_embeddings output_layer \
```
Add the following parameters to resume Lora training from a previous checkpoint:
```bash
--load ${ORIGIN_CHECKPOINT} \
--lora-load ${LORA_CHECKPOINT} \
```
Launch Gemma-7B lora fine-tuning script: examples/gemma/tune_gemma_7b_ptd.sh
```shell
bash examples/gemma/tune_gemma_7b_ptd.sh
```
### Performance
#### Machine performance
The performance of Gemma-7B in **Ascend NPU** and **Reference**:
| Device | Model | throughput rate (tokens/s/p) |
|:---------:|:-------:|:----------------------------:|
| NPUs | Gemma-7B | 2938 |
| Reference | Gemma-7B | 2607 |
## Inference
Config Gemma-7B inference script: examples/gemma/generate_gemma_7b_ptd.sh
```bash
# ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/Gemma-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Gemma-7B/"
```
Config Gemma-7B lora inference script: examples/gemma/generate_gemma_7b_lora_ptd.sh
```bash
# modify lora model path
CHECKPOINT_LORA="your lora model directory path"
```
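If the Lora weights were produced by the fine-tuning step above, the path would typically be the CKPT_SAVE_DIR used there; an assumed example:
```bash
# assumes the Lora fine-tuning run saved its checkpoints to ./ckpt/Gemma-7B/
CHECKPOINT_LORA="./ckpt/Gemma-7B/"
```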
Launch Gemma-7B inference script: examples/gemma/generate_gemma_7b_ptd.sh
```bash
bash examples/gemma/generate_gemma_7b_ptd.sh
```
Launch Gemma-7B lora inference script: examples/gemma/generate_gemma_7b_lora_ptd.sh
```bash
bash examples/gemma/generate_gemma_7b_lora_ptd.sh
```
Some lora inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/gemma/gemma-7b-lora-inference.jpg)
## Evaluation
We use the [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
Config Gemma-7B evaluation script: examples/gemma/evaluate_gemma_7b_ptd.sh
```bash
# ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model parameter path and vocabulary path
TOKENIZER_PATH="./model_from_hf/Gemma-7B/" # vocabulary path
CHECKPOINT="./model_weights/Gemma-7B-v0.1-tp8-pp1/" # parameter path
# Configure the task type and dataset path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
Launch Gemma-7B evaluation
```bash
bash examples/gemma/evaluate_gemma_7b_ptd.sh
```
| Task | Subset | Question | OpenSource | NPU |
|:---:|:---:|:---:|:----------:|:----:|
| MMLU | 57 | 14042 | 52.2 | 52.2 |
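The single number reported above summarises all 14042 test questions across the 57 subsets. A sketch of one common way to aggregate (pooling questions across subsets), with hypothetical per-subset counts:
```python
# Hypothetical per-subset results: subset -> (num_correct, num_questions).
results = {
    "abstract_algebra": (31, 100),
    "anatomy": (74, 135),
    # ... one entry per MMLU subset, 57 in total
}

total_correct = sum(correct for correct, _ in results.values())
total_questions = sum(total for _, total in results.values())
print(f"overall accuracy: {100.0 * total_correct / total_questions:.1f}%")
```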

View File

@ -1,59 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
# distributed config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=2
# modify script model path and tokenizer path
TOKENIZER_PATH="your tokenizer directory path"
CHECKPOINT="your model directory path"
# configure task and data path
DATA_PATH="/../mmlu/test/"
TASK="mmlu"
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--load ${CHECKPOINT} \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--num-layers 18 \
--hidden-size 2048 \
--ffn-hidden-size 16384 \
--num-attention-heads 8 \
--group-query-attention \
--num-query-groups 1 \
--kv-channels 256 \
--max-position-embeddings 8192 \
--seq-length 8192 \
--max-new-tokens 1 \
--geglu \
--position-embedding-type rope \
--disable-bias-linear \
--normalization RMSNorm \
--add-rmsnorm-offset \
--input-embeds-norm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--norm-epsilon 1e-06 \
--evaluation-batch-size 1 \
--micro-batch-size 1 \
--no-masked-softmax-fusion \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--vocab-size 256000 \
--make-vocab-size-divisible-by 1 \
--bf16 \
--seed 42 | tee logs/evaluation_gemma_2b_${TASK}.log

View File

@ -1,58 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
# distributed config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
# modify script model path and tokenizer path
TOKENIZER_PATH="your tokenizer directory path"
CHECKPOINT="your model directory path"
# configure task and data path
DATA_PATH="/../mmlu/test/"
TASK="mmlu"
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--load ${CHECKPOINT} \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 28 \
--hidden-size 3072 \
--ffn-hidden-size 24576 \
--num-attention-heads 16 \
--kv-channels 256 \
--max-position-embeddings 8192 \
--seq-length 8192 \
--max-new-tokens 1 \
--geglu \
--position-embedding-type rope \
--disable-bias-linear \
--normalization RMSNorm \
--add-rmsnorm-offset \
--input-embeds-norm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--norm-epsilon 1e-06 \
--evaluation-batch-size 1 \
--micro-batch-size 1 \
--use-fused-rmsnorm \
--no-masked-softmax-fusion \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--vocab-size 256000 \
--make-vocab-size-divisible-by 1 \
--bf16 \
--seed 42 | tee logs/evaluation_gemma_7b_${TASK}.log

View File

@ -1,56 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export WITHOUT_JIT_COMPILE=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=2
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--load ${CHECKPOINT} \
--num-layers 18 \
--hidden-size 2048 \
--kv-channels 256 \
--group-query-attention \
--num-query-groups 1 \
--ffn-hidden-size 16384 \
--num-attention-heads 8 \
--position-embedding-type rope \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--geglu \
--input-embeds-norm \
--micro-batch-size 1 \
--norm-epsilon 1e-06 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--normalization RMSNorm \
--add-rmsnorm-offset \
--disable-bias-linear \
--hidden-dropout 0 \
--attention-dropout 0 \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 1 \
--vocab-size 256000 \
--bf16 \
--seed 42 \
| tee logs/generate_gemma_2b.log

View File

@ -1,63 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CHECKPOINT="your model directory path"
CHECKPOINT_LORA="your lora model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--load ${CHECKPOINT} \
--num-layers 28 \
--hidden-size 3072 \
--kv-channels 256 \
--ffn-hidden-size 24576 \
--num-attention-heads 16 \
--position-embedding-type rope \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--geglu \
--input-embeds-norm \
--micro-batch-size 1 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--normalization RMSNorm \
--add-rmsnorm-offset \
--norm-epsilon 1e-06 \
--disable-bias-linear \
--hidden-dropout 0 \
--attention-dropout 0 \
--attention-softmax-in-fp32 \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--no-gradient-accumulation-fusion \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 1 \
--vocab-size 256000 \
--bf16 \
--seed 42 \
--lora-load ${CHECKPOINT_LORA} \
--lora-r 16 \
--lora-alpha 32 \
--lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \
--inference-prompt-type 'alpaca' \
| tee logs/generate_gemma_7b.log

View File

@ -1,57 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--load ${CHECKPOINT} \
--num-layers 28 \
--hidden-size 3072 \
--kv-channels 256 \
--ffn-hidden-size 24576 \
--num-attention-heads 16 \
--position-embedding-type rope \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--geglu \
--input-embeds-norm \
--micro-batch-size 1 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--normalization RMSNorm \
--add-rmsnorm-offset \
--norm-epsilon 1e-06 \
--disable-bias-linear \
--hidden-dropout 0 \
--attention-dropout 0 \
--attention-softmax-in-fp32 \
--no-load-optim \
--no-load-rng \
--no-masked-softmax-fusion \
--no-gradient-accumulation-fusion \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 1 \
--vocab-size 256000 \
--bf16 \
--seed 42 \
| tee logs/generate_gemma_7b.log

View File

@ -1,95 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=1
PP=2
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--use-mc2 \
--use-fused-rmsnorm \
--num-layers 18 \
--hidden-size 2048 \
--ffn-hidden-size 16384 \
--num-attention-heads 8 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--seq-length 8192 \
--max-position-embeddings 8192 \
--micro-batch-size 1 \
--global-batch-size 256 \
--kv-channels 256 \
--group-query-attention \
--num-query-groups 1 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--normalization RMSNorm \
--add-rmsnorm-offset \
--geglu \
--input-embeds-norm \
--use-flash-attn \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--use-distributed-optimizer \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 1000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_gemma_2b.log

View File

@ -1,95 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--use-mc2 \
--use-fused-rmsnorm \
--use-fused-rotary-pos-emb \
--num-layers 28 \
--hidden-size 3072 \
--ffn-hidden-size 24576 \
--num-attention-heads 16 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--seq-length 8192 \
--max-position-embeddings 8192 \
--micro-batch-size 2 \
--global-batch-size 64 \
--kv-channels 256 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--normalization RMSNorm \
--add-rmsnorm-offset \
--norm-epsilon 1e-06 \
--geglu \
--input-embeds-norm \
--use-flash-attn \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--vocab-size 256000 \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 1000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_gemma_7b.log

View File

@ -1,103 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
LORA_CHECKPOINT="your lora ckpt path"
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--use-mc2 \
--use-fused-rmsnorm \
--use-fused-rotary-pos-emb \
--num-layers 28 \
--hidden-size 3072 \
--ffn-hidden-size 24576 \
--num-attention-heads 16 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--seq-length 8192 \
--max-position-embeddings 8192 \
--micro-batch-size 2 \
--global-batch-size 64 \
--kv-channels 256 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--normalization RMSNorm \
--add-rmsnorm-offset \
--norm-epsilon 1e-06 \
--geglu \
--input-embeds-norm \
--use-flash-attn \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--vocab-size 256000 \
--finetune \
--is-instruction-dataset \
--lora-r 16 \
--lora-alpha 32 \
--lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 2000 \
--eval-interval 1000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--lora-load ${LORA_CHECKPOINT} \
--save ${CKPT_SAVE_DIR} \
| tee logs/tune_gemma_7b.log

View File

@ -1,88 +0,0 @@
#!/bin/bash
# Runs the "175B" parameter model in deminishing layers for single machine
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export AZUREML_EXPERIMENT_ID=0
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))
DATA_PATH="your dataset path"
VOCAB_FILE="vocab file for gpt"
MERGE_FILE="merge file for gpt"
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NUM_NODES \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
--node_rank $NODE_RANK
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--num-layers 8 \
--hidden-size 12288 \
--num-attention-heads 96 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--transformer-impl local \
--micro-batch-size 1 \
--global-batch-size 64 \
--train-iters 2000 \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--init-method-std 0.006 \
--clip-grad 1.0 \
--fp16 \
--lr 6.0e-5 \
--lr-decay-style cosine \
--min-lr 6.0e-6 \
--lr-warmup-fraction .001 \
--lr-decay-iters 430000 \
--no-load-optim \
--no-load-rng \
--no-gradient-accumulation-fusion \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--use-flash-attn \
--no-bias-gelu-fusion \
--use-mc2
"
DATA_ARGS="
--data-path $DATA_PATH
--vocab-file $VOCAB_FILE
--merge-file $MERGE_FILE
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 1
--eval-interval 5000
--eval-iters 1
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--jit-compile \
    --distributed-backend nccl 2>&1 \
| tee ./logs/pretrain_gpt3_175B_8layers.log

View File

@ -1,89 +0,0 @@
#!/bin/bash
# Runs the "175B" parameter model in full layers.
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export AZUREML_EXPERIMENT_ID=0
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NUM_NODES=16
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))
DATA_PATH="your dataset path"
VOCAB_FILE="vocab file for gpt training"
MERGE_FILE="merge file for gpt training"
TP=8
PP=16
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NUM_NODES \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
--node_rank $NODE_RANK
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--num-layers-per-virtual-pipeline-stage 2 \
--sequence-parallel \
--num-layers 96 \
--hidden-size 12288 \
--num-attention-heads 96 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--transformer-impl local \
--micro-batch-size 2 \
--global-batch-size 1024 \
--train-iters 2000 \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--init-method-std 0.006 \
--clip-grad 1.0 \
--fp16 \
--lr 6.0e-5 \
--lr-decay-style cosine \
--min-lr 6.0e-6 \
--lr-warmup-fraction .001 \
--lr-decay-iters 430000 \
--no-load-optim \
--no-load-rng \
--no-gradient-accumulation-fusion \
--no-masked-softmax-fusion \
--no-bias-gelu-fusion \
--attention-softmax-in-fp32 \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--use-flash-attn \
--overlap-grad-reduce \
--use-mc2
"
DATA_ARGS="
--data-path $DATA_PATH
--vocab-file $VOCAB_FILE
--merge-file $MERGE_FILE
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 1
--eval-interval 5000
--eval-iters 1
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--jit-compile \
    --distributed-backend nccl 2>&1 \
| tee ./logs/pretrain_gpt3_175B.log

View File

@ -1,136 +0,0 @@
# GPT3 $\color{black}{\bf\tiny{【社区贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README.md">English</a> </b>
</p>
# 目录
- [GPT3](#GPT3)
- [目录](#目录)
- [GPT3-175B](#GPT3-175B)
- [训练-175B](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
# GPT3-175B
## 训练
GPT3-175B 训练的硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 128 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir vocab_file
mkdir dataset
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装 MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其他依赖
pip install -r requirements.txt
```
3. 准备数据、词表来拉起模型
3.1 准备数据
可以从 [这里](https://huggingface.co/datasets/wikipedia/tree/main/data/20220301.en) 下载原始数据
```shell
# 下载 enwiki 数据
# 总共有 41 个文件,我们可以选择部分来制作数据
cd ./dataset
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00000-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00001-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00002-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00003-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00004-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00005-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00006-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00007-of-00041.parquet
cd ..
# 下载 vocab file 和 merge table
cd vocab_file
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cd ..
# 处理成训练数据
python ./tools/preprocess_data.py \
--input ./dataset/ \
--output-prefix ./dataset/gpt_text_sentence \
--tokenizer-type GPT2BPETokenizer \
--vocab-file ./vocab_file/gpt2-vocab.json \
--merge-file ./vocab_file/gpt2-merges.txt \
--append-eod \
--workers 4 \
--log-interval 1000
```
3.2 用 ptd 模式进行预训练
配置 GPT3-175B PTD 预训练脚本: examples/gpt3/pretrain_gpt3_175B.sh
```shell
# 请根据真实情况配置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 请根据真实存放路径配置以下参数
VOCAB_FILE="./vocab_file/gpt2-vocab.json" # 词表
MERGE_FILE="./vocab_file/gpt2-merges.txt" # BPE 合并表
DATA_PATH="./dataset/gpt_text_sentence" # 数据路径
```
拉起 GPT3-175B PTD 预训练脚本: examples/gpt3/pretrain_gpt3_175B.sh
```shell
bash examples/gpt3/pretrain_gpt3_175B.sh
```
### 性能
#### 吞吐
GPT3-175B 在 **昇腾芯片**上的性能数据:
| 设备 | 模型 | tokens吞吐 (tokens/s/p) |
| :--: | :--------: |:---------------------:|
| NPUs | GPT3-175B | 153.1 |

View File

@ -1,136 +0,0 @@
# GPT3 $\color{black}{\rm\tiny{【model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Community】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
# Contents
- [GPT3](#GPT3)
- [Contents](#contents)
- [GPT3-175B](#GPT3-175B)
- [Training-175B](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
# GPT3-175B
## Training
Here is a hardware summary of pre-training GPT3-175B:
| Hardware | Value |
| :--: | :-------------: |
| NPU | 128 x Ascend NPUs |
### Script
1. Clone repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir vocab_file
mkdir dataset
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare dataset and vocab file for pretrain
3.1 Prepare dataset
Download the GPT raw dataset from [here](https://huggingface.co/datasets/wikipedia/tree/main/data/20220301.en)
```shell
# download enwiki raw data
# There are 41 files in total; a subset of them is enough to build the dataset.
cd ./dataset
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00000-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00001-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00002-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00003-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00004-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00005-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00006-of-00041.parquet
wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00007-of-00041.parquet
cd ..
# download vocab file and merge table
cd vocab_file
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cd ..
# process formal dataset
python ./tools/preprocess_data.py \
--input ./dataset/ \
--output-prefix ./dataset/gpt_text_sentence \
--tokenizer-type GPT2BPETokenizer \
--vocab-file ./vocab_file/gpt2-vocab.json \
--merge-file ./vocab_file/gpt2-merges.txt \
--append-eod \
--workers 4 \
--log-interval 1000
```
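preprocess_data.py writes an indexed binary dataset under the chosen `--output-prefix`. A quick, illustrative check that the expected files appeared before wiring DATA_PATH into the training script (the exact file names also encode the json key and document level):
```python
import glob

# The prefix below matches --output-prefix in the preprocessing command above;
# expect a .bin/.idx pair among the results.
produced = sorted(glob.glob("./dataset/gpt_text_sentence*"))
print(produced)
```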
3.2 pre-training in ptd mode
Config GPT3-175B PTD pre-training script: examples/gpt3/pretrain_gpt3_175B.sh
```shell
# modify ascend-toolkit path according to your own config
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
VOCAB_FILE="./vocab_file/gpt2-vocab.json" # vocab file for training
MERGE_FILE="./vocab_file/gpt2-merges.txt" # BPE merge file for training
DATA_PATH="./dataset/gpt_text_sentence" # dataset path
```
Launch GPT3-175B PTD pre-training script: examples/gpt3/pretrain_gpt3_175B.sh
```shell
bash examples/gpt3/pretrain_gpt3_175B.sh
```
### Performance
#### Machine performance
The performance of GPT3-175B on **Ascend NPU**:
| Device | Model | Throughput (tokens/s/p) |
| :--: | :--------: |:---------------------:|
| NPUs | GPT3-175B | 153.1 |
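As a sanity check on the model size, here is a back-of-the-envelope parameter count from the script's num-layers 96 and hidden-size 12288, assuming the ~50k GPT-2 BPE vocabulary used above:
```python
def approx_gpt_params(num_layers, hidden_size, vocab_size, seq_length):
    # ~12 * L * h^2 covers attention (4h^2) and MLP (8h^2) weights per layer,
    # ignoring biases and layer norms; embeddings add (V + S) * h.
    transformer = 12 * num_layers * hidden_size ** 2
    embeddings = (vocab_size + seq_length) * hidden_size
    return transformer + embeddings

print(f"{approx_gpt_params(96, 12288, 50257, 2048) / 1e9:.1f}B")  # ~174.6B, i.e. the "175B" model
```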

View File

@ -1,4 +1,4 @@
# Intern-LM
# Intern-LM $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
@ -35,141 +35,145 @@ InternLM-7B 训练的硬件配置如下:
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 Internlm-7B [词表文件](https://huggingface.co/internlm/internlm-7b/tree/main)
```shell
mkdir ./model_from_hf/internlm-7b/
cd ./model_from_hf/internlm-7b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/internlm-7b/
cd ./model_from_hf/internlm-7b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. 下载 Internlm-7B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-7b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-7b/ \
--output-prefix ./dataset/internlm-7b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-7b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-7b/ \
--output-prefix ./dataset/internlm-7b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
5. 权重格式转换
将模型权重从 huggingface 格式转换为 ModelLink 可以处理的格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
将模型权重从 huggingface 格式转换为 ModelLink 可以处理的格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/internlm-7b/ \
--save-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/internlm-7b/tokenizer.model \
--add-qkv-bias \
--add-dense-bias
```
```shell
mkdir model_weights
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/internlm-7b/ \
--save-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/internlm-7b/tokenizer.model \
--add-qkv-bias \
--add-dense-bias
```
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
任意并行切分策略的Megatron权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--add-dense-bias \
--save-dir ./model_from_hf/internlm-7b/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/internlm-7b/mg2hg/
```
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--add-dense-bias \
--save-dir ./model_from_hf/internlm-7b/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/internlm-7b/mg2hg/
```
6. 配置 Internlm-7B 预训练脚本
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改数据集,词表,权重等路径
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
```
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改数据集,词表,权重等路径
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #数据集路径
```
7. 启动 Internlm-7B 预训练脚本
```shell
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
```shell
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
### 性能
@ -184,25 +188,23 @@ Internlm-7B 在 **昇腾芯片** 和 **参考芯片** 上的性能对比:
#### 推理
推理脚本: examples/intern/generate_internlm_7b_ptd.sh
推理脚本:
tasks/inference/generate_lnternlm_7b_ptd.sh
```
bash ./examples/intern/generate_internlm_7b_ptd.sh
bash ./tasks/inference/generate_lnternlm_7b_ptd.sh
```
推理举例:
![Internlm-7b-inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/intern/intern_7B_inference.png)
![Internlm-7b-inference](../../sources/images/intern/intern_7B_inference.png)
#### 评估
使用MMLU数据集评估模型。数据集[下载](https://huggingface.co/datasets/cais/mmlu)
评估脚本:
examples/intern/evaluate_internlm_7B_ptd.sh
评估脚本:
tasks/evaluation/evaluate_internlm_7B_ptd.sh
```
bash examples/intern/evaluate_internlm_7B_ptd.sh
bash tasks/evaluation/evaluate_internlm_7B_ptd.sh
```
InternLM-7B在**Ascend NPU**中的评测表现:
@ -226,102 +228,104 @@ InternLM-65B 训练的硬件配置如下:
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 [词表文件](https://huggingface.co/internlm/internlm-7b/tree/main)
```shell
mkdir ./model_from_hf/internlm-65b/
cd ./model_from_hf/internlm-65b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/internlm-65b/
cd ./model_from_hf/internlm-65b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. 下载 Internlm-65B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-65b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-65b/ \
--output-prefix ./dataset/internlm-65b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-65b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-65b/ \
--output-prefix ./dataset/internlm-65b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
5. 配置 Internlm-65B 预训练脚本
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改数据集,词表,权重等路径
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_PATH="./model_from_hf/internlm-65b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
```
```shell
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改数据集,词表,权重等路径
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #词表路径
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #数据集路径
```
6. 启动 Internlm-65B 预训练脚本
```shell
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
```shell
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
### 性能

View File

@ -1,4 +1,4 @@
# Intern-LM
# Intern-LM $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
@ -36,141 +36,144 @@ Here's a hardware summary of pre-training InternLM-7B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Download the Internlm-7B tokenizer model and file from [here](https://huggingface.co/internlm/internlm-7b/tree/main)
```shell
mkdir ./model_from_hf/internlm-7b/
cd ./model_from_hf/internlm-7b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/internlm-7b/
cd ./model_from_hf/internlm-7b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. Prepare dataset. Download the Internlm-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-7b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-7b/ \
--output-prefix ./dataset/internlm-7b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-7b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-7b/ \
--output-prefix ./dataset/internlm-7b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
5. Weights convert
In order to adapt to the internlm-7B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
In order to adapt to the internlm-7B model, the following script is used to convert the model pre-training weights.
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```shell
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/internlm-7b/ \
--save-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/internlm-7b/tokenizer.model \
--add-qkv-bias \
--add-dense-bias
```
```shell
mkdir model_weights
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/internlm-7b/ \
--save-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/internlm-7b/tokenizer.model \
--add-qkv-bias \
--add-dense-bias
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--add-dense-bias \
--save-dir ./model_from_hf/internlm-7b/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/internlm-7b/mg2hg/
```
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/internlm-7b-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--add-qkv-bias \
--add-dense-bias \
--save-dir ./model_from_hf/internlm-7b/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/internlm-7b/mg2hg/
```
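After the conversion back to HuggingFace format, the weights written to `./model_from_hf/internlm-7b/mg2hg/` can be sanity-checked by loading them with `transformers`. This is an illustrative sketch, assuming `transformers` is installed; `trust_remote_code=True` is passed because the original InternLM repository ships custom modelling code.
```python
from transformers import AutoModelForCausalLM

# Converted weights land in the mg2hg/ sub-directory, per the comment above.
model = AutoModelForCausalLM.from_pretrained(
    "./model_from_hf/internlm-7b/mg2hg/", trust_remote_code=True
)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")
```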
6. Config Internlm-7B pre-training script.
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer and checkpoint paths below according to your own setup
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
```
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer and checkpoint paths below according to your own setup
CKPT_SAVE_DIR="./ckpt/internlm-7b/"
CKPT_LOAD_DIR="./model_weights/internlm-7b-v0.1-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/internlm-7b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-7b/alpaca_text_document" #processed dataset
```
7. Launch Internlm-7B pre-training script.
```shell
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes use the distributed parameters to decide whether they need to load data, and check and generate the corresponding data cache.
```shell
bash examples/intern/pretrain_internlm_7b_ptd.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes use the distributed parameters to decide whether they need to load data, and check and generate the corresponding data cache.
### Performance
@ -185,25 +188,23 @@ The performance of Internlm-7B in **Ascend NPU** and **Reference**:
#### Inference
Inference script:
examples/intern/generate_lnternlm_7b_ptd.sh
Inference script:
tasks/inference/generate_lnternlm_7b_ptd.sh
```
bash ./examples/intern/generate_lnternlm_7b_ptd.sh
bash ./tasks/inference/generate_lnternlm_7b_ptd.sh
```
Inference case:
![Internlm-7b-inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/intern/intern_7B_inference.png)
![Internlm-7b-inference](../../sources/images/intern/intern_7B_inference.png)
#### Evaluation
Evaluate the model using the MMLU dataset ([download](https://huggingface.co/datasets/cais/mmlu)).
Evaluation script: examples/intern/evaluate_internlm_7B_ptd.sh
Evaluation script:
tasks/evaluation/evaluate_internlm_7B_ptd.sh
```
bash examples/intern/evaluate_internlm_7B_ptd.sh
bash tasks/evaluation/evaluate_internlm_7B_ptd.sh
```
The evaluation performance of InternLM-7B on **Ascend NPU**:
@ -226,102 +227,105 @@ Here's a hardware summary of pre-training InternLM-65B:
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
# install other packages
pip install -r requirements.txt
```
3. Download tokenizer model and file from [here](https://huggingface.co/internlm/internlm-7b/tree/main)
```shell
mkdir ./model_from_hf/internlm-65b/
cd ./model_from_hf/internlm-65b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
```shell
mkdir ./model_from_hf/internlm-65b/
cd ./model_from_hf/internlm-65b/
wget https://huggingface.co/internlm/internlm-7b/resolve/main/config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/generation_config.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/special_tokens_map.json
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenization_internlm.py
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer.model
wget https://huggingface.co/internlm/internlm-7b/resolve/main/tokenizer_config.json
cd ../../
```
4. Prepare dataset. Download the Internlm-65B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-65b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-65b/ \
--output-prefix ./dataset/internlm-65b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
```shell
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/internlm-65b/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/internlm-65b/ \
--output-prefix ./dataset/internlm-65b/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name AlpacaPretrainHandler \
--tokenizer-not-use-fast \
--append-eod
```
5. Config Internlm-65B pre-training script.
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer and checkpoint paths below according to your own setup
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_PATH="./model_from_hf/internlm-65b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
```
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the dataset, tokenizer and checkpoint paths below according to your own setup
CKPT_SAVE_DIR="./ckpt/internlm-65b/"
TOKENIZER_MODEL="./model_from_hf/internlm-65b/tokenizer.model" #tokenizer path
DATA_PATH="./dataset/internlm-65b/alpaca_text_document" #processed dataset
```
6. Launch Internlm-65B pre-training script.
```shell
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes use the distributed parameters to decide whether they need to load data, and check and generate the corresponding data cache.
```shell
bash examples/intern/pretrain_internlm_65b_ptd.sh
```
**Note**: For multi-machine training without shared storage across the machines, add the `--no-shared-storage` parameter. With it set, non-master nodes use the distributed parameters to decide whether they need to load data, and check and generate the corresponding data cache.
### Performance

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -89,7 +90,6 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_internlm_65B.log

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -90,6 +91,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_internlm_7b.log

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@ -86,6 +86,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save ${SAVE_CHECKPOINT_PATH} \
| tee logs/train_llama_13b.log

View File

@ -87,6 +87,5 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_llama_33b.log

View File

@ -87,7 +87,6 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save ${SAVE_CHECKPOINT_PATH} \
| tee logs/train_llama_65b.log

View File

@ -86,6 +86,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save ${SAVE_CHECKPOINT_PATH} \
| tee logs/train_llama_7b.log

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@ -1,5 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -35,8 +37,8 @@ GPT_ARGS="
--tokenizer-model ${TOKENIZER_MODEL} \
--seq-length 4096 \
--max-position-embeddings 4096 \
--micro-batch-size 4 \
--global-batch-size 512 \
--micro-batch-size 2 \
--global-batch-size 16 \
--make-vocab-size-divisible-by 1 \
--lr 1e-6 \
--train-iters 5000 \
@ -64,9 +66,6 @@ GPT_ARGS="
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
--no-load-rng \
--use-fused-swiglu \
--use-fused-rotary-pos-emb \
--use-mc2 \
--bf16
"
@ -87,6 +86,5 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--save $CKPT_SAVE_DIR \
| tee logs/train_llama2_13b.log

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -67,10 +68,7 @@ GPT_ARGS="
--load ${CKPT_LOAD_DIR} \
--no-load-optim \
--no-load-rng \
--bf16 \
--use-fused-swiglu \
--use-fused-rotary-pos-emb \
--use-mc2 \
--bf16
"
DATA_ARGS="

View File

@ -1,10 +1,11 @@
#!/bin/bash
export NPU_ASD_ENABLE=0
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=4
NNODES=8
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
@ -13,7 +14,7 @@ DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=4
PP=8
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
@ -35,7 +36,7 @@ GPT_ARGS="
--tokenizer-model ${TOKENIZER_MODEL} \
--seq-length 4096 \
--max-position-embeddings 4096 \
--micro-batch-size 1 \
--micro-batch-size 2 \
--global-batch-size 1024 \
--make-vocab-size-divisible-by 1 \
--lr 1.0e-6 \
@ -67,11 +68,7 @@ GPT_ARGS="
--group-query-attention \
--num-query-groups 8 \
--lr-warmup-fraction 0.01 \
--bf16 \
--use-fused-swiglu \
--use-fused-rotary-pos-emb \
--num-layers-per-virtual-pipeline-stage 5 \
--use-mc2 \
--bf16
"
DATA_ARGS="

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -13,8 +14,8 @@ CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=1
PP=2
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
@ -36,8 +37,8 @@ GPT_ARGS="
--tokenizer-model ${TOKENIZER_MODEL} \
--seq-length 4096 \
--max-position-embeddings 4096 \
--micro-batch-size 1 \
--global-batch-size 256 \
--micro-batch-size 4 \
--global-batch-size 16 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 5000 \
@ -64,23 +65,19 @@ GPT_ARGS="
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--use-distributed-optimizer \
--use-fused-swiglu \
--use-fused-rotary-pos-emb \
--overlap-grad-reduce \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 949,50,1
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
@ -88,7 +85,6 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--jit-compile \
--load $CKPT_LOAD_DIR \
--save $CKPT_SAVE_DIR \
| tee logs/train_llama2_7b.log

View File

@ -1,577 +0,0 @@
# LLaMA3 $\color{black}{\bf\tiny{【昇腾贡献模型】}}$
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
# 目录
- [LLaMA3](#llama3)
- [目录](#目录)
- [LLAMA3-8B](#llama3-8b)
- [训练](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理-8B](#推理-8b)
- [评估-8B](#评估-8b)
- [LLAMA3-70B](#llama3-70b)
- [训练](#训练)
- [脚本](#脚本)
- [性能](#性能)
- [吞吐](#吞吐)
- [推理-70B](#推理-70b)
- [评估-70B](#评估-70b)
# LLAMA3-8B
## 训练
LLAMA3-8B 训练的硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 8 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 LLAMA3-8B 的 [预训练权重和词表](https://huggingface.co/unsloth/llama-3-8B/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/llama-3-8b-hf/
cd ./model_from_hf/llama-3-8b-hf/
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/config.json
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/generation_config.json
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/model-00001-of-00004.safetensors
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/model-00002-of-00004.safetensors
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/model-00003-of-00004.safetensors
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/model-00004-of-00004.safetensors
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/model.safetensors.index.json
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/special_tokens_map.json
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/tokenizer.json
wget https://huggingface.co/unsloth/llama-3-8B/raw/main/tokenizer_config.json
cd ../../
```
4. 权重转换
4.1 将权重从 huggingface 格式转化为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```bash
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 权重格式转换
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/llama-3-8b-hf/ \
--save-dir ./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/llama-3-8b-hf/tokenizer.json
```
4.2 任意并行切分策略的 Megatron 权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/llama-3-8b-hf/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/llama-3-8b-hf/mg2hg/
```
权重转换适用于预训练、微调、推理和评估,根据任务不同调整参数 `target-tensor-parallel-size``target-pipeline-parallel-size`
5. 预训练
5.1 准备数据集
下载 LLaMA3-8B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/blob/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/llama-3-8b-hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-8b-hf/ \
--output-prefix ./dataset/llama-3-8b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 预训练
配置llama3-8B 预训练脚本: examples/llama3/pretrain_llama3_8b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
CKPT_SAVE_DIR="./ckpt/llama-3-8b-hf/"
TOKENIZER_MODEL="./model_from_hf/llama-3-8b-hf/" #词表路径
DATA_PATH="./dataset/llama-3-8b-hf/alpaca_text_document" #数据集路径
CKPT_LOAD_DIR="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/" #权重路径
```
多机运行增加参数--overlap-grad-reduce
启动 LLaMA3-8B 预训练脚本: examples/llama3/pretrain_llama3_8b_ptd.sh
```shell
bash examples/llama3/pretrain_llama3_8b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/llama-3-8b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-8b-hf/ \
--output-prefix ./finetune_dataset/llama-3-8b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数--is-instruction-dataset*
增加微调参数--finetune增加预训练权重加载参数--load使微调从第一步开始。更改为以下参数
```bash
DATA_PATH="./finetune_dataset/llama-3-8b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/"
CKPT_PATH="./ckpt/llama-3-8b-hf/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
### 性能
#### 吞吐
LLaMA3-8B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数 | tokens吞吐 (tokens/s/p) |
| :--: | :-------: | :----: | :---------------------: |
| NPUs | LLaMA3-8B | 1000 | 2483 |
| 参考 | LLaMA3-8B | 1000 | 2674 |
## 推理-8B
配置llama3-8B 推理脚本: examples/llama3/generate_llama3_8b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1"
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/"
```
启动llama3-8B 推理脚本
```bash
bash examples/llama3/generate_llama3_8b_ptd.sh
```
**如果想要使用大模型聊天功能请下载对应的Instruct/Chat权重并转换为Megatron格式配置相关路径
启动聊天脚本: examples/llama3/generate_llama3_8b_chat_ptd.sh**
LLAMA3-8B-Instruct 权重[下载](https://huggingface.co/unsloth/llama-3-8b-Instruct)
聊天示例如下:
```
You >> 你好
ModelLink:
你好!我是你的中文语言模型,很高兴和你交流!有什么想聊的题目或问题,请随时问我!
You >> 介绍一下llama3
ModelLink:
LLaMA 3 is a large-scale language model developed by Meta AI, a leading artificial intelligence research organization. It's a significant upgrade to the previous LLaMA models, offering improved performance and capabilities.
Here are some key features and facts about LLaMA 3:
Scale: LLaMA 3 is a massive model, consisting of 13.5 billion parameters, making it one of the largest language models in the world.
Architecture: It's based on the Transformer architecture, which is a type of neural network designed specifically for natural language processing tasks.
Training data: LLaMA 3 was trained on a massive dataset of text from the internet, including books, articles, and websites. This training data allows the model to learn about various topics, styles, and language patterns.
Language understanding: LLaMA 3 is capable of understanding natural language, including nuances, idioms, and figurative language. It can also recognize and generate text in multiple languages.
Conversational capabilities: The model can engage in conversations, answering questions, and responding to prompts. It can also generate text based on a given topic or theme.
Creative writing: LLaMA
You >> 能用中文回答吗
ModelLink:
LLaMA 3 是 Meta AI 开发的一种大规模语言模型,是之前 LLama 模型的升级版本,提供了改进的性能和能力。
以下是 LLama 3 的一些关键特点和事实:
规模LLaMA 3 是一个庞大的模型,具有 13.5 亿参数,世界上最大的语言模型之一。
架构:它基于 Transformer 架构,这是一种专门为自然语言处理任务设计的神经网络。
训练数据LLaMA 3 在互联网上的大量文本数据上进行了训练,包括书籍、文章和网站。这使得模型能够学习各种主题、风格和语言模式。
语言理解LLaMA 3 能够理解自然语言,包括 nuance、idiom 和 figurative language。它也能识别和生成多种语言的文本。
对话能力:模型能够参与对话,回答问题和响应提示。它也能基于给定的主题或主题生成文本。
创作写作LLa
```
## 评估-8B
使用 MMLU数据集评估模型. 数据集下载路径 [这里](https://huggingface.co/datasets/cais/mmlu).
配置llama3-8B 评估脚本: examples/llama3/evaluate_llama3_8b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/" #词表路径
CHECKPOINT="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1" #模型路径
# 配置任务和数据集路径
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
启动评估
```bash
bash examples/llama3/evaluate_llama3_8b_ptd.sh
```
评估结果如下
| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
| :----: | :------: | :------: | :--------: | :-------: |
| MMLU | 57 | 14042 | 0.666 | 0.653 |
# LLAMA3-70B
## 训练
LLAMA3-70B 训练的硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 64 x Ascend NPUs |
### 脚本
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 LLAMA3-70B 的 [预训练权重和词表](https://huggingface.co/v2ray/Llama-3-70B/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/llama-3-70b-hf/
cd ./model_from_hf/llama-3-70b-hf/
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/config.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/generation_config.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00001-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00002-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00003-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00004-of-00030.safetensors
...
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00030-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model.safetensors.index.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/special_tokens_map.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/tokenizer.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/tokenizer_config.json
cd ../../
```
4. 权重转换
4.1 将权重从 huggingface 格式转化为 megatron 格式
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```bash
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 权重格式转换
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 8 \
--load-dir ./model_from_hf/llama-3-70b-hf/ \
--save-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/ \
--tokenizer-model ./model_from_hf/llama-3-70b-hf/tokenizer.json
```
4.2 任意并行切分策略的 Megatron 权重 格式转化为 HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```shell
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/llama-3-70b-hf/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/llama-3-70b-hf/mg2hg/
```
权重转换适用于预训练、微调、推理和评估,根据任务不同调整参数 `target-tensor-parallel-size``target-pipeline-parallel-size`
5. 预训练
5.1 准备数据集
下载 LLaMA3-70B [数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/llama-3-70b-hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-70b-hf/ \
--output-prefix ./dataset/llama-3-70b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 预训练
配置llama3-70B 预训练脚本: examples/llama3/pretrain_llama3_70b_ptd.sh
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
CKPT_SAVE_DIR="./ckpt/llama-3-70b-hf/"
TOKENIZER_MODEL="./model_from_hf/llama-3-70b-hf/" #词表路径
DATA_PATH="./dataset/llama-3-70b-hf/alpaca_text_document" #数据集路径
CKPT_LOAD_DIR="./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/" #权重路径
```
多机运行增加参数--overlap-grad-reduce
启动 LLaMA3-70B 预训练脚本: examples/llama3/pretrain_llama3_70b_ptd.sh
```shell
bash examples/llama3/pretrain_llama3_70b_ptd.sh
```
**注意**:如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加`--no-shared-storage`参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据。
6. 微调
6.1 准备微调数据集
下载微调数据集 [这里](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/llama-3-70b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-70b-hf/ \
--output-prefix ./finetune_dataset/llama-3-70b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 全参微调
全参微调的配置脚本基本和预训练脚本一致. *区别是数据集,以及增加训练参数--is-instruction-dataset*
增加微调参数--finetune增加预训练权重加载参数--load使微调从第一步开始。更改为以下参数
```bash
DATA_PATH="./finetune_dataset/llama-3-70b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/"
CKPT_PATH="./ckpt/llama-3-70b-hf/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
### 性能
#### 吞吐
LLaMA3-70B 在 **昇腾芯片****参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数 | tokens吞吐 (tokens/s/p) |
| :--: | :-------: | :----: | :---------------------: |
| NPUs | LLaMA3-70B | 1000 | 283 |
| 参考 | LLaMA3-70B | - | 355 |
## 推理-70B
配置llama3-70B 推理脚本: examples/llama3/generate_llama3_70b_ptd.sh
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/llama-3-70b-hf-v0.1-tp8-pp1"
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/"
```
启动llama3-70B 推理脚本
```bash
bash examples/llama3/generate_llama3_70b_ptd.sh
```
## 评估-70B
使用 MMLU数据集评估模型. 数据集下载路径 [这里](https://huggingface.co/datasets/cais/mmlu).
配置llama3-70B 评估脚本: examples/llama3/evaluate_llama3_70b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/" #词表路径
CHECKPOINT="./model_weights/llama-3-70b-hf-v0.1-tp8-pp1" #模型路径
# 配置任务和数据集路径
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
启动评估
```bash
bash examples/llama3/evaluate_llama3_70b_ptd.sh
```
评估结果如下
| 数据集 | 总学科数 | 总问题数 | 参考准确率 | NPU准确率 |
| :----: | :------: | :------: | :--------: | :-------: |
| MMLU | 57 | 14042 | 0.795 | 0.783 |

View File

@ -1,608 +0,0 @@
# LLaMA3 $\color{black}{\rm\tiny{【model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
</p>
# Contents
- [LLaMA3](#llama)
- [Contents](#contents)
- [LLAMA3-8B](#llama3-8b)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference-8B](#inference-8b)
- [Evaluation-8B](#evaluation-8b)
- [Contents](#contents)
- [LLAMA3-70B](#llama3-70b)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Inference-70B](#inference-70b)
- [Evaluation-70B](#evaluation-70b)
# LLAMA3-8B
## Training
Here's a hardware summary of pre-training LLAMA3-8B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
*Note that if you want to train with weights from huggingface, first fix a DeepSpeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`*
```text
# original deepspeed/runtime/engine.py, about #Lines2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
return False
# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
return False
```
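One way to apply this change from the shell, assuming DeepSpeed is installed in the active environment (the `sed` expression below is only a sketch of the textual edit; keep the backup in case the pattern matches more than the intended line):
```bash
# Locate the installed engine.py and apply the one-line change described above.
ENGINE_PY=$(python -c "import deepspeed, os; print(os.path.join(os.path.dirname(deepspeed.__file__), 'runtime', 'engine.py'))")
cp "$ENGINE_PY" "$ENGINE_PY.bak"
sed -i 's/if zero_sd_list is None:/if zero_sd_list is None or len(zero_sd_list) == 0:/' "$ENGINE_PY"
```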
3. Prepare pretrained weights and tokenizer
Download the LLAMA3-8B checkpoint from [here](https://huggingface.co/unsloth/llama-3-8B/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/llama-3-8b-hf/
cd ./model_from_hf/llama-3-8b-hf/
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/config.json
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/generation_config.json
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/model-00001-of-00004.safetensors
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/model-00002-of-00004.safetensors
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/model-00003-of-00004.safetensors
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/model-00004-of-00004.safetensors
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/model.safetensors.index.json
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/special_tokens_map.json
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/tokenizer.json
wget https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/tokenizer_config.json
cd ../../
```
4. weight conversion in ptd mode
*Note that if you want to use the weight from huggingface, please run the weight conversion script first. The following uses llama-3-8b model weight conversion in ptd as an example.*
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert to ptd weights
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1 \
--load-dir ./model_from_hf/llama-3-8b-hf/ \
--save-dir ./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/ \
--tokenizer-model ./model_from_hf/llama-3-8b-hf/tokenizer.json
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/llama-3-8b-hf/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/llama-3-8b-hf/mg2hg/
```
Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters `target-tensor-parallel-size` and `target-pipeline-parallel-size` according to different tasks.
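For example, a hedged sketch of re-running the HuggingFace-to-Megatron conversion with a hypothetical tp4/pp2 layout (the target sizes and output directory name are illustrative; everything else matches the conversion command above):
```bash
# Illustrative only: same converter, different target parallel layout.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 4 \
    --target-pipeline-parallel-size 2 \
    --load-dir ./model_from_hf/llama-3-8b-hf/ \
    --save-dir ./model_weights/llama-3-8b-hf-v0.1-tp4-pp2/ \
    --tokenizer-model ./model_from_hf/llama-3-8b-hf/tokenizer.json
```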
5. pre-training
5.1 Prepare dataset
Download the LLAMA3-8B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/llama-3-8b-hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-8b-hf/ \
--output-prefix ./dataset/llama-3-8b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
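A quick sanity check, assuming `--output-prefix` produces the usual Megatron `.bin`/`.idx` pair that the later `DATA_PATH` (`./dataset/llama-3-8b-hf/alpaca_text_document`) expects:
```bash
# Expect alpaca_text_document.bin and alpaca_text_document.idx under the output directory.
ls -lh ./dataset/llama-3-8b-hf/
```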
5.2 pre-training using ptd mode
Config LLAMA3-8B pre-training script: examples/llama3/pretrain_llama3_8b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/llama-3-8b-hf/"
TOKENIZER_MODEL="./model_from_hf/llama-3-8b-hf/" #tokenizer path
DATA_PATH="./dataset/llama-3-8b-hf/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1/" #weight path
```
Multi-machine training requires the addition of parameter --overlap-grad-reduce
Launch LLAMA3-8B pre-training script: examples/llama3/pretrain_llama3_8b_ptd.sh
```shell
bash examples/llama3/pretrain_llama3_8b_ptd.sh
```
**Note**: If using multi-machine training without a shared-storage configuration across the machines, it is necessary to add the parameter `--no-shared-storage`. This parameter determines, based on the distributed parameters, whether non-master nodes need to load data, and checks the corresponding cache and generated data.
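For a multi-machine run, the per-node settings can be sketched as below; this assumes the same `DISTRIBUTED_ARGS` layout used by the pretraining scripts in this repository, and the master IP address is a placeholder:
```bash
# Hypothetical two-node example: run the launch script on every node with its own NODE_RANK.
GPUS_PER_NODE=8
MASTER_ADDR=192.168.0.1      # placeholder: IP of the rank-0 node
MASTER_PORT=6000
NNODES=2
NODE_RANK=0                  # set to 1 on the second node
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# then extend the training arguments as described above:
#   --overlap-grad-reduce --no-shared-storage
```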
6. fine-tuning
6.1 Prepare fine-tuning dataset
Download the LLAMA3-8B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/llama-3-8b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-8b-hf/ \
--output-prefix ./finetune_dataset/llama-3-8b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_llama3_8b_ptd.sh. *The differences are the dataset and the added training parameter `--is-instruction-dataset`.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
DATA_PATH="./finetune_dataset/llama-3-8b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/"
CKPT_PATH="./ckpt/llama-3-8b-hf/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
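A minimal sketch of how these lines fit into a copy of pretrain_llama3_8b_ptd.sh; `DISTRIBUTED_ARGS`, `GPT_ARGS`, `DATA_ARGS`, `OUTPUT_ARGS` and `CKPT_SAVE_DIR` come from that script, and the log file name is illustrative:
```bash
# Illustrative fine-tuning launch: point DATA_PATH at the instruction dataset
# (as set above) and add the fine-tuning flags to the pretraining command.
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --load ${CKPT_PATH} \
    --finetune \
    --is-instruction-dataset \
    --tokenizer-not-use-fast \
    --distributed-backend nccl \
    --save ${CKPT_SAVE_DIR} \
    | tee logs/tune_llama3_8b.log
```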
### Performance
#### Machine performance
The performance of LLaMA3-8B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (tokens/s/p) |
| :------: | :-----------: |:-------------------: | :-------------------------: |
| NPUs | LLaMA3-8B | 1000 | 2483 |
| Reference | LLaMA3-8B | 1000 | 2674 |
## Inference-8B
Config llama3-8B inference script: examples/llama3/generate_llama3_8b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1"
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/"
```
Launch llama3-8B inference script: examples/llama3/generate_llama3_8b_ptd.sh
```bash
bash examples/llama3/generate_llama3_8b_ptd.sh
```
**When you want to chat with Llama3, please download the Instruct/Chat weights and convert them to Megatron format. Then launch the chat script: examples/llama3/generate_llama3_8b_chat_ptd.sh**
Download the LLAMA3-8B-Instruct weight from [here](https://huggingface.co/unsloth/llama-3-8b-Instruct)
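The Instruct checkpoint is converted with the same tool and layout as the base model; a hedged example follows (the local directory names are illustrative):
```bash
# Illustrative: convert the Instruct weights to tp8/pp1, then point CHECKPOINT in
# generate_llama3_8b_chat_ptd.sh at the converted directory.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 1 \
    --load-dir ./model_from_hf/llama-3-8b-instruct-hf/ \
    --save-dir ./model_weights/llama-3-8b-instruct-hf-v0.1-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/llama-3-8b-instruct-hf/tokenizer.json
```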
Chat example
```
You >> 你好
ModelLink:
你好!我是你的中文语言模型,很高兴和你交流!有什么想聊的题目或问题,请随时问我!
You >> 介绍一下llama3
ModelLink:
LLaMA 3 is a large-scale language model developed by Meta AI, a leading artificial intelligence research organization. It's a significant upgrade to the previous LLaMA models, offering improved performance and capabilities.
Here are some key features and facts about LLaMA 3:
Scale: LLaMA 3 is a massive model, consisting of 13.5 billion parameters, making it one of the largest language models in the world.
Architecture: It's based on the Transformer architecture, which is a type of neural network designed specifically for natural language processing tasks.
Training data: LLaMA 3 was trained on a massive dataset of text from the internet, including books, articles, and websites. This training data allows the model to learn about various topics, styles, and language patterns.
Language understanding: LLaMA 3 is capable of understanding natural language, including nuances, idioms, and figurative language. It can also recognize and generate text in multiple languages.
Conversational capabilities: The model can engage in conversations, answering questions, and responding to prompts. It can also generate text based on a given topic or theme.
Creative writing: LLaMA
You >> 能用中文回答吗
ModelLink:
LLaMA 3 是 Meta AI 开发的一种大规模语言模型,是之前 LLama 模型的升级版本,提供了改进的性能和能力。
以下是 LLama 3 的一些关键特点和事实:
规模LLaMA 3 是一个庞大的模型,具有 13.5 亿参数,世界上最大的语言模型之一。
架构:它基于 Transformer 架构,这是一种专门为自然语言处理任务设计的神经网络。
训练数据LLaMA 3 在互联网上的大量文本数据上进行了训练,包括书籍、文章和网站。这使得模型能够学习各种主题、风格和语言模式。
语言理解LLaMA 3 能够理解自然语言,包括 nuance、idiom 和 figurative language。它也能识别和生成多种语言的文本。
对话能力:模型能够参与对话,回答问题和响应提示。它也能基于给定的主题或主题生成文本。
创作写作LLa
```
## Evaluation-8B
We use MMLU benchmark to evaluate our model. Benchmark Download [here](https://huggingface.co/datasets/cais/mmlu).
Config llama3-8B evaluation script: examples/llama3/evaluate_llama3_8b_ptd.sh
```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
TOKENIZER_PATH="./model_from_hf/llama-3-8b-hf/" #tokenizer path
CHECKPOINT="./model_weights/llama-3-8b-hf-v0.1-tp8-pp1" #model path
# configure task and data path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
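The evaluation expects the extracted benchmark under `DATA_PATH`; a quick check, assuming the usual MMLU layout of one `*_test.csv` file per subject, is:
```bash
# Sanity check (layout assumed): 57 subjects should give 57 test CSV files.
ls ./mmlu/data/test/ | head
ls ./mmlu/data/test/*_test.csv | wc -l
```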
Launch llama3-8B evaluation script:
```bash
bash examples/llama3/evaluate_llama3_8b_ptd.sh
```
Evaluation results
| dataset | subject_num | question_num | reference_acc |NPU acc|
|:---:|:-----------:|:------------:|:-------------:|:---:|
| MMLU | 57 | 14042 | 0.666 |0.653|
# LLAMA3-70B
## Training
Here's a hardware summary of pre-training LLAMA3-70B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 64 x Ascend NPUs |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Build environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
*Note that if you want to train with weights from huggingface, first fix a DeepSpeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`*
```text
# original deepspeed/runtime/engine.py, about #Lines2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
return False
# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
return False
```
3. Prepare pretrained weights and tokenizer
Download the LLAMA3-70B checkpoint from [here](https://huggingface.co/v2ray/Llama-3-70B/tree/main)
```shell
#!/bin/bash
mkdir ./model_from_hf/llama-3-70b-hf/
cd ./model_from_hf/llama-3-70b-hf/
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/config.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/generation_config.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00001-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00002-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00003-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00004-of-00030.safetensors
...
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model-00030-of-00030.safetensors
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/model.safetensors.index.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/special_tokens_map.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/tokenizer.json
wget https://huggingface.co/v2ray/Llama-3-70B/raw/main/tokenizer_config.json
cd ../../
```
4. weight conversion in ptd mode
*Note that if you want to use the weight from huggingface, please run the weight conversion script first. The following uses llama-3-70b model weight conversion in ptd as an example.*
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert to ptd weights
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 8 \
--load-dir ./model_from_hf/llama-3-70b-hf/ \
--save-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/ \
--tokenizer-model ./model_from_hf/llama-3-70b-hf/tokenizer.json
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--save-dir ./model_from_hf/llama-3-70b-hf/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/llama-3-70b-hf/mg2hg/
```
Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters `target-tensor-parallel-size` and `target-pipeline-parallel-size` according to different tasks.
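Note that the conversion above produces a tp8/pp8 checkpoint, while the inference and evaluation sections below reference a tp8-pp1 directory. A hedged sketch of re-slicing is shown here; it assumes the converter also supports Megatron-to-Megatron conversion when `--save-model-type` is omitted:
```bash
# Illustrative re-slice from tp8/pp8 to tp8/pp1 (Megatron-to-Megatron support assumed).
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader megatron \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 1 \
    --load-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/ \
    --save-dir ./model_weights/llama-3-70b-hf-v0.1-tp8-pp1/
```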
5. pre-training
5.1 Prepare dataset
Download the LLAMA3-70B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/llama-3-70b-hf/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-70b-hf/ \
--output-prefix ./dataset/llama-3-70b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5.2 pre-training using ptd mode
Config LLAMA3-70B pre-training script: examples/llama3/pretrain_llama3_70b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/llama-3-70b-hf/"
TOKENIZER_MODEL="./model_from_hf/llama-3-70b-hf/" #tokenizer path
DATA_PATH="./dataset/llama-3-70b-hf/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/llama-3-70b-hf-v0.1-tp8-pp8/" #weight path
```
Multi-machine training requires the addition of parameter --overlap-grad-reduce
Launch LLAMA3-70B pre-training script: examples/llama3/pretrain_llama3_70b_ptd.sh
```shell
bash examples/llama3/pretrain_llama3_70b_ptd.sh
```
**Note**: If using multi-machine training without a shared-storage configuration across the machines, it is necessary to add the parameter `--no-shared-storage`. This parameter determines, based on the distributed parameters, whether non-master nodes need to load data, and checks the corresponding cache and generated data.
6. fine-tuning
6.1 Prepare fine-tuning dataset
Download the LLAMA3-70B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./finetune_dataset/llama-3-70b-hf/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/llama-3-70b-hf/ \
--output-prefix ./finetune_dataset/llama-3-70b-hf/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF \
--handler-name GeneralInstructionHandler \
--append-eod
```
6.2 Full Parameters Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as pretrain_llama3_70b_ptd.sh. *The differences are the dataset and the added training parameter `--is-instruction-dataset`.*
Add the fine-tuning parameter `--finetune` so that fine-tuning starts from the first step.
```bash
DATA_PATH="./finetune_dataset/llama-3-70b-hf/alpaca"
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/"
CKPT_PATH="./ckpt/llama-3-70b-hf/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset \
--tokenizer-not-use-fast \
```
### Performance
#### Machine performance
The performance of LLaMA3-70B in **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (tokens/s/p) |
| :------: | :-----------: |:-------------------: | :-------------------------: |
| NPUs | LLaMA3-70B | 1000 | 283 |
| Reference | LLaMA3-70B | - | 355 |
## Inference-70B
Config llama3-70B inference script: examples/llama3/generate_llama3_70b_ptd.sh
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/llama-3-70b-hf-v0.1-tp8-pp1"
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/"
```
Launch llama3-70B inference script: examples/llama3/generate_llama3_70b_ptd.sh
```bash
bash examples/llama3/generate_llama3_70b_ptd.sh
```
## Evaluation-70B
We use MMLU benchmark to evaluate our model. Benchmark Download [here](https://huggingface.co/datasets/cais/mmlu).
Config llama3-70B evaluation script: examples/llama3/evaluate_llama3_70b_ptd.sh
```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify script model path and tokenizer path
TOKENIZER_PATH="./model_from_hf/llama-3-70b-hf/" #tokenizer path
CHECKPOINT="./model_weights/llama-3-70b-hf-v0.1-tp8-pp1" #model path
# configure task and data path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
Launch llama3-70B evaluation script:
```bash
bash examples/llama3/evaluate_llama3_70b_ptd.sh
```
Evaluation results
| dataset | subject_num | question_num | reference_acc |NPU acc|
|:---:|:-----------:|:------------:|:-------------:|:-------:|
| MMLU | 57 | 14042 | 0.795 | 0.783 |

View File

@ -1,60 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
# modify script model path and tokenizer path
TOKENIZER_PATH="your tokenizer directory path"
CHECKPOINT="your model directory path"
# configure task and data path
DATA_PATH="/../mmlu/test/"
TASK="mmlu"
# distributed config
MASTER_ADDR=localhost
MASTER_PORT=6011
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--load ${CHECKPOINT} \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--max-new-tokens 1 \
--evaluation-batch-size 1 \
--micro-batch-size 1 \
--use-fused-rmsnorm \
--no-masked-softmax-fusion \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--num-layers 80 \
--hidden-size 8192 \
--ffn-hidden-size 28672 \
--num-attention-heads 64 \
--group-query-attention \
--num-query-groups 8 \
--swiglu \
--disable-bias-linear \
--position-embedding-type rope \
--rotary-base 500000 \
--normalization RMSNorm \
--untie-embeddings-and-output-weights \
--make-vocab-size-divisible-by 16032 \
--bf16 \
--seed 42 | tee logs/evaluation_llama3_70b_${TASK}.log

View File

@ -1,60 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
# modify script model path and tokenizer path
TOKENIZER_PATH="your tokenizer directory path"
CHECKPOINT="your model directory path"
# configure task and data path
DATA_PATH="/../mmlu/test/"
TASK="mmlu"
# distributed config
MASTER_ADDR=localhost
MASTER_PORT=6011
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--load ${CHECKPOINT} \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--max-new-tokens 1 \
--evaluation-batch-size 1 \
--micro-batch-size 1 \
--use-fused-rmsnorm \
--no-masked-softmax-fusion \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--swiglu \
--disable-bias-linear \
--position-embedding-type rope \
--rotary-base 500000 \
--normalization RMSNorm \
--untie-embeddings-and-output-weights \
--make-vocab-size-divisible-by 16032 \
--bf16 \
--seed 42 | tee logs/evaluation_llama3_8b_${TASK}.log

View File

@ -1,59 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export WITHOUT_JIT_COMPILE=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--use-fused-swiglu \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--load ${CHECKPOINT} \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--num-layers 80 \
--hidden-size 8192 \
--ffn-hidden-size 28672 \
--position-embedding-type rope \
--rotary-base 500000 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--group-query-attention \
--num-query-groups 8 \
--micro-batch-size 1 \
--num-attention-heads 64 \
--swiglu \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--hidden-dropout 0 \
--attention-dropout 0 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 16032 \
--bf16 \
--seed 42 \
| tee logs/generate_llama3_70b.log

View File

@ -1,64 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export WITHOUT_JIT_COMPILE=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--task chat \
--hf-chat-template \
--add-eos-token '<|eot_id|>' \
--top-p 0.9 \
--temperature 0.6 \
--use-fused-swiglu \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--load ${CHECKPOINT} \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--position-embedding-type rope \
--rotary-base 500000 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--group-query-attention \
--num-query-groups 8 \
--micro-batch-size 1 \
--num-attention-heads 32 \
--swiglu \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--hidden-dropout 0 \
--attention-dropout 0 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 16032 \
--bf16 \
--seed 42 \
| tee logs/generate_llama3_8b.log

View File

@ -1,59 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export WITHOUT_JIT_COMPILE=1
# please fill these path configurations
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS inference.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--use-fused-swiglu \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--load ${CHECKPOINT} \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--position-embedding-type rope \
--rotary-base 500000 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--max-new-tokens 256 \
--group-query-attention \
--num-query-groups 8 \
--micro-batch-size 1 \
--num-attention-heads 32 \
--swiglu \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--hidden-dropout 0 \
--attention-dropout 0 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--make-vocab-size-divisible-by 16032 \
--bf16 \
--seed 42 \
| tee logs/generate_llama3_8b.log

View File

@ -1,96 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=8
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=8
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--micro-batch-size 1 \
--global-batch-size 512 \
--sequence-parallel \
--use-flash-attn \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--use-fused-rmsnorm \
--use-fused-swiglu \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--num-layers 80 \
--hidden-size 8192 \
--ffn-hidden-size 28672 \
--num-attention-heads 64 \
--group-query-attention \
--num-query-groups 8 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--make-vocab-size-divisible-by 16032 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--rotary-base 500000 \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--swiglu \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--lr 1.25e-6 \
--train-iters 1000 \
--lr-decay-style cosine \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 10000 \
--eval-interval 10000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_llama3_70b.log

View File

@ -1,96 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--micro-batch-size 2 \
--global-batch-size 64 \
--sequence-parallel \
--use-flash-attn \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--use-fused-rmsnorm \
--use-fused-swiglu \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--make-vocab-size-divisible-by 16032 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--rotary-base 500000 \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--swiglu \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--lr 1.25e-6 \
--train-iters 2000 \
--lr-decay-style cosine \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 10000 \
--eval-interval 10000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_llama3_8b.log

View File

@ -1,301 +0,0 @@
# Mistral-7B-32K【昇腾贡献模型】
<p align="left">
<b>简体中文</b> |
<b><a href="README_en.md">English</a> </b>
</p>
# 目录
- [Mistral](#mistral)
- [目录](#目录)
- [7B](#7B-32K)
- [硬件要求](#硬件要求)
- [准备工作](#准备工作)
- [模型训练](#模型训练)
- [模型性能](#模型性能)
- [吞吐](#吞吐)
- [模型推理](#模型推理)
- [模型评估](#模型评估)
# 7B-32K
## 硬件要求
训练的推荐硬件配置:
| 硬件 | 配置 |
| :--: | :--------------: |
| NPU | 8 x Ascend NPUs |
推理的推荐硬件配置:
| 硬件 | 配置 |
| :--: | :-------------: |
| NPU | 8 x Ascend NPUs |
## 准备工作
1. 克隆仓库到本地服务器
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. 搭建环境
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# 安装 torch 和 torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# 安装其余依赖库
pip install -r requirements.txt
```
3. 下载 Mistral-7B 的 [预训练权重和词表](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/tree/main)*建议仅下载使用safetensors格式的权重*
```shell
#!/bin/bash
cd ./model_from_hf/
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
cd ..
```
4. 权重转换
HuggingFace权重 --> 任意并行切分策略的Megatron权重
***该场景一般用于使能开源的HuggingFace模型在Megatron上进行训练***
```bash
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# HF 转 tp8-pp1
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--load-dir ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--save-dir ./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Mistral-7B-Instruct-v0.2/tokenizer.model \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1
```
任意并行切分策略的Megatron权重 --> HuggingFace权重
***该场景一般用于将训练好的megatron模型重新转回HuggingFace格式***
```bash
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp8-pp1 转 HF
python tools/checkpoint/convert_ckpt.py \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/ \
--save-dir ./model_from_hf/Mistral-7B-Instruct-v0.2/ # <-- 需要填入原始HF模型路径新权重会存于./model_from_hf/Mistral-7B-Instruct-v0.2/mg2hg/
```
## 模型训练
准备数据集
下载Alpaca[数据集](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# 下载数据
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# 处理数据
mkdir ./dataset/Mistral-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--output-prefix ./dataset/Mistral-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
配置 Mistral-7B 预训练脚本:***examples/mistral/pretrain_mistral_7b_ptd.sh***
```shell
# 设置 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 根据实际情况配置词表、数据集、模型参数保存路径
DATA_PATH="./dataset/Mistral-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
CKPT_SAVE_DIR="./ckpt/Mistral-7B-Instruct-v0.2-tp8-pp1/"
# 根据分布式集群实际情况配置分布式参数
GPUS_PER_NODE=8
MASTER_ADDR="your master node IP"
MASTER_PORT=6000
NNODES=1
NODE_RANK="current node id"
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))
# 训练并行策略
TP=8
PP=1
```
启动 Mistral-7B 预训练脚本: ***examples/pretrain_mistral_7b_ptd.sh***
```shell
bash examples/mistral/pretrain_mistral_7b_ptd.sh
```
**注意**
1. 如果使用多机训练,且没有设置数据共享,需要在训练启动脚本中增加 `--no-shared-storage` 参数设置此参数之后将会根据分布式参数判断非主节点是否需要load数据并检查相应缓存和生成数据该参数的添加方式参见下方示例
2. pretrain_mistral_7b_ptd.sh 脚本里的训练超参需要根据实际情况调整例如global-batch-size在预训练中需要设置得更大比如256才能达到更好的效果。
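注意 1 中的 `--no-shared-storage` 直接追加到 pretrain_mistral_7b_ptd.sh 的 GPT_ARGS 即可,示意如下:

```bash
# 非共享存储的多机训练:为 GPT_ARGS 追加 --no-shared-storage示意写法
GPT_ARGS="$GPT_ARGS \
    --no-shared-storage"
```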
微调
下载微调数据集 [这里](https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese/blob/main/Alpaca_data_gpt4_zh.jsonl)
```shell
# 下载数据集
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese/blob/main/Alpaca_data_gpt4_zh.jsonl
cd ..
# 处理微调数据集
mkdir ./finetune_dataset/Mistral-7B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/Alpaca_data_gpt4_zh.jsonl \
--output-prefix ./finetune_dataset/Mistral-7B/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--append-eod \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler \
--workers 4
```
指令微调
微调的配置脚本与预训练脚本基本一致。*区别是数据集,以及增加训练参数 `--is-instruction-dataset`。*
增加微调参数 `--finetune`,并增加预训练权重加载参数 `--load`,使微调从第一步开始:
```bash
DATA_PATH="./finetune_dataset/Mistral-7B/alpaca"
CKPT_PATH="./ckpt/Mistral-7B-Instruct-v0.2-tp8-pp1/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset
```
## 模型性能
### 吞吐
Mistral-7B-32K(**开启SWA 4096**)在单机8卡上(tp8 pp1) **昇腾芯片** 和 **参考芯片** 上的性能对比:
| 设备 | 模型 | 迭代数 | 样本吞吐 (samples/s) | tokens吞吐 (tokens/s/p) | 单步迭代时间 (s/step) | 显存占用/p |
| :--: | :----------: | :----: | :---------------------: | :---------------------: | :-------------------: | :-------------------: |
| NPUs | Mistral-7B 32K | 1000 | 0.69 | 2806 | 46.7 | ~44642MB |
| 参考 | Mistral-7B 32K | 1000 | 0.67 | 2734 | 48.0 | ~65500MB |
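表中吞吐可按预训练脚本中的配置seq-length 32768、global-batch-size 32、8 卡)与单步耗时粗略换算,下面的命令仅用于演示这一换算关系:

```bash
# 按 seq-length=32768、global-batch-size=32、8 卡、46.7 s/step 估算吞吐
awk 'BEGIN {
    seq = 32768; gbs = 32; npus = 8; step = 46.7
    printf "samples/s=%.2f  tokens/s/p=%.0f\n", gbs/step, seq*gbs/(step*npus)
}'
# 预期输出约为 samples/s=0.69  tokens/s/p=2807与上表数据基本一致
```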
## 模型推理
首先需要配置推理脚本: ***examples/mistral/generate_mistral_7b_ptd.sh***
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型权重路径和词表路径
CHECKPOINT="./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
# 根据实际加载的模型权重修改并行配置
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
TP=8
PP=1
# 注意:该模型经过指令遵从训练,需要配合对话模板使用;基本操作同上,仅需在脚本中增加如下参数:
--inference-prompt-type mixtral
```
然后可直接启动
```bash
bash examples/mistral/generate_mistral_7b_ptd.sh
```
推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/mistral/generate_demo.png)
## 模型评估
使用 MMLU 数据集评估模型。数据集下载路径见[这里](https://huggingface.co/datasets/cais/mmlu)。
配置评估脚本: examples/mistral/evaluate_mistral_7b_ptd.sh
```bash
# ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 修改模型参数路径和词表路径
CHECKPOINT="./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
# 配置任务和数据集路径
DATA_PATH="./mmlu/test/"
TASK="mmlu"
```
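MMLU 测试集按学科拆分为若干 csv 文件DATA_PATH 指向存放这些 csv 的目录即可。可用如下命令粗略检查数据是否就绪(路径与文件名以实际下载结果为准):

```bash
# 粗略检查 MMLU 测试集MMLU 共 57 个学科,正常应能看到 57 个 csv 文件
ls ./mmlu/test/*.csv | head -n 5
ls ./mmlu/test/*.csv | wc -l
```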
启动评估
```bash
bash examples/mistral/evaluate_mistral_7b_ptd.sh
```
评估结果如下
| 数据集 | 总问题数 | 参考准确率 | NPU准确率 |
| :----: | :------: | :--------: | :-------: |
| MMLU | 14042 | 0.563 | 0.563 |

View File

@ -1,301 +0,0 @@
# Mistral-7B-32K
<p align="left">
<b><a href="README.md">简体中文</a> </b> |
<b>English</b>
</p>
# Table of Contents
- [Mistral](#mistral)
- [Table of Contents](#table-of-contents)
- [7B](#7B-32K)
- [Hardware-Requirements](#hardware-requirements)
- [Preparation](#preparation)
- [Model-Training](#model-training)
- [Model-Performance](#model-performance)
- [Throughput](#throughput)
- [Model-Inference](#model-inference)
- [Model-Evaluation](#model-evaluation)
# 7B-32K
## Hardware-Requirements
Minimum hardware requirements for training:
| Hardware | Configuration |
| :------: |:---------------:|
| NPU | 8 x Ascend NPUs |
Recommended hardware configuration for inference:
| Hardware | Configuration |
| :------: | :-------------: |
| NPU | 8 x Ascend NPUs |
## Preparation
1. Clone the code repository to the local server
```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```
2. Set up the environment
```bash
# python3.8
conda create -n test python=3.8
conda activate test
# Install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Download the pre-trained weights and vocabulary for Mistral-7B from [here](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/tree/main). (It is recommended to only download weights in safetensors format)
```shell
#!/bin/bash
cd ./model_from_hf/
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
cd ..
```
4. Weight conversion
HuggingFace weights --> Megatron weights with any parallel slicing strategy
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```bash
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# HF to tp8-pp1
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--load-dir ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--save-dir ./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/ \
--tokenizer-model ./model_from_hf/Mistral-7B-Instruct-v0.2/tokenizer.model \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 1
```
Any Megatron weights with parallel slicing strategy --> HuggingFace weights
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```bash
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp8-pp1 to HF
python tools/checkpoint/convert_ckpt.py \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/ \
--save-dir ./model_from_hf/Mistral-7B-Instruct-v0.2/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Mistral-7B-Instruct-v0.2/mg2hg/
```
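As noted in the script comment above, the converted HuggingFace weights are written to the mg2hg/ sub-directory of the original model path; a quick check (for illustration only) could look like this:

```bash
# Confirm that the mg2hg directory was created and inspect the exported weights
ls -lh ./model_from_hf/Mistral-7B-Instruct-v0.2/mg2hg/
```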
## Model-Training
Prepare dataset
Download the datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet), save to ModelLink/dataset/ directory.
```shell
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Mistral-7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--output-prefix ./dataset/Mistral-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
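After preprocessing, the --output-prefix typically yields a .bin/.idx pair whose name should match the DATA_PATH used in the pre-training script; the following check is only an example:

```bash
# Confirm the preprocessed outputs match the DATA_PATH prefix used by the pre-training script
ls -lh ./dataset/Mistral-7B/alpaca_text_document.bin \
       ./dataset/Mistral-7B/alpaca_text_document.idx
```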
Configure Mistral-7B pre-training script: ***examples/mistral/pretrain_mistral_7b_ptd.sh***
```shell
# Set the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Configure according to the actual vocabulary, dataset, and model parameter save path
DATA_PATH="./dataset/Mistral-7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
CKPT_SAVE_DIR="./ckpt/Mistral-7B-Instruct-v0.2-tp8-pp1/"
# Configure distributed parameters according to the actual distributed cluster
GPUS_PER_NODE=8
MASTER_ADDR="your master node IP"
MASTER_PORT=6000
NNODES=1
NODE_RANK="current node id"
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))
# Training parallel strategy
TP=8
PP=1
```
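For multi-machine training, the distributed variables must be set per node. A hypothetical two-node, 16-card example is sketched below (the IP address and node count are placeholders):

```bash
# Example distributed variables for a 2-node, 16-card setup (placeholder values, adjust to your cluster)
GPUS_PER_NODE=8
MASTER_ADDR="192.168.0.10"    # every node points to the master node IP
MASTER_PORT=6000
NNODES=2
NODE_RANK=0                   # 0 on the master node, 1 on the second machine
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))
```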
Start Mistral-7B pre-training script: ***examples/mistral/pretrain_mistral_7b_ptd.sh***
```shell
bash examples/mistral/pretrain_mistral_7b_ptd.sh
```
**Note**:
1. For multi-machine training, it is necessary to set up data sharing across machines so that non-primary nodes can read the data prepared by the primary node; alternatively, copy the data generated by the master node to the non-master nodes directly.
2. The hyperparameters for training in the pretrain_mistral_7b_ptd.sh script need to be adjusted according to actual situations. For example, the global-batch-size needs to be set larger during pre-training to achieve better results, such as 256.
Fine-Tuning
Prepare fine-tuning dataset
Download the fine-tuning datasets from [here](https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese/blob/main/Alpaca_data_gpt4_zh.jsonl)
```shell
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese/blob/main/Alpaca_data_gpt4_zh.jsonl
cd ..
# process datasets
mkdir ./finetune_dataset/Mistral-7B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/Alpaca_data_gpt4_zh.jsonl \
--output-prefix ./finetune_dataset/Mistral-7B/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Mistral-7B-Instruct-v0.2/ \
--append-eod \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler \
--workers 4
```
Supervised Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as the pre-training script. *The difference is the dataset and the additional training parameter `--is-instruction-dataset`.*
Add the fine-tuning parameter `--finetune` and the pretrained-weight load parameter `--load`, so that fine-tuning starts from the first step.
```shell
DATA_PATH="./finetune_dataset/Mistral-7B/alpaca"
CKPT_PATH="./ckpt/Mistral-7B-Instruct-v0.2-tp8-pp1/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset
```
## Model-Performance
### Throughput
Comparison of Mistral-7B-32K (**SWA 4096**) performance on 1 node and 8 chips with tp8 pp1:
| Device | Model | Iterations | Sample Throughput (samples/s) | Tokens Throughput (tokens/s/p) | Single-Step Iteration Time (s/step) | Memory Usage/p |
| :--: | :----------: | :----: | :---------------------: | :---------------------: | :-------------------: | :-------------------: |
| NPUs | Mistral-7B 32K | 1000 | 0.69 | 2806 | 46.7 | ~44642MB |
| Reference | Mistral-7B 32K | 1000 | 0.67 | 2734 | 48.0 | ~65500MB |
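The throughput figures can be roughly reproduced from the pre-training configuration (seq-length 32768, global-batch-size 32, 8 NPUs) and the step time; the snippet below only illustrates that arithmetic:

```bash
# Estimate throughput from seq-length=32768, global-batch-size=32, 8 devices, 46.7 s/step
awk 'BEGIN {
    seq = 32768; gbs = 32; npus = 8; step = 46.7
    printf "samples/s=%.2f  tokens/s/p=%.0f\n", gbs/step, seq*gbs/(step*npus)
}'
# Expected output is roughly samples/s=0.69 and tokens/s/p=2807, close to the table above
```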
## Model-Inference
First, configure the inference script: ***examples/mistral/generate_mistral_7b_ptd.sh***
```bash
# Execute set_env.sh according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model weight path and tokenizer path
CHECKPOINT="./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
# Modify according to the actual loaded model weight the parallel configuration
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
TP=8
PP=1
# Note: the model used here is an instruction-tuned (L1) model and needs to be used with a chat template. The basic operations are the same as above; only the following argument is added:
--inference-prompt-type mixtral
```
Then you can start it directly
```bash
bash examples/mistral/generate_mistral_7b_ptd.sh
```
An example of inference is as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/mistral/generate_demo.png)
## Model-Evaluation
Evaluate the model using the MMLU dataset. Dataset download path [here](https://huggingface.co/datasets/cais/mmlu).
Configure the evaluation script: ***examples/mistral/evaluate_mistral_7b_ptd.sh***
```bash
# Ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model parameter path and tokenizer path
CHECKPOINT="./model_weights/Mistral-7B-Instruct-v0.2-tp8-pp1/"
TOKENIZER_MODEL="./model_from_hf/Mistral-7B-Instruct-v0.2/"
# Configure tasks and dataset paths
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
```
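The MMLU test set is split into per-subject csv files; DATA_PATH should point at the directory holding them. A rough readiness check (paths depend on how the dataset was downloaded) might be:

```bash
# Rough check of the MMLU test set: 57 subjects, so 57 csv files are expected
ls ./mmlu/data/test/*.csv | head -n 5
ls ./mmlu/data/test/*.csv | wc -l
```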
Start the evaluation
```bash
bash examples/mistral/evaluate_mistral_7b_ptd.sh
```
The evaluation results are as follows
| Dataset | Total Questions | Reference Accuracy | NPU Accuracy |
| :-----: | :-----: | :------------: | :---: |
| MMLU | 14042 | 0.563 | 0.563 |

View File

@ -1,66 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export TOKENIZERS_PARALLELISM=false
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
TP=8
PP=1
CHECKPOINT="Your ckpt file path"
TOKENIZER_PATH="Your vocab file path"
DATA_PATH="Your data path (such as ./mmlu/test/)"
TASK="mmlu"
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--task $TASK \
--task-data-path $DATA_PATH \
--max-new-tokens 1 \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--seq-length 4096 \
--max-position-embeddings 32768 \
--micro-batch-size 1 \
--make-vocab-size-divisible-by 1 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--position-embedding-type rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--swiglu \
--no-gradient-accumulation-fusion \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--load ${CHECKPOINT} \
--no-load-optim \
--no-load-rng \
--bf16 \
--seed 42
"
torchrun $DISTRIBUTED_ARGS evaluation.py \
$GPT_ARGS \
--distributed-backend nccl | tee logs/evaluation_mixtral_${TASK}.log

View File

@ -1,65 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# please fill these path configurations
CHECKPOINT="your model ckpt path"
TOKENIZER_MODEL="your tokenizer path"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--num-layers 32 \
--hidden-size 4096 \
--sliding-window 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--seq-length 4096 \
--max-position-embeddings 32768 \
--micro-batch-size 1 \
--make-vocab-size-divisible-by 1 \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--position-embedding-type rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--swiglu \
--no-gradient-accumulation-fusion \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--load ${CHECKPOINT} \
--no-load-optim \
--no-load-rng \
--bf16
"
torchrun $DISTRIBUTED_ARGS inference.py \
$GPT_ARGS \
--distributed-backend nccl \
--inference-prompt-type mixtral \
| tee logs/generate_mixtral.log

View File

@ -1,102 +0,0 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
MASTER_ADDR="your master node IP"
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
echo "NODE_RANK ${NODE_RANK}"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer path"
CKPT_SAVE_DIR="your model save ckpt path"
CKPT_LOAD_DIR="your model ckpt path"
TP=8
PP=1
NUM_LAYERS=32
GPT_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--sliding-window 4096 \
--num-layers ${NUM_LAYERS} \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_MODEL} \
--seq-length 32768 \
--max-position-embeddings 32768 \
--micro-batch-size 1 \
--global-batch-size 32 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 1000 \
--lr-decay-style cosine \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--position-embedding-type rope \
--normalization RMSNorm \
--use-fused-rmsnorm \
--use-fused-swiglu \
--use-rotary-position-embeddings \
--use-fused-rotary-pos-emb \
--use-mc2 \
--swiglu \
--use-flash-attn \
--no-masked-softmax-fusion \
--attention-softmax-in-fp32 \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--initial-loss-scale 65536 \
--adam-beta2 0.95 \
--no-gradient-accumulation-fusion \
--no-load-optim \
--no-load-rng \
--use-distributed-optimizer \
--overlap-grad-reduce \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0 \
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 1000 \
--eval-interval 1000 \
--eval-iters 0 \
"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
| tee logs/train_mistral.log

View File

@ -42,10 +42,11 @@
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -70,7 +71,7 @@
# 安装加速库
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
@ -99,7 +100,7 @@
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# HF 转 tp8-pp4-ep1
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_hf \
--saver mixtral \
@ -118,8 +119,8 @@
# 修改 ascend-toolkit 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp1-pp8-ep2 转 tp1-pp8-ep1
python tools/checkpoint/convert_ckpt.py \
# tp8-pp4-ep1 转 tp8-pp1-ep1
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_mg \
--saver mixtral \
@ -138,7 +139,7 @@
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp8-pp4-ep1 转 HF
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_mg \
--saver mixtral \
@ -194,7 +195,7 @@
EP=1
```
启动 Mixtral-8x7B 预训练脚本: ***examples/pretrain_mixtral_8x7b_ptd.sh***
启动 Mixtral-8x7B 预训练脚本: ***examples/mixtral/pretrain_mixtral_8x7b_ptd.sh***
```shell
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
@ -254,7 +255,7 @@ Mixtral-8x7B 在四机32卡上(tp8 pp4) **昇腾芯片** 和 **参考芯片**
## 模型推理
首先需要配置推理脚本: ***examples/mixtral/generate_mixtral_8x7b_ptd.sh***
首先需要配置推理脚本: ***tasks/inference/generate_mixtral_8x7b_ptd.sh***
```bash
# 根据您自己的 ascend-toolkit 路径执行set_env.sh
@ -283,16 +284,16 @@ PP=1
然后可直接启动
```bash
bash examples/mixtral/generate_mixtral_8x7b_ptd.sh
bash tasks/inference/generate_mixtral_8x7b_ptd.sh
```
推理的示例如下:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/mixtral/generate_demo.png)
![Inference](../../sources/images/mixtral/generate_demo.png)
## 模型评估
使用 MMLU数据集评估模型. 数据集下载路径 [这里](https://huggingface.co/datasets/cais/mmlu).
配置评估脚本: examples/mixtral/evaluate_mixtral_8x7b_ptd.sh
配置评估脚本: tasks/evaluation/evaluate_mixtral_8x7b_ptd.sh
```bash
# ascend-toolkit 路径
@ -309,7 +310,7 @@ TASK="mmlu"
启动评估
```bash
bash examples/mixtral/evaluate_mixtral_8x7b_ptd.sh
bash tasks/evaluation/evaluate_mixtral_8x7b_ptd.sh
```
评估结果如下

View File

@ -42,10 +42,11 @@ Recommended hardware configuration for inference:
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -70,7 +71,7 @@ Recommended hardware configuration for inference:
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
@ -100,12 +101,12 @@ Recommended hardware configuration for inference:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# HF to tp8-pp4-ep1
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_hf \
--saver mixtral \
--load-dir ./model_from_hf/Mixtral-8x7B/ \
--save-dir ./model_weights/Mixtral-8x7B-v0.1-tp8-pp4-ep1/ \
--save-dir ./model_weights/Mixtral-8x7B-v0.1-tp1-tp8-pp4/ \
--tokenizer-model ./model_from_hf/Mixtral-8x7B/tokenizer.model \
--target-tensor-parallel-size 8 \
--target-pipeline-parallel-size 4 \
@ -113,14 +114,14 @@ Recommended hardware configuration for inference:
```
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
***(This scenario is generally used to reconfigure the sliced model weights, such as training on a four-node 32-card TP8-PP4 strategy, and then wanting to infer on a single-node 8-card TP8)***
***(This scenario is generally used to reconfigure the sliced model weights, such as training on a dual-node 16-card EP2-PP8 strategy, and then wanting to infer on a single-node 8-card TP8)***
```bash
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp8-pp4-ep1 to tp8-pp1-ep1
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_mg \
--saver mixtral \
@ -139,7 +140,7 @@ Recommended hardware configuration for inference:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp8-pp4-ep1 to HF
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader mixtral_mg \
--saver mixtral \
@ -195,7 +196,7 @@ Recommended hardware configuration for inference:
EP=1
```
Start Mixtral-8x7B pre-training script: ***examples/pretrain_mixtral_8x7b_ptd.sh***
Start Mixtral-8x7B pre-training script: ***examples/mixtral/pretrain_mixtral_8x7b_ptd.sh***
```shell
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
@ -255,7 +256,7 @@ Comparison of Mixtral-8x7B performance on 4 nodes and 32 chips with tp8 pp4:
## Model-Inference
First, configure the inference script: ***examples/mixtral/generate_mixtral_8x7b_ptd.sh***
First, configure the inference script: ***tasks/inference/generate_mixtral_8x7b_ptd.sh***
```bash
# Execute set_env.sh according to your own ascend-toolkit path
@ -284,16 +285,16 @@ torchrun $DISTRIBUTED_ARGS inference.py
Then you can start it directly
```bash
bash examples/mixtral/generate_mixtral_8x7b_ptd.sh
bash tasks/inference/generate_mixtral_8x7b_ptd.sh
```
An example of inference is as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/mixtral/generate_demo.png)
![Inference](../../sources/images/mixtral/generate_demo.png)
## Model-Evaluation
Evaluate the model using the MMLU dataset. Dataset download path [here](https://huggingface.co/datasets/cais/mmlu).
Configure the evaluation script: ***examples/mixtral/evaluate_mixtral_8x7b_ptd.sh***
Configure the evaluation script: ***tasks/evaluation/evaluate_mixtral_8x7b_ptd.sh***
```bash
# Ascend-toolkit path
@ -311,7 +312,7 @@ TASK="mmlu"
Start the evaluation
```bash
bash examples/mixtral/evaluate_mixtral_8x7b_ptd.sh
bash tasks/evaluation/evaluate_mixtral_8x7b_ptd.sh
```
The evaluation results are as follows

View File

@ -1,5 +1,6 @@
#!/bin/bash
export ASCEND_LAUNCH_BLOCKING=1
export WITHOUT_JIT_COMPILE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
@ -37,7 +38,7 @@ MOE_ARGS="
--moe-router-load-balancing-type aux_loss \
--moe-aux-loss-coeff 0.01 \
--moe-train-capacity-factor 1.1 \
--noisy-gate-policy RSample
--noisy_gate_policy RSample
"
GPT_ARGS="
@ -58,7 +59,7 @@ GPT_ARGS="
--seq-length 32768 \
--max-position-embeddings 32768 \
--micro-batch-size 1 \
--global-batch-size 16 \
--global-batch-size 8 \
--make-vocab-size-divisible-by 1 \
--lr 1.25e-6 \
--train-iters 2000 \

File diff suppressed because it is too large.

View File

@ -1,4 +1,4 @@
# Qwen
# Qwen $\color{black}{\rm\tiny{【Model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Ascend】}}$
<p align="left">
<b><a href="README.md">简体中文</a></b> |
<b>English</b>
@ -48,10 +48,11 @@ Here's a hardware summary of pre-training Qwen-7B:
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -63,73 +64,73 @@ Here's a hardware summary of pre-training Qwen-7B:
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
3. Prepare pretrained weights and tokenizer
Download the Qwen-7B checkpoint from [here](https://huggingface.co/Qwen/Qwen-7B/tree/main)
Download the Qwen-7B checkpoint from [here](https://huggingface.co/Qwen/Qwen-7B/tree/main)
```bash
mkdir ./model_from_hf/Qwen-7B/
cd ./model_from_hf/Qwen-7B/
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cache_autogptq_cuda_256.cpp
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cache_autogptq_cuda_kernel_256.cu
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/config.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/configuration_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cpp_kernels.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/generation_config.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00001-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00002-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00003-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00004-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00005-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00006-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00007-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00008-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model.safetensors.index.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/modeling_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/qwen.tiktoken
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/qwen_generation_utils.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/tokenization_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/tokenizer_config.json
cd ../../
```
```bash
mkdir ./model_from_hf/Qwen-7B/
cd ./model_from_hf/Qwen-7B/
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cache_autogptq_cuda_256.cpp
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cache_autogptq_cuda_kernel_256.cu
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/config.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/configuration_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/cpp_kernels.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/generation_config.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00001-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00002-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00003-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00004-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00005-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00006-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00007-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model-00008-of-00008.safetensors
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/model.safetensors.index.json
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/modeling_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/qwen.tiktoken
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/qwen_generation_utils.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/tokenization_qwen.py
wget https://huggingface.co/Qwen/Qwen-7B/resolve/main/tokenizer_config.json
cd ../../
```
Modify line 39 in the modelling_qwen.py file, changing:
Modify line 39 in the modelling_qwen.py file, changing:
```python
SUPPORT_FP16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 7
```
```python
SUPPORT_FP16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 7
```
to
to
```python
SUPPORT_FP16 = True
```
```python
SUPPORT_FP16 = True
```
4. Weights convert
Convert weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
Convert weights from huggingface format to megatron format
***(This scenario is generally used to train open-source HuggingFace models on Megatron)***
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader qwen_hf \
--saver megatron \
@ -140,13 +141,13 @@ Here's a hardware summary of pre-training Qwen-7B:
--add-qkv-bias
```
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy
***(This scenario is generally used to convert the trained megatron model back to the HuggingFace format)***
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
@ -157,7 +158,7 @@ Here's a hardware summary of pre-training Qwen-7B:
--add-qkv-bias \
--save-dir ./model_from_hf/Qwen-7B/ # Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Qwen-7B/mg2hg/
```
5. Prepare dataset
1. Prepare dataset
Download the Qwen-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
@ -166,7 +167,7 @@ Here's a hardware summary of pre-training Qwen-7B:
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Qwen-7B/
python ./tools/preprocess_data.py \
@ -177,28 +178,28 @@ Here's a hardware summary of pre-training Qwen-7B:
--seq-length 8192 \
--workers 4 \
--log-interval 1000
```
6. pre-training
```
1. pre-training
Config Qwen-7B pre-training script: examples/qwen/pretrain_qwen_7b_ptd.sh
Config Qwen-7B pre-training script: examples/qwen/pretrain_qwen_7b_ptd.sh
```shell
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Qwen-7B/"
TOKENIZER_MODEL="./model_from_hf/Qwen-7B/" #tokenizer path
DATA_PATH="./dataset/Qwen-7B/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Qwen-7B-v0.1-tp8-pp1/"
```
```
Config Qwen-7B pre-training script: examples/qwen/pretrain_qwen_7b_ptd.sh
Config Qwen-7B pre-training script: examples/qwen/pretrain_qwen_7b_ptd.sh
```shell
```shell
bash examples/qwen/pretrain_qwen_7b_ptd.sh
```
**Note**: If using multi machine training, and no data sharing configuration on the mechines, it's necessary to add the parameter `--no-shared-storage`. This parameter will determine whether non master nodes need to load data based on distributed parameters, and check the corresponding cache and generated data.
```
**Note**: If using multi machine training, and no data sharing configuration on the mechines, it's necessary to add the parameter `--no-shared-storage`. This parameter will determine whether non master nodes need to load data based on distributed parameters, and check the corresponding cache and generated data.
### Performance
@ -212,8 +213,7 @@ The performance of Qwen-7B in **Ascend NPU** and **Reference**:
| Reference | Qwen-7B | 2867 |
## Inference
Config qwen-7b inference script: examples/qwen/generate_qwen_7b_ptd.sh
Config qwen-7b inference script: tasks/inference/generate_qwen_7b_ptd.sh
```bash
# ascend-toolkit path
@ -224,22 +224,19 @@ CHECKPOINT="./model_weights/Qwen-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Qwen-7B/"
```
Launch qwen-7b inference script: examples/qwen/generate_qwen_7b_ptd.sh
Launch qwen-7b inference script: tasks/inference/generate_qwen_7b_ptd.sh
```bash
bash examples/qwen/generate_qwen_7b_ptd.sh
bash tasks/inference/generate_qwen_7b_ptd.sh
```
**Note**: If using multi machine training, it is necessary to set up multi machine data sharing, and non primary nodes can read the primary node data through data sharing. Alternatively, directly copy the data generated by the master node to non master nodes.
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/qwen/qwen_7b_inference.png)
![Inference](../../sources/images/qwen/qwen_7b_inference.png)
## Evaluation
We use the [CEval benchmark](https://huggingface.co/datasets/ceval/ceval-exam) and [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
Config qwen-7b evaluation script: examples/qwen/evaluate_qwen_7b_ptd.sh
Config qwen-7b evaluation script: tasks/evaluation/evaluate_qwen_7b_ptd.sh
```bash
# ascend-toolkit path
@ -257,7 +254,7 @@ TASK="mmlu" # "ceval" for ceval task
Launch qwen-7b evaluation
```bash
bash examples/qwen/evaluate_qwen_7b_ptd.sh
bash ./tasks/evaluation/evaluate_qwen_7b_ptd.sh
```
| Task | Subset | Question | OpenSource | NPU |
@ -283,10 +280,11 @@ Here's a hardware summary of pre-training Qwen-14B:
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -298,20 +296,20 @@ Here's a hardware summary of pre-training Qwen-14B:
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
@ -370,8 +368,8 @@ Here's a hardware summary of pre-training Qwen-14B:
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader qwen_hf \
--saver megatron \
@ -388,7 +386,7 @@ Here's a hardware summary of pre-training Qwen-14B:
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
@ -399,7 +397,7 @@ Here's a hardware summary of pre-training Qwen-14B:
--add-qkv-bias \
--save-dir ./model_from_hf/Qwen-14B/ # Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Qwen-14B/mg2hg/
```
5. Prepare dataset
1. Prepare dataset
Download the Qwen-14B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
@ -408,7 +406,7 @@ Here's a hardware summary of pre-training Qwen-14B:
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Qwen-14B/
python ./tools/preprocess_data.py \
@ -419,15 +417,15 @@ Here's a hardware summary of pre-training Qwen-14B:
--seq-length 2048 \
--workers 4 \
--log-interval 1000
```
6. pre-training
```
1. pre-training
Config Qwen-14B pre-training script: examples/qwen/pretrain_qwen_14b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Qwen-14B/"
TOKENIZER_MODEL="./model_from_hf/Qwen-14B/" #tokenizer path
@ -440,7 +438,7 @@ Here's a hardware summary of pre-training Qwen-14B:
```shell
bash examples/qwen/pretrain_qwen_14b_ptd.sh
```
**Note**: If using multi machine training, and no data sharing configuration on the mechines, it's necessary to add the parameter `--no-shared-storage`. This parameter will determine whether non master nodes need to load data based on distributed parameters, and check the corresponding cache and generated data.
### Performance
#### Machine performance
@ -454,7 +452,7 @@ The performance of Qwen-14B in **Ascend NPU** and **Reference**:
## Inference
Config qwen-14b inference script: examples/qwen/generate_qwen_14b_ptd.sh
Config qwen-14b inference script: tasks/inference/generate_qwen_14b_ptd.sh
```bash
# ascend-toolkit path
@ -465,21 +463,20 @@ CHECKPOINT="./model_weights/Qwen-14B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Qwen-14B/"
```
Launch qwen-14b inference script: examples/qwen/generate_qwen_14b_ptd.sh
Launch qwen-14b inference script: tasks/inference/generate_qwen_14b_ptd.sh
```bash
bash examples/qwen/generate_qwen_7b_ptd.sh
bash tasks/inference/generate_qwen_7b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/qwen/qwen_14b_inference.png)
![Inference](../../sources/images/qwen/qwen_14b_inference.png)
## Evaluation
We use the [CEval benchmark](https://huggingface.co/datasets/ceval/ceval-exam) and [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
Config qwen-14b evaluation script: examples/qwen/evaluate_qwen_14b_ptd.sh
Config qwen-14b evaluation script: tasks/evaluation/evaluate_qwen_14b_ptd.sh
```bash
# ascend-toolkit path
@ -497,7 +494,7 @@ TASK="mmlu" # "ceval" for ceval task
Launch qwen-14b evaluation
```bash
bash examples/qwen/evaluate_qwen_14b_ptd.sh
bash ./tasks/evaluation/evaluate_qwen_14b_ptd.sh
```
| Task | Subset | Question | OpenSource | NPU |
@ -524,10 +521,11 @@ Here's a hardware summary of pre-training Qwen-72B:
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
@ -539,20 +537,20 @@ Here's a hardware summary of pre-training Qwen-72B:
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# install MindSpeed
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout 2b0edd2
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..
# install other packages
pip install -r requirements.txt
```
@ -592,8 +590,8 @@ Here's a hardware summary of pre-training Qwen-72B:
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader qwen_hf \
--saver megatron \
@ -610,7 +608,7 @@ Here's a hardware summary of pre-training Qwen-72B:
```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/convert_ckpt.py \
python tools/checkpoint/util.py \
--model-type GPT \
--loader megatron \
--saver megatron \
@ -621,7 +619,8 @@ Here's a hardware summary of pre-training Qwen-72B:
--add-qkv-bias \
--save-dir ./model_from_hf/Qwen-72B/ # Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Qwen-72B/mg2hg/
```
5. Prepare dataset
1. Prepare dataset
Download the Qwen-72B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
@ -648,31 +647,28 @@ Here's a hardware summary of pre-training Qwen-72B:
Config Qwen-72B pre-training script: examples/qwen/pretrain_qwen_72b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Qwen-72B/"
TOKENIZER_MODEL="./model_from_hf/Qwen-72B/" #tokenizer path
DATA_PATH="./dataset/Qwen-72B/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Qwen-72B-v0.1-tp8-pp1/"
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify config according to your own actual situation
CKPT_SAVE_DIR="./ckpt/Qwen-72B/"
TOKENIZER_MODEL="./model_from_hf/Qwen-72B/" #tokenizer path
DATA_PATH="./dataset/Qwen-72B/alpaca_text_document" #processed dataset
CKPT_LOAD_DIR="./model_weights/Qwen-72B-v0.1-tp8-pp1/"
```
To use a 32K sequence, turn on the re-computation feature and change the value of seq-length to 32768. The parameter configuration is as follows:
```shell
--seq-length 32768 \
--recompute-granularity full \
--recompute-method block \
--recompute-num-layers 80 \
```
--recompute-granularity full \
--recompute-method block \
--recompute-num-layers 80 \
```
Launch Qwen-72B pre-training script: examples/qwen/pretrain_qwen_72b_ptd.sh
```shell
Launch Qwen-72B pre-training script: examples/qwen/pretrain_qwen_72b_ptd.sh
```shell
bash examples/qwen/pretrain_qwen_72b_ptd.sh
```
**Note**: If using multi machine training, and no data sharing configuration on the mechines, it's necessary to add the parameter `--no-shared-storage`. This parameter will determine whether non master nodes need to load data based on distributed parameters, and check the corresponding cache and generated data.
```
### Performance
@ -687,8 +683,7 @@ The performance of Qwen-72B in **Ascend NPU** and **Reference**:
## Inference
Config qwen-72b inference script: examples/qwen/generate_qwen_72b_ptd.sh
Config qwen-72b inference script: tasks/inference/generate_qwen_72b_ptd.sh
```bash
# ascend-toolkit path
@ -699,20 +694,19 @@ CHECKPOINT="./model_weights/Qwen-72B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Qwen-72B/"
```
Launch qwen-72b inference script: examples/qwen/generate_qwen_72b_ptd.sh
Launch qwen-72b inference script: tasks/inference/generate_qwen_72b_ptd.sh
```bash
bash examples/qwen/generate_qwen_72b_ptd.sh
bash tasks/inference/generate_qwen_72b_ptd.sh
```
Some inference samples are as follows:
![Inference](https://gitee.com/ascend/ModelLink/raw/master/sources/images/qwen/qwen_72b_inference.png)
![Inference](../../sources/images/qwen/qwen_72b_inference.png)
## Evaluation
We use the [CEval benchmark](https://huggingface.co/datasets/ceval/ceval-exam) and [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu) to evaluate our model.
Config qwen-72b evaluation script: examples/qwen/evaluate_qwen_72b_ptd.sh
Config qwen-72b evaluation script: tasks/evaluation/evaluate_qwen_72b_ptd.sh
```bash
# ascend-toolkit path
@ -730,7 +724,7 @@ TASK="mmlu" # "ceval" for ceval task
Launch qwen-72b evaluation
```bash
bash examples/qwen/evaluate_qwen_72b_ptd.sh
bash ./tasks/evaluation/evaluate_qwen_72b_ptd.sh
```
| Task | Subset | Question | OpenSource | NPU |

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -88,6 +89,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$OUTPUT_ARGS \
--tokenizer-kwargs 'eos_token' '<|endoftext|>' 'pad_token' '<|extra_0|>' \
--distributed-backend nccl \
--jit-compile \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_qwen_14b.log

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -89,6 +90,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$OUTPUT_ARGS \
--tokenizer-kwargs 'eos_token' '<|endoftext|>' 'pad_token' '<|extra_0|>' \
--distributed-backend nccl \
--jit-compile \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_qwen_72b.log

View File

@ -1,6 +1,7 @@
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NPU_ASD_ENABLE=0
NPUS_PER_NODE=8
MASTER_ADDR=localhost
@ -88,6 +89,5 @@ torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
$OUTPUT_ARGS \
--tokenizer-kwargs 'eos_token' '<|endoftext|>' 'pad_token' '<|extra_0|>' \
--distributed-backend nccl \
--jit-compile \
--save ${CKPT_SAVE_DIR} \
| tee logs/train_qwen_7b.log

File diff suppressed because it is too large.

File diff suppressed because it is too large.

View File

@ -1,67 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=1
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
# please fill these path configurations
CHECKPOINT="your model ckpt path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="your data path"
TASK="mmlu"
TP=1
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK} \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--seq-length 8192 \
--max-new-tokens 1 \
--max-position-embeddings 8192 \
--num-layers 24 \
--hidden-size 1024 \
--ffn-hidden-size 2816 \
--num-attention-heads 16 \
--disable-bias-linear \
--swiglu \
--position-embedding-type rope \
--load $CHECKPOINT \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--micro-batch-size 1 \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--add-qkv-bias \
--make-vocab-size-divisible-by 1 \
--padded-vocab-size 151936 \
--rotary-base 1000000 \
--no-gradient-accumulation-fusion \
--attention-softmax-in-fp32 \
--seed 42 \
--no-chat-template \
| tee logs/eval_qwen15_0point5b_${TASK}.log

View File

@ -1,69 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
# please fill these path configurations
CHECKPOINT="your model ckpt path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="your data path"
TASK="mmlu"
TP=8
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK} \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--seq-length 8192 \
--max-new-tokens 1 \
--max-position-embeddings 8192 \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13696 \
--num-attention-heads 40 \
--disable-bias-linear \
--swiglu \
--position-embedding-type rope \
--load $CHECKPOINT \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--micro-batch-size 1 \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--untie-embeddings-and-output-weights \
--add-qkv-bias \
--make-vocab-size-divisible-by 16 \
--padded-vocab-size 152064 \
--rotary-base 1000000 \
--no-gradient-accumulation-fusion \
--attention-softmax-in-fp32 \
--seed 42 \
--bf16 \
--no-chat-template \
| tee logs/eval_qwen15_14b_${TASK}.log

View File

@ -1,69 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=1
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
# please fill these path configurations
CHECKPOINT="your model ckpt path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="your data path"
TASK="mmlu"
TP=1
PP=1
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK} \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--seq-length 8192 \
--max-new-tokens 1 \
--max-position-embeddings 8192 \
--num-layers 24 \
--hidden-size 2048 \
--ffn-hidden-size 5504 \
--num-attention-heads 16 \
--disable-bias-linear \
--swiglu \
--position-embedding-type rope \
--load $CHECKPOINT \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--micro-batch-size 1 \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--untie-embeddings-and-output-weights \
--add-qkv-bias \
--make-vocab-size-divisible-by 1 \
--padded-vocab-size 151936 \
--rotary-base 1000000 \
--no-gradient-accumulation-fusion \
--attention-softmax-in-fp32 \
--seed 42 \
--bf16 \
--no-chat-template \
| tee logs/eval_qwen15_1point8b_${TASK}.log

View File

@ -1,60 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1800
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
NPU_PER_NODE=8
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPU_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPU_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK \
--master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="./mmlu/data/test"
TASK="mmlu"
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task $TASK \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 64 \
--hidden-size 5120 \
--num-attention-heads 40 \
--ffn-hidden-size 27392 \
--max-position-embeddings 8192 \
--seq-length 8192 \
--padded-vocab-size 152064 \
--rotary-base 1000000 \
--make-vocab-size-divisible-by 1 \
--untie-embeddings-and-output-weights \
--micro-batch-size 1 \
--swiglu \
--disable-bias-linear \
--add-qkv-bias \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--load ${CHECKPOINT} \
--normalization RMSNorm \
--position-embedding-type rope \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--tokenizer-not-use-fast \
--max-new-tokens 1 \
--bf16 \
--group-query-attention \
--num-query-groups 8 \
--no-chat-template \
--seed 42 \
| tee logs/eval_qwen15_32b_${TASK}.log

View File

@ -1,57 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=2
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT="your model ckpt path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="your data path"
TASK="mmlu"
# Different task needs different max_new_tokens value, please follow the instruction in readme.
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task ${TASK}\
--seq-length 8192 \
--max-new-tokens 1 \
--max-position-embeddings 8192 \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--num-layers 40 \
--hidden-size 2560 \
--ffn-hidden-size 6912 \
--num-attention-heads 20 \
--disable-bias-linear \
--swiglu \
--position-embedding-type rope \
--load $CHECKPOINT \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--bf16 \
--micro-batch-size 1 \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--untie-embeddings-and-output-weights \
--add-qkv-bias \
--make-vocab-size-divisible-by 1 \
--seed 42 \
--rotary-base 5000000 \
--no-chat-template \
--padded-vocab-size 151936 | tee ./logs/eval_qwen15_4b_${TASK}.log

View File

@ -1,58 +0,0 @@
#!/bin/bash
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1800
export COMBINED_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Change for multinode config
MASTER_ADDR=localhost
NPU_PER_NODE=8
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPU_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPU_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK \
--master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer path"
DATA_PATH="./mmlu/data/test"
TASK="mmlu"
torchrun $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task $TASK \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 64 \
--hidden-size 8192 \
--num-attention-heads 64 \
--ffn-hidden-size 24576 \
--max-position-embeddings 8192 \
--seq-length 8192 \
--padded-vocab-size 152064 \
--rotary-base 1000000 \
--make-vocab-size-divisible-by 1 \
--untie-embeddings-and-output-weights \
--micro-batch-size 1 \
--swiglu \
--disable-bias-linear \
--add-qkv-bias \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--load ${CHECKPOINT} \
--normalization RMSNorm \
--position-embedding-type rope \
--exit-on-missing-checkpoint \
--no-load-rng \
--no-load-optim \
--tokenizer-not-use-fast \
--max-new-tokens 1 \
--bf16 \
--no-chat-template \
--seed 42 \
| tee logs/eval_qwen15_72b_${TASK}.log

Some files were not shown because too many files have changed in this diff.