ModelLink


yuhui 979a633740 !869 Qwen-72B inference and evaluation Merge pull request !869 from yuhui/modellink		2024-03-07 06:16:34 +00:00
ci	!614 Add a unified test framework for gating checks and pipelines, and add inference unit tests	2024-02-20 02:05:23 +00:00
examples	!869 Qwen-72B inference and evaluation	2024-03-07 06:16:34 +00:00
megatron	feat: migrate megatron commitId=bcce6f54	2023-12-29 11:08:34 +08:00
modellink	!874 Fix package import bug in the patch that supports uneven partitioning	2024-03-06 01:09:27 +00:00
sources/images	!869 Qwen-72B inference and evaluation	2024-03-07 06:16:34 +00:00
tasks	!869 Qwen-72B inference and evaluation	2024-03-07 06:16:34 +00:00
tests	!890 Modify pipeline configuration path parameters and configure the launch command path	2024-03-07 02:21:32 +00:00
tools	!832 Update the Bloom 176B training script and README	2024-03-05 06:43:04 +00:00
.gitignore	update .gitignore.	2023-09-26 11:00:39 +00:00
LICENSE	Initial commit	2023-05-25 02:15:25 +00:00
OWNERS	!622 Update OWNERS	2024-02-05 07:13:39 +00:00
README.md	!869 Qwen-72B inference and evaluation	2024-03-07 06:16:34 +00:00
README_en.md	!869 Qwen-72B inference and evaluation	2024-03-07 06:16:34 +00:00
SECURITY.md	fork megatron-deepspeed code.	2023-05-25 14:49:59 +08:00
pretrain_gpt.py	!760 Support the Mixtral 8x7B MoE model	2024-03-04 10:38:57 +00:00
public_address_statement.md	!560 Public URL statement	2024-01-29 13:51:27 +00:00
requirements.txt	!869 Qwen-72B inference and evaluation	2024-03-07 06:16:34 +00:00
setup.py	!557 Secure compilation options	2024-01-29 14:01:30 +00:00

README_en.md

Simplified Chinese | English

ModelLink provides end-to-end solutions for large language models on Ascend chips, including models, algorithms, kernels, and tasks.

Why ModelLink?

Prepared Models

Currently, the following models are supported:

Downstream Tasks

Currently, the following downstream tasks have been supported:

Instruction/Pretraining dataset support
Low-parameter fine-tuning
Inference: human-machine dialogue
Evaluation with numerous benchmarks

Quick Start For Model Training

Model Performance

Model	Size	Cluster (nodes x NPUs)	Precision	NPU performance	Reference performance	Scripts
Aquila	7B	1x8	BF16	2849	4078	Train
Baichuan	7B	1x8	FP16	2643	2036	Train
Baichuan	13B	1x8	FP16	1213	824	Train
Baichuan2	7B	1x8	BF16	2598	3936	Train
Baichuan2	13B	1x8	BF16	880	872	Train
Bloom	7B1	1x8	FP16	2034	2525	Train
Bloom	176B	12x8	BF16	100	107	Train
InternLM	7B	1x8	BF16	2943	4078	Train
InternLM	65B	4x8	BF16	342	414	Train
LLaMA	7B	1x8	FP16	3763	3804	Train
LLaMA	13B	1x8	FP16	1894	2012	Train
LLaMA	33B	4x8	FP16	621	776	Train
LLaMA	65B	4x8	BF16	348	426	Train
LLaMA2	7B	1x8	BF16	2662	2884	Train
LLaMA2	13B	1x8	BF16	1550	1750	Train
LLaMA2	34B	2x8	BF16	690	796	Train
LLaMA2	70B	8x8	BF16	350	339	Train
Qwen	7B	1x8	BF16	2499	2867	Train
Qwen	14B	1x8	BF16	1560	1578	Train
Qwen	72B	16x8	BF16	285	345	Train
Mixtral	8x7B	2x8	BF16	1054	1139	Train

Model Training Software

Software	Version
Python	3.8.18
driver	2023Q4 Commercial Version
firmware	2023Q4 Commercial Version
CANN	2023Q4 Commercial Version
binary operator package	2023Q4 Commercial Version
torch	2.1.0
torch_npu	2023Q4 Commercial Version

Downstream Tasks

Content List

Model	Size	Fine-tuning	Inference	Evaluation	Dataset Support
Aquila	7B	--	inference	evaluation	alpaca_data.json
Baichuan	7B	--	inference	evaluation	alpaca_data.json
Baichuan	13B	lora	inference	evaluation	alpaca_data.json
Baichuan2	7B	--	inference	evaluation	alpaca_data.json
Baichuan2	13B	--	inference	evaluation	alpaca_data.json
Bloom	7B1	lora	inference	evaluation	alpaca_data.json
Bloom	176B	--	inference	evaluation	alpaca_data.json
InternLM	7B	--	inference	evaluation	alpaca_data.json
LLaMA	7B	lora	inference	evaluation	alpaca_data.json
LLaMA	13B	lora	inference	evaluation	alpaca_data.json
LLaMA	33B	lora	inference	evaluation	alpaca_data.json
LLaMA	65B	lora	inference	evaluation	alpaca_data.json
LLaMA2	7B	lora	inference	evaluation	alpaca_data.json
LLaMA2	13B	lora	inference	evaluation	alpaca_data.json
LLaMA2	34B	lora	inference	evaluation	alpaca_data.json
LLaMA2	70B	lora	inference	evaluation	alpaca_data.json
Qwen	7B	--	inference	evaluation	alpaca_data.json
Qwen	14B	--	inference	evaluation	alpaca_data.json
Qwen	72B	--	inference	evaluation	alpaca_data.json
Mixtral	8x7B	--	inference	evaluation	alpaca_data.json

Instruction/Pretraining dataset support

Quick Start

```
# For LLaMA, download the Alpaca dataset, e.g.:
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet

# Download the tokenizer configs and (optionally) the weights from
# https://huggingface.co/yahma/llama-7b-hf/tree/main
# Change "LLaMATokenizer" to "LlamaTokenizer" in tokenizer_config.json (a naming bug in the Hugging Face files)
mkdir dataset
python tools/preprocess_data.py --input train-00000-of-00001-a09b74b3ef9c3b56.parquet \
                                --output-prefix dataset/alpaca \
                                --tokenizer-type PretrainedFromHF \
                                --tokenizer-name-or-path llama-7b-hf \
                                --tokenizer-not-use-fast \
                                --handler-name GeneralInstructionHandler
```

Preprocessing pretraining dataset

wikipedia dataset

Download the Wikipedia data from Hugging Face to WORKSPACE/wikipedia.
Download the LLaMA tokenizer model and config from Hugging Face to WORKSPACE/llama-7b-hf.
Use the preprocessing script to preprocess the Wikipedia data.

```
# We assume that the data and tokenizer have already been downloaded to WORKSPACE.
cd WORKSPACE
mkdir wikipedia_preprocessed

# Specify the Hugging Face load_dataset parameters (the --input param will be ignored).
# These params are passed directly to the datasets.load_dataset function.
hf_config_json="./hf_config_json.json"
cat <<EOT > $hf_config_json
{
    "path": "WORKSPACE/wikipedia",
    "name": "20220301.en",
    "streaming": true,
    "split": "train"
}
EOT

python tools/preprocess_data.py \
    --input "WORKSPACE/wikipedia" \
    --hf-datasets-params ${hf_config_json} \
    --output-prefix WORKSPACE/wikipedia_preprocessed/wikipedia \
    --dataset-impl mmap \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path WORKSPACE/llama-7b-hf \
    --tokenizer-not-use-fast \
    --streaming \
    --workers 8
```
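Writing JSON by hand inside a heredoc makes quoting mistakes easy, and a single mis-quoted key makes the file unparseable. As a minimal sketch, the same parameters can be generated with Python's standard json module, which serializes True as JSON true. The helper name write_hf_config is hypothetical, and the parameter values are copied from the config above.

```python
import json

# Parameters forwarded to datasets.load_dataset by tools/preprocess_data.py
# (values copied from the heredoc above; adjust the path to your WORKSPACE).
hf_params = {
    "path": "WORKSPACE/wikipedia",
    "name": "20220301.en",
    "streaming": True,  # json.dump writes this as `true`
    "split": "train",
}


def write_hf_config(params, filename="./hf_config_json.json"):
    """Write the load_dataset parameters as valid JSON and return the file path."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(params, f, indent=4)
    return filename
```

Calling write_hf_config(hf_params) produces a file that can be passed to --hf-datasets-params in place of the heredoc.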

After preprocessing, the WORKSPACE/wikipedia_preprocessed directory will contain wikipedia_text_document.bin and wikipedia_text_document.idx. You can then train a model with the --data-path WORKSPACE/wikipedia_preprocessed/wikipedia_text_document flag.

Note that Hugging Face datasets store each sample as a set of named fields (columns). The field used as training text is selected with the --json-key flag, whose default is text. For example, the Wikipedia dataset has four columns (id, url, title, and text), and --json-key chooses which of them is used for training.
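Judging from the Wikipedia example above (output prefix .../wikipedia, default --json-key text, output files wikipedia_text_document.bin and .idx), the output names appear to follow the pattern {output_prefix}_{json_key}_document. The sketch below assumes that pattern to compute the value to pass to --data-path and to check that both files exist; the helper names are hypothetical.

```python
import os


def pretrain_data_paths(output_prefix, json_key="text"):
    """Return (data_path, bin_file, idx_file) for a preprocessed pretraining dataset.

    Assumption: names follow {output_prefix}_{json_key}_document, inferred from
    the Wikipedia example (wikipedia -> wikipedia_text_document.bin/.idx).
    """
    data_path = f"{output_prefix}_{json_key}_document"
    return data_path, data_path + ".bin", data_path + ".idx"


def check_outputs(output_prefix, json_key="text"):
    """Raise FileNotFoundError if an expected file is missing; else return --data-path."""
    data_path, bin_file, idx_file = pretrain_data_paths(output_prefix, json_key)
    for path in (bin_file, idx_file):
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
    return data_path
```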

alpaca dataset

The Alpaca dataset can also be used for pretraining, as shown below.

```
python tools/preprocess_data.py --input WORKSPACE/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
                                --output-prefix WORKSPACE/alpaca_preprocessed/alpaca \
                                --tokenizer-type PretrainedFromHF \
                                --tokenizer-name-or-path WORKSPACE/llama-7b-hf \
                                --tokenizer-not-use-fast \
                                --json-key text
```

Preprocessing instruction dataset

alpaca dataset

```
# For LLaMA, download the Alpaca dataset, e.g.:
# wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet

# Download the tokenizer configs and (optionally) the weights from
# https://huggingface.co/yahma/llama-7b-hf/tree/main
# Change "LLaMATokenizer" to "LlamaTokenizer" in tokenizer_config.json (a naming bug in the Hugging Face files)

cd WORKSPACE
mkdir alpaca_preprocessed
python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
                                --output-prefix WORKSPACE/alpaca_preprocessed/alpaca \
                                --tokenizer-type PretrainedFromHF \
                                --tokenizer-name-or-path WORKSPACE/llama-7b-hf \
                                --tokenizer-not-use-fast \
                                --handler-name GeneralInstructionHandler \
                                --append-eod
```

After preprocessing, the WORKSPACE/alpaca_preprocessed directory will contain three .bin files and three .idx files. You can then train a model with the --data-path WORKSPACE/alpaca_preprocessed/alpaca and --is-instruction-dataset flags. In addition, we provide dynamic padding for instruction datasets, which is enabled with the --variable-seq-lengths flag.

Note that for instruction datasets, the --handler-name GeneralInstructionHandler flag selects the GeneralInstructionHandler class in modellink/data/data_handler.py to build prompts. If you have an Alpaca-style dataset with instruction, input, and output columns, simply use GeneralInstructionHandler. In addition, BelleMultiTurnInstructionHandler handles the BELLE dataset, MOSSInstructionHandler handles the MOSS dataset, and LeetcodePythonInstructionHandler handles the Leetcode dataset.
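To illustrate what an instruction handler does, the sketch below turns an Alpaca-style record (instruction, input, output) into a prompt/target pair. The template text and the build_sample helper are hypothetical and do not reproduce the exact prompt built by GeneralInstructionHandler. See modellink/data/data_handler.py for the real implementation.

```python
# Hypothetical templates: NOT the exact prompt used by GeneralInstructionHandler.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_sample(record):
    """Turn an alpaca-style record into (prompt, target) strings."""
    if record.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        # Records with an empty input use the shorter template.
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]
```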

Low-parameter fine-tuning

LoRA

ModelLink supports fine-tuning models with LoRA.

First, install version 0.4.0 of the peft library:

```
pip install peft==0.4.0
```

If you use torch==1.11.0, you can instead install peft from its source package in the GitHub repository. This lets you modify its setup.py file to avoid some dependency issues.

Next, add this argument to your script to enable LoRA:

```
# Llama example
--lora-target-modules query_key_value dense gate_proj dense_h_to_4h dense_4h_to_h \
```

Other LoRA-related arguments are shown below. Their definitions can be found in the PEFT library.

```
# Llama example
--lora-r 64 \
--lora-alpha 128 \
--lora-modules-to-save word_embeddings output_layer \
--lora-register-forward-hook word_embeddings input_layernorm \
```

The --lora-register-forward-hook argument repairs the break in the gradient chain caused by pipeline parallelism (PP). It only needs to be set on the input layer of each PP stage, and the repair does not add trainable parameters. The --lora-modules-to-save argument is used for fine-tuning when the vocabulary is expanded. If you do not expand the vocabulary, omit it.
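To make --lora-r and --lora-alpha concrete, here is a minimal pure-Python sketch of a LoRA-wrapped linear layer. It assumes the standard LoRA formulation used by PEFT: y = Wx + (alpha / r) * B(Ax), where W is frozen and only A and B are trainable. With the example values above (--lora-r 64, --lora-alpha 128), the update is scaled by 2. The LoRALinear class is illustrative and is not ModelLink code.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Forward: y = W x + (alpha / r) * B (A x). B starts at zero, so the layer
    initially reproduces the frozen model exactly.
    """

    def __init__(self, weight, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen, not updated during fine-tuning
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]  # r x d_in
        self.B = [[0.0] * r for _ in range(d_out)]  # d_out x r, zero-initialized
        self.scaling = alpha / r

    def trainable_params(self):
        """Count only the adapter parameters (A and B)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scaling * d for b, d in zip(base, delta)]
```

For a 4096 x 4096 projection with r = 64, the adapter has 64 * (4096 + 4096) = 524,288 trainable parameters versus 16,777,216 in the frozen weight. This is why saving only the LoRA parameters is cheap.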

Note that with LoRA enabled, only the LoRA parameters are saved. Accordingly, when loading a model, you must specify both the original model weight path and the LoRA weight path. Parameters such as the optimizer state are taken from the LoRA weight path.

```
--load ${ORIGIN_CHECKPOINT} \
--lora-load ${LORA_CHECKPOINT} \
```

An example is available for reference.

After fine-tuning the Llama model with LoRA, an instruction dialogue looks like this:

You >> Give three tips for staying healthy.

ModelLink:

- Start exercising regularly and eat healthy food.
- Get a good eight hours of sleep each night.
- Take medications regularly.

If you need a model without the LoRA structure after LoRA fine-tuning, run this script to merge the weights given by --load and --lora-load into a single model without the LoRA structure. The merged weights are stored in the --save path.
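Merging works because the LoRA update is linear. Under the standard formulation, the merged weight is W' = W + (alpha / r) * B A, which gives the same output as the base weight plus the adapter. The sketch below shows that arithmetic on small matrices. merge_lora is a hypothetical helper, and the actual merge script may handle TP/PP-partitioned checkpoints differently.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(weight, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A: a plain weight with the LoRA update folded in."""
    scaling = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]
```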

Inference: human-machine dialogue

Currently, we support the following inference modes:

PTD only
Model fine-tuned with LoRA

Quick Start

Below are example scripts for each of the modes above that you can launch directly.

Please note:

If you want to use weights from Hugging Face, run the weight conversion script first. Taking Llama-7B as an example:

PTD only

```
python tools/checkpoint/util.py --model-type GPT \
                                --loader llama2_hf \
                                --saver megatron \
                                --target-tensor-parallel-size 2 \
                                --target-pipeline-parallel-size 2 \
                                --load-dir ./model_from_hf/llama-7b-hf \
                                --save-dir ./model_weights/llama-7b-tp2-pp2 \
                                --tokenizer-model ./model_from_hf/llama-7b-hf/tokenizer.model
```

Modify the variables in the shell scripts, such as the model weight path and vocab path, before running them.
- PTD only: in this mode, the model is split with Megatron-style pipeline parallelism and tensor parallelism.
```
sh tasks/inference/generate_llama_7B_ptd.sh
```
- To use a LoRA fine-tuned model, refer to:
```
sh tasks/inference/generate_llama_7b_lora_ptd.sh
```

Some examples with Chinese-LLaMA-Alpaca-13B weights are shown below.

Usage Guide

Follow these steps to write your own inference code:

Initializing the Distributed Environment

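```
# Assumption: initialize_megatron comes from the vendored Megatron code (megatron/initialize.py)
from megatron.initialize import initialize_megatron

initialize_megatron(args_defaults={'no_load_rng': True, 'no_load_optim': True})
```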

Initializing model and loading weights

```
from modellink import get_args
from modellink.model import GPTModel
from modellink.arguments import core_transformer_config_from_args


def model_provider(pre_process=True, post_process=True):
    """Build the model."""
    config = core_transformer_config_from_args(get_args())
    init_model = GPTModel(
        config,
        num_tokentypes=0,
        parallel_output=False,
        return_moe_loss=False,
        pre_process=pre_process,
        post_process=post_process
    )
    return init_model


model = GPTModel.from_pretrained(
    model_provider=model_provider,
    pretrained_model_name_or_path="your model weight path"
)

"""
This is an API for initializing the model and loading weights.

Parameters:
----------
model_provider(`func`):
    Function that builds the model object, similar to the one defined for training.
pretrained_model_name_or_path(`str`, *optional*, defaults to None):
    Path to the model weights in Megatron format (TP and PP partitioning may be used).
    If it is None, randomly initialized weights are used.
"""
```

Generating text in a Hugging Face-like way

Greedy Search

```
responses = model.generate(
    "Write quick sort code in python",
    max_new_tokens=512
)
```

Sampling with top-k and top-p

```
responses = model.generate(
    "Write quick sort code in python",
    do_sample=True,
    temperature=1.0,
    top_k=50,
    top_p=0.95,
    max_new_tokens=512
)
```

Beam search with top-k and top-p

```
responses = model.generate(
    "Write quick sort code in python",
    num_beams=4,
    top_k=50,
    top_p=0.95,
    max_new_tokens=512
)
```

Beam search with top-k and top-p sampling

```
responses = model.generate(
    "Write quick sort code in python",
    do_sample=True,
    temperature=0.6,
    num_beams=4,
    top_k=50,
    top_p=0.95,
    max_new_tokens=512
)
```
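As a rough illustration of what temperature, top_k, and top_p control in the sampling examples above, the sketch below samples one token from a list of logits. Temperature rescales the logits, top-k keeps the k most likely tokens, and top-p (nucleus) filtering keeps the smallest set of tokens whose cumulative probability reaches top_p. This is a generic sketch of these techniques, not ModelLink's generate implementation. sample_next_token is a hypothetical name.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.95, rng=random):
    """Sample a token id after temperature scaling, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables this filter).
    if top_k > 0:
        probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens and draw one of them.
    threshold = rng.random() * sum(p for p, _ in kept)
    for p, i in kept:
        threshold -= p
        if threshold <= 0:
            return i
    return kept[-1][1]
```

Lower temperatures sharpen the distribution toward greedy decoding, while smaller top_k or top_p values shrink the candidate set.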

Evaluation with Numerous Benchmarks

Quick Show

Task	Subset	Model	NPU	Reference	Benchmark
BBH	test	Llama7b	0.334	0.333	0.335
AGIEval	test	Llama7b	0.210	0.210	0.206
HumanEval	test	Llama7b	0.128	0.128	0.128
BoolQ	test	Llama7b	0.742	0.742	0.754
GSM8K	test	Llama7b	0.102	0.103	0.100
CEval	val	Llama7b	0.408	0.404	/
MMLU	test	Llama7b	0.333	0.324	0.351

Quick Start

```
# Configure the model path and vocab_file path
# The vocab file can be downloaded from https://huggingface.co/yahma/llama-7b-hf
CHECKPOINT=../models/llama-7b-tp2-pp4/
VOCAB_FILE=../models/llama7b-hf/
# Configure the task and data path
DATA_PATH="dataset/boolq/test"
TASK="boolq"
# Configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation_llama.py   \
       --task-data-path $DATA_PATH \
       --task $TASK \
       --seq-length 512 \
       --max-new-tokens 1 \
       --evaluation-batch-size 1 \
       --max-position-embeddings 512 \
       --tensor-model-parallel-size 2  \
       --pipeline-model-parallel-size 4  \
       --num-layers 32  \
       --hidden-size 4096  \
       --ffn-hidden-size 11008 \
       --load ${CHECKPOINT}  \
       --num-attention-heads 32  \
       --tokenizer-type PretrainedFromHF  \
       --tokenizer-name-or-path $VOCAB_FILE \
       --tokenizer-not-use-fast \
       --fp16  \
       --micro-batch-size 1  \
       --seed 42 | tee logs/train.log
# Start evaluation
bash tasks/evaluation/eval_llama.sh
```

Task Introduction

The most important evaluation parameter is --max-new-tokens, which sets the maximum output length of model generation. For example, the output for multiple-choice questions is much shorter than for coding tasks. This parameter also largely determines the speed of model generation.

```
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation_llama.py   \
       --task-data-path $DATA_PATH \
       --task $TASK \
       --seq-length 512 \
       --max-new-tokens 1 \
       --evaluation-batch-size 1 \
       --max-position-embeddings 512 \
       --tensor-model-parallel-size 2  \
       --pipeline-model-parallel-size 4  \
       --num-layers 32  \
       --hidden-size 4096  \
       --ffn-hidden-size 11008 \
       --load ${CHECKPOINT}  \
       --num-attention-heads 32  \
       --tokenizer-type PretrainedFromHF  \
       --tokenizer-name-or-path $VOCAB_FILE \
       --tokenizer-not-use-fast \
       --fp16  \
       --micro-batch-size 1  \
       --seed 42 | tee logs/train.log
```

BoolQ

BoolQ is a question answering dataset for yes/no questions. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. Evaluating BoolQ is relatively simple: just configure TASK="boolq" and --max-new-tokens 1. Zero-shot results are usually sensitive to the given prompt, and a suitable prompt can yield a higher score. The prompt can be modified in tasks/evaluation/evaluation.py.

```
# Update the prompt by changing the template
template = "{instruction}"
```

MMLU

MMLU is a multidisciplinary benchmark evaluated 5-shot, so question lengths vary greatly across subjects. To run all 57 subjects at once, set TASK="mmlu" and --max-new-tokens 1. MMLU accuracy is commonly reported by discipline: the 57 individual subjects belong to four major categories, so results should be summarized by major category. The website lists the major category of each of the 57 subjects.
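The category-level summary can be sketched as follows, assuming per-subject results are available as (correct, total) counts. SUBJECT_CATEGORY is only an illustrative four-subject excerpt of the official subject-to-category mapping. Weighting by question count is one possible choice; averaging per-subject accuracies is another.

```python
from collections import defaultdict

# Illustrative excerpt of MMLU's subject -> major category mapping (assumption:
# the full mapping for all 57 subjects is published with the benchmark).
SUBJECT_CATEGORY = {
    "abstract_algebra": "STEM",
    "high_school_us_history": "humanities",
    "econometrics": "social sciences",
    "clinical_knowledge": "other",
}


def category_accuracy(results):
    """Aggregate per-subject (correct, total) counts into per-category accuracy.

    Accuracy is weighted by question count: sum(correct) / sum(total) per category.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, (n_correct, n_total) in results.items():
        category = SUBJECT_CATEGORY[subject]
        correct[category] += n_correct
        total[category] += n_total
    return {category: correct[category] / total[category] for category in total}
```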

GSM8K

GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. The answer to each question is a specific number. Because the evaluation is few-shot, GSM8K prompts are relatively long, and the output answer contains a chain of thought. Therefore, configure TASK="gsm8k" and --max-new-tokens 200.
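Because GSM8K outputs include a chain of thought, the final answer has to be extracted before scoring. A common heuristic, sketched below under that assumption, takes the last number in the generation and compares it with the number after "####" in the GSM8K reference solution. ModelLink's own answer parsing may differ, and the helper names are hypothetical.

```python
import re

# An optional minus sign, digits with optional thousands separators, optional decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number in a chain-of-thought answer, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(generated, reference):
    """Compare the last generated number with the reference's final answer.

    GSM8K reference solutions end with '#### <answer>', so only the text after
    the last '####' is parsed on the reference side.
    """
    pred = extract_final_number(generated)
    gold = extract_final_number(reference.split("####")[-1])
    return pred is not None and gold is not None and abs(pred - gold) < 1e-6
```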

HumanEval

The HumanEval dataset is a handcrafted set of 164 programming problems designed to challenge code generation models. Each problem includes a function signature, docstring, body, and several unit tests, all handwritten to ensure they are not included in the training sets of code generation models. Since HumanEval answers contain long code, configure TASK="human_eval" and --max-new-tokens 200.

AGIEval

AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers. These include general college admission tests (e.g., the Chinese College Entrance Exam (Gaokao) and the American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. Since answer lengths vary across question types, configure TASK="agieval" and --max-new-tokens 5 to fit the longest answer.

Big-Bench-Hard

BIG-Bench Hard (BBH) is a subset of BIG-Bench, a diverse evaluation suite. BBH focuses on 23 challenging BIG-Bench tasks on which prior language model evaluations did not outperform the average human rater. The dataset covers multiple areas, including text understanding, reasoning, logical reasoning, mathematical reasoning, and common-sense reasoning. Except for word_sorting, all tasks are multiple-choice questions, so we can set TASK="bbh", --max-new-tokens 32, and --evaluation-batch-size 4.

CEval

As described by its authors, C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. You can explore dataset examples on the C-Eval website or read the C-Eval paper for more details. The dataset contains validation and test data, but only the validation data has labels for automatic evaluation. To evaluate on the test data, you must email your results to C-Eval. We can set TASK="ceval" and --max-new-tokens 1.
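The recommended settings above can be collected into a small lookup table. This sketch assumes you build the task-specific arguments in Python before launching evaluation. MAX_NEW_TOKENS and eval_task_args are hypothetical names, and the values are the ones given in the task descriptions.

```python
# Recommended --max-new-tokens per task, as given in the task descriptions above.
MAX_NEW_TOKENS = {
    "boolq": 1,
    "mmlu": 1,
    "ceval": 1,
    "agieval": 5,
    "bbh": 32,
    "gsm8k": 200,
    "human_eval": 200,
}


def eval_task_args(task, data_path):
    """Build the task-specific part of the evaluation command line."""
    args = [
        "--task-data-path", data_path,
        "--task", task,
        "--max-new-tokens", str(MAX_NEW_TOKENS[task]),
    ]
    if task == "bbh":
        # BBH is also run with a larger evaluation batch, per its description above.
        args += ["--evaluation-batch-size", "4"]
    return args
```

For example, " ".join(eval_task_args("gsm8k", "dataset/gsm8k/test")) yields flags that can be appended to the launch command shown earlier.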

Configuration of models and datasets

In the example below, we evaluate the LLaMA-7B model on the BoolQ dataset, so the model path and vocab file must correspond to LLaMA-7B. The model can be partitioned with suitable parallelism parameters. The following example sets tensor-model-parallel-size (TP) = 2 and pipeline-model-parallel-size (PP) = 4:

```
python convert_weights_from_huggingface.py \
        --input-model-dir /home/w425040/models/llama-7b-hf \
        --output-model-dir /home/w425040/models/llama-7b-tp2-pp4 \
        --type 7B \
        --tensor-model-parallel-size 2 \
        --pipeline-model-parallel-size 4
```
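With TP = 2 and PP = 4, the model occupies 2 x 4 = 8 NPUs, which matches a single 1x8 node. The sketch below shows which of LLaMA-7B's 32 layers (--num-layers 32) each (pipeline stage, tensor rank) pair would hold. It assumes layers are divided evenly and contiguously across pipeline stages, and that every tensor-parallel rank in a stage holds a 1/TP shard of each of that stage's layers. partition_layers is a hypothetical helper.

```python
def partition_layers(num_layers, tp, pp):
    """Map each (pipeline stage, tensor rank) to the transformer layers it holds.

    Assumes an even, contiguous split across pipeline stages; each tensor-parallel
    rank in a stage stores a 1/tp shard of every layer in that stage.
    """
    if num_layers % pp != 0:
        raise ValueError("num_layers must be divisible by the pipeline-parallel size")
    per_stage = num_layers // pp
    layout = {}
    for stage in range(pp):
        layers = list(range(stage * per_stage, (stage + 1) * per_stage))
        for rank in range(tp):
            layout[(stage, rank)] = layers
    return layout
```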

Then, configure the dataset path and task. Note: because evaluation parameters differ across datasets, evaluating two or more different datasets together is not recommended. Evaluation parameters such as --seq-length, --max-new-tokens, and --max-position-embeddings need to be adjusted for each dataset. The recommended parameters for each dataset are given in the task descriptions above.

```
# Configure the model path and vocab_file path
CHECKPOINT=../models/llama-7b-tp2-pp4/
VOCAB_FILE=../models/llama7b-hf/
# Configure the task and data path
DATA_PATH="dataset/boolq/test"
TASK="boolq"
# Configure generation parameters
```

Acknowledgments

ModelLink is jointly contributed by the following departments of Huawei Corporation:

Ascend Computing Product Unit
Algorithm Unit of Computing Product Unit
Research Unit of Computing Product Unit
Open Computing Kit of Computing Product Unit
General Development Department
Global Technical Service Department

We appreciate every PR from the community, and contributions to ModelLink are welcome.

Appendix

Inner Function Description: an introduction to some internal implementation interfaces is available at InnerInterface.
Parameters Description: descriptions and usage of some parameters are available at param.
Permission Description: It is recommended that the Linux umask value be greater than or equal to 027. Before running the program, you are advised to take security measures, such as permission control, for files required for training, such as checkpoints and logs. You are advised to run the program or execute commands as a regular user rather than as root or a superuser. You are also advised to set folder permissions to 750 and file permissions to 640. When multiple users share datasets, set the read and write permissions of folders and files according to the principle of least privilege to avoid security problems such as unauthorized access.
Path Description: When you use interfaces such as torch.load, the pickle module is used implicitly unless the weights_only parameter is set to True, and pickle is known to be insecure. Malicious pickle data can be constructed to execute arbitrary code during unpickling. Do not load data in an unsafe mode if it could have come from an untrusted source or could have been tampered with; load only data you trust. Moreover, when you read data from external or user-specified paths, including but not limited to weight paths and dataset paths, make sure they are trusted and safe.
Communication Matrix: Please refer to this link to check the communication matrix.
Public Network Address: Here is the Public Network Address.
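To see why loading untrusted pickles is dangerous and how restricting them helps, the sketch below uses the Python standard library's documented pattern of overriding pickle.Unpickler.find_class. This prevents any global, and therefore any callable such as os.system, from being resolved during loading. It illustrates the principle behind torch.load(..., weights_only=True) and is not PyTorch's implementation. RestrictedUnpickler and restricted_loads are hypothetical names.

```python
import io
import pickle


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import any global (class or function)."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Load pickled data made only of built-in containers and scalars."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain dictionaries of numbers load normally. A payload whose __reduce__ points at os.system is rejected before anything is executed.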
