
Aquila-7B 【Model contributed by Ascend】

简体中文 | English

Training

Here's a hardware summary of pre-training Aquila-7B:

| Hardware | Device | Value           |
| -------- | ------ | --------------- |
| NPU      | A+K    | 8 x Ascend NPUs |

Script

1. Clone the repositories to your local server and check out the required versions:

```shell
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
git checkout 1.0
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
```

2. Set up the conda environment:

```shell
# python3.8
conda create -n test python=3.8
conda activate test

# install torch, torch_npu and apex
pip install torch==2.1.0
pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl

# source the set_env.sh file based on your host settings (you may need to change the path)
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# clone the MindSpeed source code, enter the directory, then install the mindspeed package from source
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed/
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip install -e .
cd ..

# install other packages
pip install -r requirements.txt
```

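Optionally, you can verify that torch and torch_npu were installed correctly before continuing. A minimal sanity check, assuming the Ascend environment has already been sourced:

```shell
# optional sanity check: should print "True" if the NPU stack is usable
python -c "import torch; import torch_npu; print(torch_npu.npu.is_available())"
```
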
3. Download the Aquila-7B model, config, and tokenizer from here.

Save them to the ModelLink/model_from_hf/Aquila-7B/ directory so that the paths used in the following steps resolve correctly.

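One possible way to fetch the files is with git-lfs; this is only a sketch, and the BAAI/Aquila-7B repository name on Hugging Face is an assumption here:

```shell
# requires git and git-lfs; downloads the HuggingFace checkpoint into the expected directory
git lfs install
git clone https://huggingface.co/BAAI/Aquila-7B ./model_from_hf/Aquila-7B
```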

4. Prepare the dataset.

Step 1: Download the dataset from here and save it to the ModelLink/dataset/ directory.

```shell
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
```

Step 2: Use the Aquila-7B tokenizer to pre-process the data:

```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset/Aquila-7B/
python ./tools/preprocess_data.py \
    --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
    --output-prefix ./dataset/Aquila-7B/alpaca \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type PretrainedFromHF
```

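When preprocessing finishes, the output prefix should have produced an indexed dataset, i.e. a .bin/.idx pair whose name matches the DATA_PATH used later in the pre-training script:

```shell
ls ./dataset/Aquila-7B/
# expected files (assuming the default "text" json key):
#   alpaca_text_document.bin
#   alpaca_text_document.idx
```
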
5. Convert weights.

HuggingFace weights --> Megatron weights (This scenario is generally used to train open-source HuggingFace models on Megatron)

```shell
# please modify the path to set_env.sh based on your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh

python tools/checkpoint/util.py \
    --model-type GPT \
    --load-dir ./model_from_hf/Aquila-7B/ \
    --save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
```

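After conversion you can check the save directory; with --target-tensor-parallel-size 8 the checkpoint is typically split into one sub-directory per tensor-parallel rank. This is a rough sketch of what to expect, as the exact layout depends on the Megatron checkpoint format used by the tools:

```shell
ls ./model_weights/Aquila-7B-v0.1-tp8-pp1/
# typically contains latest_checkpointed_iteration.txt and a checkpoint directory
# holding one mp_rank_0X sub-directory per tensor-parallel rank
```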

Megatron weights with any parallel slicing strategy --> HuggingFace weights (this scenario is generally used to convert a trained Megatron model back to the HuggingFace format)

```shell
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
    --loader megatron \
    --saver megatron \
    --save-model-type save_huggingface_llama \
    --load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1 \
    --save-dir ./model_from_hf/Aquila-7B/   # <-- Fill in the original HF model path here; new weights will be saved in ./model_from_hf/Aquila-7B/mg2hg/
```

6. Configure the Aquila-7B pre-training script.

Configure the environment variables in the Aquila pre-training script examples/aquila/pretrain_aquila_7b_ptd.sh:

```shell
# set the dataset path, the checkpoint load/save paths, and the tokenizer path
TOKENIZER_PATH="./model_from_hf/Aquila-7B/"                # tokenizer path
DATA_PATH="./dataset/Aquila-7B/alpaca_text_document"       # processed dataset
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"    # converted model weights
CKPT_SAVE_DIR="./ckpt/Aquila-7B/"                          # path for saving checkpoints
```

Note that if you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR and remove the --load parameter from the training script, and vice versa. Likewise, if you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR and remove the --save $CKPT_SAVE_DIR parameter, and vice versa. If you want to save a checkpoint and load it in a later pre-training run, simply follow both the "save" and "load" suggestions above.

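If you are unsure which lines to edit, you can first locate where these variables and flags appear in the launch script, for example:

```shell
# show the lines of the launch script that reference the load/save flags and data/tokenizer paths
grep -n -e "--load" -e "--save" -e "DATA_PATH" -e "TOKENIZER_PATH" examples/aquila/pretrain_aquila_7b_ptd.sh
```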

7. Launch the Aquila-7B pre-training script.

Before running the pre-training script, please execute the set_env.sh script to set up the environment variables. Alternatively, you can source it inside the Aquila pre-training script.

```shell
# you may need to change the path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```

Start pre-training the Aquila-7B model:

```shell
bash examples/aquila/pretrain_aquila_7b_ptd.sh
```

Note: For multi-machine training without a shared-storage configuration across the machines, you must add the --no-shared-storage parameter. With this parameter set, non-master nodes decide whether they need to load data based on the distributed parameters, and the corresponding cache and generated data are checked.

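In such a setup, the flag is appended to the training argument list inside examples/aquila/pretrain_aquila_7b_ptd.sh; the following is an excerpt-style sketch of the added line only, not a standalone command:

```shell
    --no-shared-storage \
```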

Performance

Machine performance

The performance of Aquila-7B on Ascend NPUs and a reference device:

| Device    | Hardware      | Model     | Iterations | Throughput rate (tokens/p/s) | Single iteration step time (s/step) |
| --------- | ------------- | --------- | ---------- | ---------------------------- | ----------------------------------- |
| NPU       | 910b 1node*8p | Aquila-7B | 1000       | 2849                         | 5.75                                |
| Reference | -             | Aquila-7B | 1000       | 2874                         | 5.70                                |

Inference

We support MindSpeed Inference for text generation with the Aquila-7B model.

Inference differs from pre-training in that it requires loading the pre-trained model weights. Therefore, complete the model weight conversion task described above first, then configure the Aquila-7B inference shell script tasks/inference/generate_aquila_7b_ptd.sh. CKPT_LOAD_DIR must point to the converted weights directory, and TOKENIZER_PATH must point to the directory containing the Aquila vocabulary files -- in our example, "./model_from_hf/Aquila-7B/". Please fill in the correct values based on your actual setup.

```shell
# please change to actual values
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila-7B/"
```

Start Aquila-7B Inference:

```shell
bash ./tasks/inference/generate_aquila_7b_ptd.sh
```

Sample results of Aquila-7B Inference:

(See aquila-7B_generate.png for an example of the generated output.)

Evaluation with Benchmark

We use the BoolQ benchmark to evaluate the model. Go to the BoolQ benchmark page, download the dataset, and save it, for example, to the "ModelLink/boolq/test" directory.

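As an example, the BoolQ validation split can be fetched directly and placed under that directory; the hosting URL below is an assumption based on the dataset's public release and may need to be adjusted:

```shell
mkdir -p ./boolq/test
cd ./boolq/test
# BoolQ validation split in jsonl format (URL is an assumption)
wget https://storage.googleapis.com/boolq/dev.jsonl
cd ../..
```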

The evaluation task is similar to the inference task: it also requires loading the pre-trained model weights. Please note that the model structure parameters used when converting the weights should be consistent with those used when running the evaluation task.

After weight conversion is complete, configure the Aquila-7B evaluation script tasks/evaluation/evaluate_aquila_7b_ptd.sh. Correctly specify the path for loading the weights, the path to the tokenizer and vocab, and so on (the following example is for reference only):

CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila-7B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"

Start the evaluation task:

```shell
bash ./tasks/evaluation/evaluate_aquila_7b_ptd.sh
```

Sample Aquila-7B evaluation results on Ascend NPUs:

| Task  | Model     | NPU   | Benchmark |
| ----- | --------- | ----- | --------- |
| BoolQ | Aquila-7B | 77.3% | --        |