Quickstart
==========

This guide helps you quickly start using Qwen1.5. We provide examples of
`Hugging Face Transformers <https://github.com/huggingface/transformers>`__
as well as `ModelScope <https://github.com/modelscope/modelscope>`__, and
`vLLM <https://github.com/vllm-project/vllm>`__ for deployment.

Hugging Face Transformers & ModelScope
--------------------------------------

To get a quick start with Qwen1.5, we advise you to try inference with
``transformers`` first. Make sure that you have installed
``transformers>=4.37.0``. The following is a very simple code snippet
showing how to run a Qwen1.5-Chat model, with Qwen1.5-7B-Chat as an example:

.. code:: python

    from transformers import AutoModelForCausalLM, AutoTokenizer
    device = "cuda" # the device to load the model onto

    # Now you do not need to add "trust_remote_code=True"
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen1.5-7B-Chat",
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")

    # Instead of using model.chat(), we directly use model.generate()
    # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
    prompt = "Give me a short introduction to large language model."
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
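
At this point, ``response`` holds the assistant's reply as a plain Python
string, so you can, for example, print it directly:

.. code:: python

    print(response)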

Previously, we used ``model.chat()`` (see ``modeling_qwen.py`` in the
previous Qwen models for more information). Now, we follow the practice
of ``transformers`` and directly use ``model.generate()`` together with
``apply_chat_template()`` from the tokenizer.

If you would like to apply Flash Attention 2, you can load the model as shown below:

.. code:: python

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen1.5-7B-Chat",
        torch_dtype="auto",
        device_map="auto",
        attn_implementation="flash_attention_2",
    )

To tackle downloading issues, we advise you to try ModelScope instead,
simply changing the first line of the code above to the following:

.. code:: python

    from modelscope import AutoModelForCausalLM, AutoTokenizer

Streaming mode for model chat is simple with the help of
``TextStreamer``. Below we show you an example of how to use it:

.. code:: python

    ...
    # Reuse the code before `model.generate()` in the last code snippet
    from transformers import TextStreamer

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        streamer=streamer,
    )
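
``TextStreamer`` prints the new tokens to the console as they are generated.
If you would rather consume the stream in your own code (for example, in a
web application), ``transformers`` also provides ``TextIteratorStreamer``.
Below is a minimal sketch that reuses the variables from the snippet above
and runs generation in a background thread:

.. code:: python

    from threading import Thread
    from transformers import TextIteratorStreamer

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(
        input_ids=model_inputs.input_ids,
        max_new_tokens=512,
        streamer=streamer,
    )
    # Run generation in a background thread so the streamer can be read here.
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    for new_text in streamer:
        print(new_text, end="", flush=True)
    thread.join()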

vLLM for Deployment
-------------------

To deploy Qwen1.5, we advise you to use vLLM. vLLM is a fast
and easy-to-use framework for LLM inference and serving. In the
following, we demonstrate how to build an OpenAI-API compatible API
service with vLLM.

First, make sure you have installed ``vLLM>=0.3.0``:

.. code:: bash

    pip install vllm

Run the following command to start a vLLM service. Here we take
Qwen1.5-7B-Chat as an example:

.. code:: bash

    python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat

Then, you can use the `create chat
interface <https://platform.openai.com/docs/api-reference/chat/completions/create>`__
to communicate with Qwen:

.. code:: bash

    curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "Qwen/Qwen1.5-7B-Chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me something about large language models."}
        ]
    }'

or you can use the Python client with the ``openai`` Python package, as shown
below:

.. code:: python

    from openai import OpenAI

    # Set OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    chat_response = client.chat.completions.create(
        model="Qwen/Qwen1.5-7B-Chat",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me something about large language models."},
        ]
    )
    print("Chat response:", chat_response)

Next Step
---------

Now, you can have fun with Qwen models. Would you like to learn more about
their usage? Feel free to check the other documents in this documentation.