Serving an LLM
You can deploy a pretrained LLM or a fine-tuned LLM on Aizen. The served LLM will handle text generation or chat completion requests. You can also deploy embeddings models on Aizen.
To serve an LLM or an embeddings model, follow these steps:
Log in to the Aizen Jupyter console. See Using the Aizen Jupyter Console.
Set the current working project.
If the LLM was fine-tuned on Aizen, register the fine-tuned LLM that you want to deploy:
This step is not required if the LLM is a pretrained model from the Hugging Face Hub.
Configure an LLM deployment using the configure llm command. In the notebook, you will be guided through a template form with boxes and drop-down lists that you complete to configure the deployment. You must specify the LLM deployment name and set the type to either llm or embeddings.
If the LLM was fine-tuned on Aizen, set the source type to aizen, and specify the registered model name and version.
If the LLM is a pretrained model from the Hugging Face Hub, set the source type to huggingface, and specify the model name.
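As a rough sketch of how those choices fit together, the hypothetical values below summarize the form fields for the two source types. The dictionaries are an illustration only, not an Aizen API, and every name and version in them is made up.

```python
# Illustration only: hypothetical values for the "configure llm" form fields.
# These dictionaries are not an Aizen API; they just summarize the choices.

# LLM fine-tuned on Aizen: source type "aizen" plus the registered model name and version.
fine_tuned_deployment = {
    "name": "support-chat-llm",        # LLM deployment name (hypothetical)
    "type": "llm",                     # "llm" or "embeddings"
    "source_type": "aizen",
    "model_name": "support-chat-llm",  # registered model name (hypothetical)
    "model_version": 1,                # registered model version (hypothetical)
}

# Pretrained LLM from the Hugging Face Hub: source type "huggingface" plus the model name.
pretrained_deployment = {
    "name": "mistral-chat",                              # LLM deployment name (hypothetical)
    "type": "llm",
    "source_type": "huggingface",
    "model_name": "mistralai/Mistral-7B-Instruct-v0.2",  # Hugging Face model name (example)
}
```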
Serve the model using the start llm command. This command schedules a job to deploy the model. Optionally, you can configure resources for the job by running the configure resource command. If you do not configure resources, default resource settings are applied. If you require GPUs, you must use the configure resource command, because GPU resources are not included in the default resource settings.
Check the status of the LLM deployment job and obtain the serving URLs by running this command:
The base URL in the status output supports a REST API that lists LLMs and embeddings models that are currently being served. The endpoint URL in the status output supports text generation or chat completion requests. Both of these URLs provide the FastAPI docs, Redoc, and OpenAPI paths.
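As a minimal sketch of calling these endpoints, the snippet below uses Python's requests library. The URL values and the request payload fields are assumptions for illustration; check the FastAPI docs (/docs), Redoc (/redoc), or OpenAPI schema (/openapi.json) of your deployment for the actual paths and request schemas.

```python
# Minimal sketch, not a definitive client: the URLs and payload fields are assumptions.
# Use the /docs, /redoc, or /openapi.json paths of your deployment for the real schemas.
import requests

BASE_URL = "https://aizen.example.com/llm"                     # base URL from the status output (hypothetical)
ENDPOINT_URL = "https://aizen.example.com/llm/my-deployment"   # endpoint URL from the status output (hypothetical)

# List the LLMs and embeddings models currently being served.
resp = requests.get(BASE_URL, timeout=30)
resp.raise_for_status()
print(resp.json())

# Send a chat completion request to the served LLM.
# The payload fields below are illustrative; use the schema shown in the FastAPI docs.
payload = {
    "messages": [
        {"role": "user", "content": "Summarize the quarterly report in two sentences."}
    ],
    "max_tokens": 256,
}
resp = requests.post(ENDPOINT_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())
```

The same pattern applies to an embeddings deployment, with an embeddings-style payload in place of the chat messages; again, the exact field names come from the deployment's OpenAPI schema.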