Serving an LLM

You can deploy a pretrained or fine-tuned LLM on Aizen; the served LLM handles text generation or chat completion requests. You can also deploy embeddings models on Aizen.

To serve an LLM or an embeddings model, follow these steps:

  1. Log in to the Aizen Jupyter console. See Using the Aizen Jupyter Console.

  2. Set the current working project.

    set project <project name>
  3. If the LLM was fine-tuned on Aizen, register the fine-tuned LLM that you want to deploy:

    list trained-models <ML model name>
    register model <ML model name>,<run id>,PRODUCTION 
    list registered-models

    This step is not required if the LLM is a pretrained model from the Hugging Face Hub.
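
    For example, with a hypothetical fine-tuned model named my_chat_model and the run id reported by list trained-models (shown here as 1 for illustration):

    list trained-models my_chat_model
    register model my_chat_model,1,PRODUCTION
    list registered-models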

  4. Configure an LLM deployment using the configure llm command.

    configure llm
  5. In the notebook, you are guided through a template form with boxes and drop-down lists that you complete to configure the deployment. You must specify the LLM deployment name and set the type to either llm or embeddings.

    • If the LLM was fine-tuned on Aizen, set the source type to aizen, and specify the registered model name and version.

    • If the LLM is a pretrained model from the Hugging Face Hub, set the source type to huggingface, and specify the model name.
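
    For example, a hypothetical deployment of a pretrained chat model from the Hugging Face Hub might use the deployment name llama3_chat, the type llm, the source type huggingface, and the model name meta-llama/Meta-Llama-3-8B-Instruct; substitute the values that match your own model.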

  6. Serve the model using the start llm command, which schedules a job to deploy the model. Optionally, you can configure resources for the job by running the configure resource command before starting the deployment; if you do not configure resources, default resource settings are applied. If you require GPUs, you must use the configure resource command, because GPU resources are not included in the default resource settings.

    configure resource
    start llm <llm deployment name>
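
    For example, to serve the hypothetical llama3_chat deployment configured above with the default resource settings:

    start llm llama3_chat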
  7. Check the status of the LLM deployment job and obtain serving URLs by running this command:

    status llm <llm deployment name>
  8. The base URL in the status output supports a REST API that lists the LLMs and embeddings models currently being served. The endpoint URL supports text generation or chat completion requests. Both URLs provide the FastAPI docs, Redoc, and OpenAPI paths.
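
    As a rough illustration, the following Python sketch shows how a client might call these URLs with the requests library. The URLs and the request body are assumptions; consult the FastAPI docs, Redoc, or OpenAPI schema exposed by your deployment for the actual paths and request fields.

    import requests

    # Hypothetical URLs; substitute the base URL and endpoint URL reported by `status llm`.
    base_url = "https://aizen.example.com/llm"
    endpoint_url = "https://aizen.example.com/llm/llama3_chat"

    # List the LLMs and embeddings models currently being served (REST API on the base URL).
    resp = requests.get(base_url, timeout=30)
    print(resp.json())

    # Send a chat completion request to the endpoint URL. The payload shape below is an
    # assumption; check the deployment's OpenAPI schema for the real field names.
    payload = {"messages": [{"role": "user", "content": "Hello!"}]}
    reply = requests.post(endpoint_url, json=payload, timeout=120)
    print(reply.json())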
