Resources and GPUs
The Aizen ML workflow requires scheduling jobs for various stages of an ML pipeline. There are data-sink jobs to load data into data sinks, dataset jobs to create datasets, training jobs to train ML models, and so on. Each job may be assigned to a resource. The resource type is inherited from the job type; that is, data-sink jobs are assigned data-sink resources, training jobs are assigned training resources, and so on. If a job is not assigned a resource, a default resource is created for that job.
Jobs that are assigned to the same resource will share that resource. For example, if two dataset jobs are assigned to the same dataset resource, they will run on the same pod deployment, sharing the CPU and memory resources of that pod. It is highly recommended that you explicitly assign resources to each job rather than relying on the default resources; having long-lived jobs share a resource reduces the deployment cost. If GPU (graphics processing unit) resources are required for training, you must create and assign a GPU resource to the job, because GPUs are not part of the default resources.
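To make the sharing model concrete, a minimal sketch is shown below; the resource and job names are hypothetical, and the notation is illustrative rather than actual Aizen syntax.

```
# Illustrative sketch; names are hypothetical and the notation is not exact Aizen syntax.
resource: shared-dataset-res      # created once with the configure resource command
jobs:
  - build-train-dataset           # assigned to shared-dataset-res
  - build-eval-dataset            # assigned to shared-dataset-res
# Both dataset jobs run on the same pod deployment, sharing its CPU and memory.
```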
The configure resource command creates a resource or assigns a resource to a job. Using the configure resource command, you provide the size of the job, such as the number of dataset rows or the rate of prediction requests, and Aizen automatically selects the CPU and memory resources for the job. Advanced settings allow you to explicitly set the CPUs, memory, and number of workers for a resource.
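As an illustration, the inputs to the configure resource command might look like the sketch below; the field names are paraphrased from this description, not exact identifiers.

```
# Illustrative configure resource inputs; field names are paraphrased, not exact.
resource: prediction-res
job size: 1000 prediction requests per second   # Aizen selects CPUs and memory from this
# Optional advanced settings:
cpus: 8
memory: 32 GiB
workers: 2
```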
Typically, all resources for the Aizen ML workflow are available on the cluster where the ML pipeline is built and deployed. However, GPUs are an exception due to hardware cost. GPU server farms, where access to GPUs is provided as a service, are becoming common. For example, Lambda Labs is a cloud service that provides GPU instances. Even within enterprises, it is not uncommon to install on-prem GPU servers in different subnets and locations from CPU servers, due to power, cooling, or other requirements.
Aizen supports remote GPU resources. This feature allows you to configure a GPU resource at a location far from the source cluster where the data resides. Remote GPUs are supported for specific job types, such as deep-learning model training, runtimes for custom model training, and LLM serving. The remote GPU resource is used for the duration of the job and is released once the job completes.
There are two types of remote GPU resources:
A remote static cluster is a lightweight Aizen cluster that is installed on GPU servers. It does not contain all the components of a normal Aizen cluster installation, only the components required to run remote jobs. This method is ideal for scenarios where you own the GPU hardware or have a long lease on it.
To configure a remote GPU resource, use the configure resource command, select the Remote GPU checkbox, and select Remote Static Cluster. Enter the IP address of the controller in the remote cluster installation, and provide other optional parameters.
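A sketch of the resulting settings follows; the IP address is an example value and the field labels are paraphrased from the UI.

```
# Illustrative remote static cluster settings; labels paraphrased, IP is an example.
resource: remote-gpu-res
remote gpu: checked
type: Remote Static Cluster
controller ip: 203.0.113.10   # controller of the lightweight remote Aizen cluster
```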
A remote dynamic instance is an on-demand GPU instance obtained from a cloud service provider. After the instance is launched, a lightweight Aizen Docker image is run on the instance, providing all the functionality required for the job. The instance is terminated and released back to the service provider when the job completes. This method is ideal for scenarios where you rent GPUs on a short-term, hourly basis from a service provider.
To configure a remote GPU resource, use the configure resource command, select the Remote GPU checkbox, and select Remote Dynamic Instance. Select the Provider Name, and enter the Instance Type Name and the Region prefix. Before configuring the resource, you must first add the service provider as a GPU provider using the add cloudprovider command.
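For illustration, the settings might look like the sketch below; the provider, instance type, and region values are examples only (Lambda Labs is cited above as one such provider), and the field labels are paraphrased.

```
# Illustrative remote dynamic instance settings; all values are example placeholders.
resource: remote-gpu-res
remote gpu: checked
type: Remote Dynamic Instance
provider name: lambda-labs        # must already exist via the add cloudprovider command
instance type name: gpu_1x_a100   # provider-specific instance type (example)
region prefix: us-east            # example region prefix
```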
When using a remote GPU, you will need to share datasets and models between the source cluster and the remote GPU deployment. There are two mechanisms to achieve this:
If the source cluster's cloud storage bucket is accessible from the remote GPU deployment, select Shared Cloud Storage when configuring the resource. With this mechanism, the source cluster's cloud storage bucket name and credentials are passed to the remote GPU deployment, which then directly accesses datasets from the bucket in read-only mode.
If the source cluster's cloud storage bucket is not accessible from the remote GPU deployment, select Export Import Dataset and provide a configured data source as the Export Data Source. The configured data source must be a cloud storage bucket that is accessible from both the source cluster and the remote GPU deployment. With this mechanism, datasets are exported from the source cluster to the Export Data Source location and then imported into the remote GPU deployment.
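The two mechanisms map to settings along the lines of the sketch below; the data source name is hypothetical and the labels are paraphrased.

```
# Illustrative dataset-sharing settings for a remote GPU resource.
# Mechanism 1: the remote deployment can reach the source cluster's bucket.
shared cloud storage: checked       # bucket name and credentials are passed through; read-only access

# Mechanism 2: the source bucket is not reachable from the remote deployment.
export import dataset: checked
export data source: staging-bucket-ds   # hypothetical configured data source reachable from both sides
```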