Resources and GPUs

The Aizen ML workflow requires scheduling jobs for various stages of an ML pipeline. There are data-sink jobs to load data into data sinks, dataset jobs to create datasets, training jobs to train ML models, and so on. Each job may be assigned to a resource. The resource type is inherited from the job type; that is, data-sink jobs are assigned data-sink resources, training jobs are assigned training resources, and so on. If a job is not assigned a resource, a default resource is created for that job.
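
The relationship between job types and resource types can be pictured as a simple lookup. The following Python sketch is purely illustrative; the names and the mapping are assumptions, not part of the Aizen API:

# Illustrative only: resource types follow job types, and a job without an
# explicitly assigned resource gets a default resource of the matching type.
JOB_TO_RESOURCE_TYPE = {
    "datasink": "datasink",
    "dataset": "dataset",
    "training": "training",
    "prediction": "prediction",
}

def resource_for_job(job_type, assigned_resource=None):
    if assigned_resource is not None:
        return assigned_resource        # jobs assigned the same name share that resource
    return "default-" + JOB_TO_RESOURCE_TYPE[job_type] + "-resource"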

Jobs that are assigned to the same resource will share that resource. For example, if two dataset jobs are assigned to the same dataset resource, they will run on the same pod deployment, sharing the CPU and memory resources of that pod. It is highly recommended that you explicitly assign a resource to each job rather than relying on the default resources. Having long-lived jobs share resources reduces the deployment cost. If GPU (graphics processing unit) resources are required for training, you must create a GPU resource and assign it to the job, because GPUs are not part of the default resources.

The configure resource command creates a resource or assigns an existing resource to a job. With this command, you provide the size of the job, such as the number of dataset rows or the rate of prediction requests, and Aizen automatically selects the CPU and memory resources for the job. Advanced settings allow you to explicitly set the CPUs, memory, and number of workers for a resource.
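
As a rough illustration of the idea (not Aizen's actual sizing logic), automatic selection maps a job-size hint to CPU and memory, while advanced settings override it; every threshold and field name below is an assumption:

def select_resources(dataset_rows, cpus=None, memory_gb=None, workers=None):
    # Advanced settings: explicit values take precedence over automatic sizing.
    if cpus is not None and memory_gb is not None and workers is not None:
        return {"cpus": cpus, "memory_gb": memory_gb, "workers": workers}
    # Automatic sizing from the job-size hint (here, the number of dataset rows).
    if dataset_rows < 1_000_000:
        return {"cpus": 2, "memory_gb": 8, "workers": 1}
    if dataset_rows < 100_000_000:
        return {"cpus": 8, "memory_gb": 32, "workers": 2}
    return {"cpus": 16, "memory_gb": 64, "workers": 4}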

On-Demand Remote GPUs

Typically, all resources for the Aizen ML workflow are available on the cluster where the ML pipeline is built and deployed. GPUs, however, are an exception because of their hardware cost. GPU server farms, where access to GPUs is provided as a service, are becoming common; for example, Lambda Labs is a cloud service that provides GPU instances. Even within enterprises, it is not uncommon to install on-prem GPU servers in different subnets and locations from CPU servers because of power, cooling, or other requirements.

Aizen supports remote GPU resources. This feature lets you configure a GPU resource at a location remote from the source cluster where the data resides. Remote GPUs are supported for specific job types, such as deep learning model training, runtimes for custom model training, and LLM serving. The remote GPU resource is used for the duration of the job and is released once the job completes.

There are two types of remote GPU resources:

  • Remote Static Clusters

  • Remote Dynamic Instances

Remote Static Clusters

A remote static cluster is a lightweight Aizen cluster installed on GPU servers. It does not contain all the components of a normal Aizen cluster installation, only the components required to run remote jobs. This method is ideal for scenarios where you own the GPU hardware or have a long lease on it.

To configure a remote GPU resource, use the configure resource command, select the Remote GPU checkbox, and select Remote Static Cluster. Enter the IP address of the controller in the remote cluster installation, and provide any other optional parameters.

Remote Dynamic Instances

A remote dynamic instance is an on-demand GPU instance obtained from a cloud service provider. After the instance is launched, a lightweight Aizen Docker image is run on the instance, providing all the functionality required for the job. The instance is terminated and released back to the service provider when the job completes. This method is ideal for scenarios where you rent GPUs from a service provider on a short-term, hourly basis.
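
Conceptually, the lifecycle looks like the sketch below, written against AWS EC2 with boto3 purely as an example provider; the AMI, instance type, and region are hypothetical, and in practice Aizen performs these steps for you:

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")     # hypothetical region

# 1. Launch an on-demand GPU instance from the cloud provider.
instance = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",                    # hypothetical GPU-ready image
    InstanceType="g5.xlarge",                           # hypothetical GPU instance type
    MinCount=1,
    MaxCount=1,
)["Instances"][0]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance["InstanceId"]])

# 2. The platform then runs a lightweight Aizen Docker image on the instance,
#    which provides all the functionality required for the job (not shown here).

# 3. When the job completes, terminate the instance so it is released back
#    to the service provider and billing stops.
ec2.terminate_instances(InstanceIds=[instance["InstanceId"]])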

To configure a remote GPU resource, use the configure resource command, select the Remote GPU checkbox, and select Remote Dynamic Instance. Select the Provider Name, and enter the Instance Type Name and the Region prefix. Before configuring the resource, you must first add the service provider as a GPU provider using the add cloudprovider command.

Remote Data Transfer Mechanism

When using a remote GPU, you need to share datasets and models between the source cluster and the remote GPU deployment. Aizen provides two mechanisms for this:

Shared Cloud Storage

If the source cluster's cloud storage bucket is accessible from the remote GPU deployment, select Shared Cloud Storage when configuring the resource. With this mechanism, the source cluster's cloud storage bucket name and credentials are passed to the remote GPU deployment, which then reads datasets directly from the cloud storage bucket in read-only mode.
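
In spirit, the remote side does something like the following minimal sketch, which reads a Parquet dataset with pandas and s3fs; the bucket name, dataset path, and environment variable names are assumptions:

import os
import pandas as pd

# The source cluster's bucket name and credentials were passed to the remote
# GPU deployment when the resource was configured (all names are hypothetical).
df = pd.read_parquet(
    "s3://source-cluster-bucket/datasets/training_dataset/",
    storage_options={
        "key": os.environ["SOURCE_BUCKET_ACCESS_KEY"],
        "secret": os.environ["SOURCE_BUCKET_SECRET_KEY"],
    },
)

# Access is read-only: the remote deployment only downloads datasets and never
# writes back to the source cluster's bucket.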

Export Import Dataset

If the source cluster's cloud storage bucket is not accessible from the remote GPU deployment, select Export Import Dataset and provide a configured data source as the Export Data Source. The configured data source must be a cloud storage bucket that is accessible from both the source cluster and the remote GPU deployment. With this mechanism, datasets are exported from the source cluster to the Export Data Source location and then imported into the remote GPU deployment.
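
The export/import path can be pictured as a two-hop copy through a bucket that both sides can reach. The sketch below uses boto3, and the bucket and prefix names are hypothetical; in practice, the export dataset and import dataset commands handle this transfer for you:

import boto3

s3 = boto3.client("s3")

# On the source cluster: export the dataset to the shared export data source.
objects = s3.list_objects_v2(Bucket="source-cluster-bucket",
                             Prefix="datasets/training_dataset/")
for obj in objects.get("Contents", []):
    s3.copy_object(
        Bucket="export-datasource-bucket",
        Key=obj["Key"],
        CopySource={"Bucket": "source-cluster-bucket", "Key": obj["Key"]},
    )

# On the remote GPU deployment: import the dataset by reading it from the
# export data source, since the source cluster's bucket is not reachable there.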
