Creating Training Datasets for LLMs

Fine-tuning an LLM requires a training dataset. To create one, follow these steps:

Log in to the Aizen Jupyter console. See Using the Aizen Jupyter Console.
Create an ML project if you have not already done so, or set the current working project to an existing one.
```
create project <project name>
```
or
```
set project <project name>
```
Configure the dataset by running the configure dataset command:
```
configure dataset
```
The notebook guides you through a template form with boxes and drop-down lists. Complete the form to create the features for the dataset.
- If the input to the LLM is a single column in the dataset, then that column can contain the entire input text, including the prompt, or you can configure a prompt template separately during fine-tuning.
- If the input to the LLM is two or more columns from the dataset, then you must configure a prompt template separately during fine-tuning.
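To illustrate why multiple input columns need a prompt template, here is a minimal Python sketch of what a template does conceptually: it merges several columns of a row into one input string. It is not Aizen's template syntax. The column names (`instruction`, `context`), the template text, and the `render_prompt` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical rows from a dataset with two input columns.
rows = [
    {"instruction": "Summarize the text.", "context": "Aizen schedules dataset jobs."},
    {"instruction": "Translate to French.", "context": "Good morning."},
]

# Hypothetical prompt template; each placeholder name matches a column name.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Context:\n{context}\n\n"
    "### Response:\n"
)


def render_prompt(row: dict, template: str = PROMPT_TEMPLATE) -> str:
    """Fill the template's placeholders with the values from one row."""
    return template.format(**row)


# Each row's columns are merged into a single input string for the LLM.
prompts = [render_prompt(row) for row in rows]
print(prompts[0])
```

With a single input column, no merging is needed, which is why the column itself can carry the full prompt text.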
Create the training dataset by running the start dataset command, which schedules a job. Optionally, configure resources for the job first by running the configure resource command. If you do not configure resources, default resource settings are applied.
```
configure resource
start dataset <dataset name>
```

Wait for the job to complete, and then check your training dataset:
```
status dataset <dataset name>
list datasets
display dataset <dataset name>
```
