Create Fine-tuning Tasks
Prepare Datasets
Alauda AI fine-tuning tasks support reading datasets from S3 storage and from Alauda AI datasets. Upload your dataset to one of these locations before creating a fine-tuning task.
The dataset format must follow the requirements of the selected task template. For example, the YOLOv5 task template requires the dataset to be organized in a COCO128-like layout and to include a YAML configuration file, as sketched below.
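A minimal sketch of such a configuration file, assuming the template follows the upstream YOLOv5 convention for COCO128-style datasets; the exact fields and paths expected by your task template may differ, so verify them against the template's documentation:

```yaml
# Hypothetical data.yaml for a COCO128-like dataset layout
# (field names follow the upstream YOLOv5 convention; confirm them
# against the requirements of your task template)
path: /data/my-coco128-dataset   # dataset root: contains images/ and labels/
train: images/train              # training images, relative to 'path'
val: images/val                  # validation images, relative to 'path'
nc: 3                            # number of classes (COCO128 itself uses 80)
names: ['person', 'bicycle', 'car']  # class names, one per class index
```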
If you are using S3 storage, you need to create a Secret in your namespace with the following fields (see the example after the field list):
- namespace: Change to your current namespace.
- s3-url: Set to your S3 storage service endpoint and bucket, for example `https://endpoint:port/bucket`.
- s3-name: Display name for the storage, for example `minIO-1 http://localhost:9000/first-bucket`, where `minIO-1` is the `s3-name`.
- s3-path: The location of the file in the storage bucket, specifying a file or folder. Use '/' for the root directory.
- AWS_ACCESS_KEY_ID: Replace this with your Access Key ID.
- AWS_SECRET_ACCESS_KEY: Replace this with your Secret Access Key.
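A minimal sketch of such a Secret, assuming the data keys simply mirror the field names listed above; the Secret name `s3-dataset-secret` is a placeholder, and the exact keys your Alauda AI version expects may differ:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3-dataset-secret        # placeholder name
  namespace: <your-namespace>    # change to your current namespace
type: Opaque
stringData:                      # stringData avoids manual base64 encoding
  s3-url: https://endpoint:port/bucket   # S3 endpoint and bucket
  s3-name: minIO-1                       # display name for the storage
  s3-path: /                             # file or folder in the bucket; '/' = root
  AWS_ACCESS_KEY_ID: <your-access-key-id>
  AWS_SECRET_ACCESS_KEY: <your-secret-access-key>
```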
Steps to Create Fine-Tuning Tasks
- In Alauda AI, go to Model Optimization → Fine-Tuning, then click Create Fine-tuning Task. In the popup dialog, select a template from the dropdown list and click Create.
- On the fine-tuning task creation page, fill in the form, then click Create and Run. See the table below for more information about each field.
Fine-Tuning Form Field Explanation:
Task Status
The task details page provides comprehensive information about each task, including Basic Info, Basic Model, Output Model, Data Configurations, Resource Configuration, and Hyper Parameters Configurations. The Basic Info section displays the task status, which can be one of the following:
- pending: The job is waiting to be scheduled.
- aborting: The job is being aborted due to external factors.
- aborted: The job has been aborted due to external factors.
- running: At least the minimum required pods are running.
- restarting: The job is restarting.
- completing: At least the minimum required pods are in the completing state; the job is performing cleanup.
- completed: At least the minimum required pods are in the completed state; the job has finished cleanup.
- terminating: The job is being terminated due to internal factors and is waiting for pods to release resources.
- terminated: The job has been terminated due to internal factors.
- failed: The job could not start after the maximum number of retry attempts.
Experiment Tracking
The platform provides built-in experiment tracking for training and fine-tuning tasks through integration with MLflow. All tasks executed within the same namespace are logged under a single MLflow experiment named after that namespace, with each task recorded as an individual run. Configuration, metrics, and outputs are automatically tracked during execution.
During training, key metrics are continuously logged to MLflow, and you can view the real-time metric dashboards in the experiment tracking tab.
On the task details page, users can open the Tracking tab to view line charts that show how metrics such as loss or other task-specific indicators evolve along a unified time axis.
This allows users to quickly assess training progress, convergence behavior, and potential anomalies without manually inspecting logs.
In addition to single-task tracking, the platform supports experiment comparison. Users can select multiple training tasks from the task list and enter a comparison view, where the differences in hyperparameters and other critical configurations are presented side by side. This makes it easier to understand how changes in training settings impact model behavior and outcomes, supporting more informed iteration and optimization of training strategies.
By combining MLflow-based metric tracking with native visualization and comparison features, the platform enables experiments to be observable, comparable, and reproducible throughout the model training lifecycle.