Enable Fine-Tuning and Training Features


Install Plugins

  • Ensure the Volcano cluster plugin is installed.
  • Ensure the MLflow cluster plugin is installed (deploying it requires PostgreSQL).

Install Plugin

Download the following plugin artifacts from https://cloud.alauda.cn or https://cloud.alauda.io and push these plugins to the ACP platform.

  • MLflow: the MLflow tracking server for monitoring training experiments. After installation, an "MLFlow" menu entry appears in the AML navigation bar.
  • Volcano: schedules training jobs using various scheduler plugins, including Gang-Scheduling and Binpack.

# Note: replace platform address, username, password, and cluster name accordingly.
violet push --platform-address="https://192.168.171.123" \
--platform-username="admin@cpaas.io" \
--platform-password="platform-password" \
--clusters=g1-c1-gpu \
your-downloaded-package-file.tgz
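
When both plugin packages have been downloaded, the same push can be repeated in a loop. The following is a minimal sketch, not an official script: the variable names PLATFORM_ADDRESS, PLATFORM_USERNAME, PLATFORM_PASSWORD, CLUSTER, and DRY_RUN are hypothetical, and only the violet flags shown above are used. By default it prints the commands instead of running them.

```shell
# Sketch: push several downloaded plugin packages, one violet call per file.
# All variable names here are illustrative assumptions.
push_plugins() {
  for pkg in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run: print the command with the password masked.
      echo "violet push --platform-address=${PLATFORM_ADDRESS} --platform-username=${PLATFORM_USERNAME} --platform-password=*** --clusters=${CLUSTER} ${pkg}"
    else
      violet push \
        --platform-address="${PLATFORM_ADDRESS}" \
        --platform-username="${PLATFORM_USERNAME}" \
        --platform-password="${PLATFORM_PASSWORD}" \
        --clusters="${CLUSTER}" \
        "${pkg}" || return 1
    fi
  done
}
```

Set the four connection variables, review the printed commands, then call the function with DRY_RUN=0 and the package file names to perform the actual push.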

Go to "Administrator - Marketplace - Upload Packages", switch to the "Cluster Plugins" tab, find the uploaded plugins, and verify that their versions are correctly synced. Then go to "Administrator - Marketplace - Cluster Plugins", locate these plugins, click the "..." button on the right, and select "Install". Complete the setup form if required, then click "Install" to add the plugin to the current cluster.

Enable Features

Navigate to "Administrator - Clusters - Resources", then enter amlcluster in the search box on the left side. Click the "Correlated with Cluster" panel to find the AmlCluster resource. Within the AmlCluster resource, set tuneModels and datasets to true under spec.values.experimentalFeatures.

apiVersion: amlclusters.aml.dev/v1alpha2
kind: AmlCluster
metadata:
  name: default
spec:
  components:
    gateway:
      certificate:
        type: SelfSigned
      domain: '*.example.com'
    knativeServing:
      istioConfig:
        controlPlane:
          autoRevisionMode: legacy
      managementState: Managed
      providerType: Legacy
    kserve:
      managementState: Managed
  values:
    buildkitd:
      storage:
        type: emptyDir
    experimentalFeatures:
      datasets: true
      imageBuilder: false
      pretrain: true
      tuneModels: true
    global:
      deployFlavor: single-node
      gitlabAdminTokenSecretRef:
        name: aml-gitlab-admin-token
        namespace: cpaas-system
      gitlabBaseUrl: https://aml-gitlab.alaudatech.net
      mysql:
        database: aml
        host: mysql.kubeflow
        port: 3306
        user: root

  1. datasets: When set to true, the "Datasets" item appears in the left navigation menu.
  2. pretrain: When set to true, the "Training" item appears in the left navigation menu.
  3. tuneModels: When set to true, the "Fine-Tuning" item appears in the left navigation menu.
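
If you prefer the command line to the console, the same change could be applied as a merge patch. This is a sketch under assumptions: it presumes kubectl access to the cluster, that the resource can be addressed as amlcluster (derived from the kind in the manifest above), and that the object is named default as in the example. DRY_RUN is a hypothetical switch that defaults to printing the command only.

```shell
# Merge patch that sets the three flags shown as true in the manifest above.
PATCH='{"spec":{"values":{"experimentalFeatures":{"datasets":true,"pretrain":true,"tuneModels":true}}}}'

enable_training_features() {
  # "amlcluster" and "default" are assumptions taken from the example manifest.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "kubectl patch amlcluster default --type merge -p '${PATCH}'"
  else
    kubectl patch amlcluster default --type merge -p "${PATCH}"
  fi
}
```

A merge patch only touches the listed keys, so the other values under spec.values stay as they are.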

Task Templates

  1. Custom template upload: Ensure your custom fine-tuning template files are complete and uploaded to Task Template.
  2. Template authoring guide: For instructions on creating custom templates, refer to the Fine-tuning Template Developing Guide.

Download Templates:

Download the alaudadockerhub/training-templates image, then run the following command to extract example templates:

# Run this command in your terminal. Ensure the nerdctl CLI tool is installed.
# After completion, example templates will be available in the `files` directory under the current path.
nerdctl run --rm --net host -v "$PWD:/dst" \
  docker.io/alaudadockerhub/training-templates:20251119-g6a584922 \
  sh -c 'cp -r /files /dst/'

DANGER

The runtime images listed below are provided for download only. Import them into the platform image registry before use.

| Template | Task Type | Supported Models | Use Cases | Runtime Image |
| --- | --- | --- | --- | --- |
| finetune-object-detection | Object Detection | yolov5 (Community PyTorch version) | Suitable for high-density, real-time object localization and classification in images. Applicable in industrial quality inspection, logistics inventory, urban security, smart retail, and agricultural monitoring for millisecond-level anomaly detection and counting statistics. | alaudadockerhub/yolov5-runtime:v0.1.0 |
| finetune-time-series-forecasting | Time Series Forecasting | AWS Chronos-Bolt-Small (AutoGluon wrapped) | Zero-shot pre-trained large time series model that outputs multi-step probabilistic forecasts from historical sequences in retail, energy, finance, etc., without feature engineering, enabling minute-level deployment. | alaudadockerhub/autogluon-chronos-rt:v1.4.0-0 |
| finetune-image-classification-vit | Image Classification | Google ViT series | Mainly used for various computer vision tasks such as image classification, object detection, and image segmentation. | alaudadockerhub/llm-trainer:v1.4.4 |
| finetune-text-generation-llamafactory | Text Generation | GPT-4o / Llama series (OpenAI / Meta versions) | Used for generating text, code, dialogues, and multimodal content, such as chat AI, content creation, code assistance, and personalized recommendation systems. | alaudadockerhub/llamafactory-runtime:v1.5.1 |
| training-object-detection-ultralytics | Object Detection | yolov5 (Community PyTorch version) | Suitable for high-density, real-time object localization and classification in images. Applicable in industrial quality inspection, logistics inventory, urban security, smart retail, and agricultural monitoring for millisecond-level anomaly detection and counting statistics. | |
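
Importing a runtime image into the platform registry usually comes down to pull, tag, and push. The sketch below illustrates that with nerdctl, the CLI already used above. REGISTRY is a hypothetical registry address, DRY_RUN is a hypothetical switch that defaults to printing the commands, and keeping the repository path unchanged under the new registry is an assumption about how you want to name the imported image.

```shell
# REGISTRY is a placeholder for your platform image registry address.
REGISTRY="${REGISTRY:-registry.example.com}"

# Map docker.io/<repo>:<tag> to <REGISTRY>/<repo>:<tag>.
target_ref() {
  echo "${REGISTRY}/${1#docker.io/}"
}

# Pull the source image, retag it for the platform registry, and push it.
import_image() {
  src="$1"
  dst="$(target_ref "$src")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "nerdctl pull ${src}"
    echo "nerdctl tag ${src} ${dst}"
    echo "nerdctl push ${dst}"
  else
    nerdctl pull "$src" && nerdctl tag "$src" "$dst" && nerdctl push "$dst"
  fi
}
```

For example, `import_image docker.io/alaudadockerhub/yolov5-runtime:v0.1.0` prints the three commands for the YOLOv5 runtime image; run it with DRY_RUN=0 after logging in to the target registry.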

Upload Templates:

Using finetune-object-detection as an example, follow these steps:

  1. Modify the configuration file: Locate the config.yaml file in the template directory.
  2. Update image references: In config.yaml, update the following fields:
    1. image (training image): Replace the default training image with a YOLOv5 training image available in your AI platform image registry.
    2. tool-image (tool image): Replace the default tool image with a data download/upload tool image available in your AI platform image registry.
  3. Upload the modified finetune-object-detection directory as a template to the AI platform template repository.

WARNING

Ensure the updated image references point to images that the training environment can successfully pull.
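
The image replacement in step 2 can also be scripted. The following sketch assumes that image and tool-image each sit on their own line in config.yaml in the form `image: <ref>` and `tool-image: <ref>`; the actual layout of the file may differ, so check the result before uploading. The function name and its arguments are hypothetical.

```shell
# Rewrite the image and tool-image values in a template's config.yaml.
# Assumes one "image: <ref>" line and one "tool-image: <ref>" line;
# indentation in front of the keys is preserved.
update_template_images() {
  config="$1"
  train_image="$2"
  tool_image="$3"
  tmp="${config}.tmp"
  sed -e "s|^\([[:space:]]*\)image:.*|\1image: ${train_image}|" \
      -e "s|^\([[:space:]]*\)tool-image:.*|\1tool-image: ${tool_image}|" \
      "$config" > "$tmp" && mv "$tmp" "$config"
}
```

A call would look like `update_template_images files/finetune-object-detection/config.yaml <training-image> <tool-image>`, where the path and both image references are placeholders for your own values.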

Runtime Container Images

The training and data operations rely on specific container images:

  1. Training image
    • Download the image used for training and upload it to your local image repository (some templates may require you to build the image yourself).
    • (Optional, for quick trials) For a fast start, pull and import the provided YOLOv5 runtime image: docker.io/alaudadockerhub/yolov5-runtime:v0.1.0
  2. Tool image (for auxiliary data download and upload)

Add Topics to Task Templates:

To ensure a template displays correctly on the Alauda AI platform, create the following Topics for the template project:

  1. finetune or train
  2. v2
  3. object-detection (indicating the template type)