Reinforcement Fine-Tuning

Reinforcement Fine-Tuning (RFT) is a post-training technique that leverages reinforcement learning to align large language models with human preferences, enhancing their performance on complex, open-ended tasks.

The reinforcement fine-tuning workflow on HPC-AI.COM consists of 7 steps that take you from model selection to a finished fine-tuned model.

Step 1: Go to Fine-Tuning Page

  1. Log in to HPC-AI.COM.
  2. From the left sidebar, click Fine-Tuning, then select Fine-Tune a Model.

rl-job-page.png

Step 2: Select Model Template

Choose from the built-in model templates. Currently supported:

  • Qwen 3 - 4B
  • Qwen 3 - 8B
  • Qwen 3 - 14B
  • LLaMA 3.2 - 3B - Instruct
  • LLaMA 3 - 8B - Instruct

rl-job-model.png

Step 3: Select Reinforcement Algorithms

Every model template supports reinforcement fine-tuning with one of the following algorithms: GRPO, DAPO, Reinforce++ (baseline), or RLOO.

rl-job-algorithms.png
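
For orientation, two of the algorithms listed above differ mainly in how they turn the rewards of several sampled responses to the same prompt into advantages. The sketch below follows the general published descriptions of GRPO and RLOO, not the platform's actual implementation. The function names, the `eps` constant, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center each reward on the group mean and
    scale by the group standard deviation (population std assumed here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """RLOO-style advantages: each reward minus the mean reward of the
    other samples in the same group (leave-one-out baseline)."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Four sampled answers to one prompt, rewarded 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))
print(rloo_advantages(rewards))
```

In both cases, correct answers receive positive advantages and incorrect ones negative advantages, without requiring a separate value model.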

For inquiries about additional templates, please reach out to us at service@hpc-ai.com.

Step 4: Upload Training Data

We provide two convenient methods for uploading your training data:

  • Method 1: Load from Storage (Recommended). If your training data is already stored in cloud storage, you can directly select and import files from your cloud drive without the need for re-uploading.
  • Method 2: Direct Data Upload. You can also upload your local training data files directly using the upload box below.

rl-job-data.png

Example Format for Training Data

Training data must be in JSONL format, with one JSON object per line. Each object has the following structure (pretty-printed here for readability):

{
  "messages": {
    "role": "user",
    "content": "Let \\[f(x) = \\left\\{\n\\begin{array}{cl} ax+3, &\\text{ if }x>2, \\\\\nx-5 &\\text{ if } -2 \\le x \\le 2, \\\\\n2x-b &\\text{ if } x <-2.\n\\end{array}\n\\right.\\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)."
  },
  "gt_answer": "0"
}

  • content: The prompt given to the model, typically a math question.
  • gt_answer: The ground-truth answer to the question.

You can also click "specified format" on the upload page to view the recommended format.
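
The record structure above can be generated and checked locally before uploading. The following is a minimal sketch, not an official tool: the helper names (`make_record`, `validate_record`, `to_jsonl`) are hypothetical, and the checks (role equal to "user", string-valued `gt_answer`) are inferred from the single example shown rather than from a published schema.

```python
import json


def make_record(question: str, answer: str) -> dict:
    """Build one training record in the structure shown above."""
    return {
        "messages": {"role": "user", "content": question},
        "gt_answer": answer,
    }


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none).

    The rules here are assumptions inferred from the example record.
    """
    problems = []
    messages = record.get("messages")
    if not isinstance(messages, dict):
        problems.append("'messages' must be an object")
    else:
        if messages.get("role") != "user":
            problems.append("'messages.role' should be 'user'")
        content = messages.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append("'messages.content' must be a non-empty string")
    if not isinstance(record.get("gt_answer"), str):
        problems.append("'gt_answer' must be a string")
    return problems


def to_jsonl(records) -> str:
    """Serialize records as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


records = [make_record("What is $1+1$?", "2")]
for i, rec in enumerate(records, start=1):
    for problem in validate_record(rec):
        print(f"record {i}: {problem}")
print(to_jsonl(records), end="")
```

Because `json.dumps` escapes newlines inside strings as `\n`, multi-line questions such as the LaTeX example still occupy a single line in the output file.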

Step 5: Select Compute Resource

Fill in the required fields:

  • GPU Type: Choose from H100 or H200 GPUs, with B200 coming soon.
  • GPU Region: Select your preferred compute region, such as Singapore or United States.
  • Remote Storage: Select a remote storage in the same region as your GPU to read/write training data and models. If no remote storage exists for that region, you will need to create one.

rl-job-compute.png

Step 6: Other Configurations

Fill in the required fields:

  • Job Name: Enter a name for your job.

  • Wandb Key (optional but recommended): Enter your Weights & Biases API key to enable tracking. With it, you can monitor GPU-level metrics such as:

    • GPU frequency
    • Utilization
    • I/O performance

rl-job-other-1.png

You can also click the Advanced Options button to view and edit the hyperparameter settings.

rl-job-other-2.png

Step 7: Submit and Monitor Your Job

  • Click Create Job to start the job.
  • Monitor the job status under Job Status.
  • Running indicates that the RFT job is in progress.
  • Once the status changes to Succeeded, your fine-tuned model is ready to use.

rl-job-status.png