Monitor Fine-Tuning & Training Metrics
This article explains how to monitor and access training metrics for fine-tuning jobs in Hyperstack AI Studio using both the API and the user interface. It covers how to check job status, track training and validation loss, and interpret visualizations to evaluate model performance throughout the training process.
View Training Metrics Using the UI
You can monitor training metrics in Hyperstack AI Studio during and after the fine-tuning process. While training is in progress, metrics are displayed on the model’s detail page. Once training completes, the full set of data becomes available under the Training Metrics tab in the Model Evaluations section.
The Training Metrics page includes the following:
- Final Metrics – Reports your model’s final training and validation loss values.
- Hyperparameters Used – Lists the configuration settings used during training, such as learning rate and batch size.
- Performance Comparison (Start/End) – Summarizes the change in loss values from before and after fine-tuning.
- Performance Comparison (Loss Chart) – A bar chart showing how much loss decreased through training.
- Model Performance Over Steps – A line graph tracking the training loss reduction over time.
See the Interpreting Training Metrics section for guidance on understanding your model’s fine-tuning results.
Monitor Training Metrics During Fine-Tuning
To view training metrics while a fine-tuning job is in progress, follow these steps:
1. Navigate to the Models page and select the fine-tuned model you want to monitor.
2. During training, real-time progress and metrics are shown directly on the model's detail page.

For help understanding these metrics, refer to the Interpreting Training Metrics section below.
Access Metrics After Training Completes
To review training metrics after a fine-tuning job has completed, follow these steps:
1. Once training finishes, navigate to the Models page and select the fine-tuned model you want to review.
2. Under the Model Evaluations section on the model's detail page, click Training Metrics to access the full set of training results.

For help understanding these metrics, refer to the Interpreting Training Metrics section below.
Interpreting Training Metrics
After training completes, the Training Metrics page displays a comprehensive summary of how your model performed during fine-tuning. The metrics and visualizations are organized into several key sections, each helping you assess different aspects of model behavior.
Final Metrics
- Training Loss: Indicates how well the model fit your training data. Lower values reflect better performance. In many cases, values below 1.0 suggest strong learning.
- Validation Loss: Measures how well the model generalizes to unseen data. Ideally, this should be close to the training loss. A large gap between the two may suggest overfitting.
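As a rough illustration of the gap check described above, the snippet below compares a final training loss against a final validation loss. The example values and the threshold are assumptions for illustration, not values produced by AI Studio.

```python
# Rough overfitting heuristic: compare final training and validation loss.
# The example values and the 0.5 threshold are illustrative assumptions.
final_train_loss = 0.82  # hypothetical final training loss
final_val_loss = 1.45    # hypothetical final validation loss

gap = final_val_loss - final_train_loss
if gap > 0.5:  # assumed threshold; adjust for your task
    print(f"Large train/validation gap ({gap:.2f}): possible overfitting")
else:
    print(f"Train/validation gap looks reasonable ({gap:.2f})")
```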
Hyperparameters Used
These settings define the training configuration and can help explain why the model performed a certain way:
- Learning Rate: The step size for model weight updates. Typical values are around 0.0001 for stable training.
- Batch Size: Number of examples processed in one step. Smaller values (e.g., 4) are common in constrained environments.
- Epochs: The number of full passes over the training data. More epochs can improve learning, but excessive values may overfit.
- Percentage of Dataset for Eval: Fraction of data held out for validation, commonly 5%.
- LoRA Rank (r): Controls the rank of the inserted low-rank adapters. Values of 32–64 are standard for balancing performance and resource usage.
- LoRA Alpha: A scaling factor for LoRA updates. Larger values increase the effect of the fine-tuned weights.
- LoRA Dropout: Helps prevent overfitting by adding noise. A value of 0.05 is commonly used.
- Gradient Accumulation Steps: Number of steps before backpropagation. Useful for simulating larger batch sizes without increasing memory usage.
- Micro Batch Size: Size of sub-batches within an accumulated step. Smaller values reduce memory load.
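For reference, the typical values above can be collected into a starting configuration like the sketch below. This is purely illustrative; the field names are not an AI Studio API schema, and any value not mentioned above is an assumption.

```python
# Illustrative starting configuration based on the typical values above.
# Field names are for readability only and do not mirror the AI Studio API.
typical_finetune_config = {
    "learning_rate": 1e-4,             # around 0.0001 for stable training
    "batch_size": 4,                   # small batches for constrained environments
    "epochs": 3,                       # assumed example; too many epochs may overfit
    "eval_dataset_percentage": 0.05,   # 5% of data held out for validation
    "lora_rank": 32,                   # 32-64 balances quality and resource usage
    "lora_alpha": 64,                  # assumed example scaling factor
    "lora_dropout": 0.05,              # commonly used to reduce overfitting
    "gradient_accumulation_steps": 4,  # assumed example value
    "micro_batch_size": 1,             # smaller values reduce memory load
}
```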
Performance Comparison (Start/End of Fine-Tuning)
This section summarizes the change in loss before and after training:
- Training Loss Reduction: Indicates how much better the model performs on its training data post-fine-tuning.
- Validation Loss Reduction: Reflects improved generalization. A strong decrease is desirable.
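One common way to express these reductions is as a percentage of the starting loss. The sketch below uses the validation loss values from the sample API response later in this article; the UI may compute its summary differently.

```python
# Percentage reduction in validation loss from start to end of fine-tuning.
# Values are taken from the sample API response shown later in this article.
val_loss_before = 3.79  # validation loss before fine-tuning
val_loss_after = 2.49   # validation loss after fine-tuning

reduction = (val_loss_before - val_loss_after) / val_loss_before * 100
print(f"Validation loss reduced by {reduction:.1f}%")  # ~34.3%
```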
Performance Comparison (Loss Chart)
Bar chart showing pre- and post-training loss values:
- Before Fine-Tuning (Gray): Baseline loss levels.
- After Fine-Tuning (Blue): Final loss values after model updates.
Interpretation:
- A visible drop in both bars indicates successful fine-tuning.
- Minimal change may indicate ineffective training or data mismatch.
Model Performance Over Steps (Loss Curve)
Line chart visualizing how training loss changed over time:
- A downward-sloping curve signals successful learning progression.
- Spikes or instability can suggest noisy data or poor learning rates.
- A flat or plateauing curve might indicate early convergence or underfitting.
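If you want to inspect the curve outside the UI, the per-step loss values returned by the Retrieve Training Metrics API (described in the next section) can be plotted locally. The sketch below assumes matplotlib is installed and uses the loss values from the sample response.

```python
# Plot per-step training loss to reproduce a simple loss curve locally.
# Assumes matplotlib is installed; the values mirror the sample API response.
import matplotlib.pyplot as plt

loss = [4.3348, 4.3847, 4.6717, 3.1481, 2.3838]

plt.plot(range(1, len(loss) + 1), loss, marker="o")
plt.xlabel("Logged step")
plt.ylabel("Training loss")
plt.title("Training loss over steps")
plt.show()
```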
Retrieve Training Metrics API
GET https://api.ai.hyperstack.cloud/api/v1/named-training-info-log/{MODEL_NAME}
This endpoint retrieves training details for a fine-tuned model, such as training and validation loss, training status, and performance history. It is helpful for monitoring model performance and debugging training runs.
Replace the following variables before running the command:
- API_KEY: Your API key.
- {MODEL_NAME}: The name of the model to retrieve training details for, included in the path of the request.
curl -X GET "https://api.ai.hyperstack.cloud/api/v1/named-training-info-log/{MODEL_NAME}" \
-H "X-API-Key: API_KEY" \
-H "Content-Type: application/json"
Required Parameters
- model_name (string) – Name of the model to retrieve training logs for.
Response
{
"metrics": {
"end_train_message": [
"Training job ended"
],
"end_train_status": [
"dormant"
],
"eval_loss": [
3.7874512672424316,
2.4864907264709473
],
"eval_perplexity": [],
"loss": [
4.3348,
4.3847,
4.6717,
3.1481,
2.3838
],
"perplexity": []
},
"status": "success"
}
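To pull the headline numbers out of a response with this shape, a small summary like the sketch below works. The sample payload mirrors the example response above; how empty arrays are handled is an assumption about edge cases.

```python
# Summarize headline metrics from a training-info response of the shape above.
# The sample payload mirrors the example response; empty-array handling is an
# assumption about edge cases.
sample_payload = {
    "metrics": {
        "end_train_status": ["dormant"],
        "eval_loss": [3.7874512672424316, 2.4864907264709473],
        "loss": [4.3348, 4.3847, 4.6717, 3.1481, 2.3838],
    },
    "status": "success",
}

metrics = sample_payload.get("metrics", {})
eval_loss = metrics.get("eval_loss", [])
loss = metrics.get("loss", [])

if len(eval_loss) >= 2:
    print(f"Validation loss: {eval_loss[0]:.3f} -> {eval_loss[-1]:.3f}")
if loss:
    print(f"Final training loss: {loss[-1]:.4f}")
print(f"End status: {metrics.get('end_train_status', ['unknown'])[0]}")
```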
Response field descriptions:

metrics (object) – Contains the training and evaluation metrics recorded during the fine-tuning process. It includes the following child attributes:
- end_train_message (array) – A message indicating how the training job concluded. Typically includes phrases like "Training job ended" or error descriptions if training was interrupted.
- end_train_status (array) – The final status of the training pod. Common values include "dormant" (training completed and resources were released) and "failed" (training encountered an error and was terminated).
- eval_loss (array) – Validation loss values. Typically the first value is the loss before fine-tuning and the second is the loss after fine-tuning. Lower values generally indicate better generalization performance.
- loss (array) – Training loss values recorded at different steps of the fine-tuning process. This sequence shows how the model's performance improved over time; the last value is the final training loss.
- eval_perplexity (array, optional) – Perplexity values on the validation set before and after fine-tuning. A lower value indicates more confident and accurate predictions. This field may be empty if perplexity is not computed.
- perplexity (array, optional) – Training perplexity recorded over steps. Not populated in all training runs.

status (string) – Indicates the result of the API call. "success" confirms that the training information was retrieved correctly.
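If the perplexity fields are empty, you can still estimate perplexity yourself: perplexity is conventionally the exponential of the cross-entropy loss. Whether AI Studio uses exactly this definition is an assumption.

```python
# Estimate perplexity as exp(loss). Whether AI Studio computes its perplexity
# fields exactly this way is an assumption.
import math

eval_loss_after = 2.4865  # post-fine-tuning validation loss from the sample response
print(f"Estimated perplexity: {math.exp(eval_loss_after):.1f}")  # ~12.0
```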
If a training job fails, the response reports a failure status and the metrics arrays may be empty, for example:

{
"metrics": {
"end_train_status": ["failed_training"],
"loss": []
},
"status": "success"
}
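A small guard like the one below can distinguish a failed run from a successful one before reading the metrics. The documented failure values are "failed" and "failed_training"; treating any status beginning with "failed" as a failure is an assumption.

```python
# Detect a failed training run from the response before reading metrics.
# "failed" and "failed_training" appear in this article; matching any status
# that starts with "failed" is an assumption.
def training_failed(payload: dict) -> bool:
    statuses = payload.get("metrics", {}).get("end_train_status", [])
    return any(status.startswith("failed") for status in statuses)

failed_payload = {
    "metrics": {"end_train_status": ["failed_training"], "loss": []},
    "status": "success",
}
print(training_failed(failed_payload))  # True
```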