Custom Evaluations

Custom evaluations provide a way to assess your fine-tuned model’s performance on specific tasks defined by your own criteria. You specify the evaluation goal and select relevant data, and Hyperstack AI Studio compares your model’s outputs against a baseline to determine which performs better. This guide covers how to create and run custom evaluations using the UI or API, and how to interpret the results to measure fine-tuning impact.

Create a Custom Evaluation Using the UI

Follow these steps to create and run a custom evaluation on your fine-tuned model:

  1. Access the Model

    • Navigate to the Models section in Hyperstack AI Studio.
    • Select the model you want to evaluate.
  2. Access Custom Evaluations

    • In the Model Evaluations section, click Custom Evaluation.
  3. Create Custom Evaluation

    • Click Create Custom Evaluation.
    • Provide a unique name for the evaluation.
    • Describe the evaluation criteria clearly.
      • Example: "Evaluate responses for kindness and politeness."
  4. Select Evaluation Data

    • Choose the data to use for the evaluation. You can select from:
      • All Logs – Use all available logs
      • By Tags – Filter logs by specific tags
      • By Dataset – Use logs from a specific dataset
    • Prompts from the selected logs will be used to evaluate the model.
  5. Save and Launch the Evaluation

    • Click Save Evaluation to store your configuration.
    • The evaluation will now appear under Your Custom Evaluations.
    • Select it and click Run Evaluation.
  6. Compare Against a Base Model

    • In the popup window, select the baseline model you want to compare your fine-tuned model against.
    • Click Confirm and Run.
  7. Review Evaluation Results

    • A confirmation message will notify you that the evaluation has started.
    • Once complete, results will be shown under the Evaluation Results section on the same page.
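If you prefer to drive the same workflow programmatically, the evaluation can also be created and launched through the AI Studio API. The snippet below is a minimal sketch only: the base URL, endpoint paths, and payload field names are assumptions for illustration, so check the API reference for the exact routes and request schemas your deployment exposes.

```python
import os
import requests

# Minimal sketch of creating and running a custom evaluation over HTTP.
# NOTE: the base URL, endpoint paths, and field names below are assumptions
# for illustration only -- consult the Hyperstack AI Studio API reference
# for the actual routes and request bodies.
API_BASE = "https://api.example-ai-studio.com/v1"   # hypothetical base URL
headers = {"Authorization": f"Bearer {os.environ['AI_STUDIO_API_KEY']}"}

# 1. Create the custom evaluation (name, criteria, and data selection).
create_resp = requests.post(
    f"{API_BASE}/custom-evaluations",               # hypothetical endpoint
    headers=headers,
    json={
        "name": "politeness-check",
        "criteria": "Evaluate responses for kindness and politeness.",
        "data_selection": {"type": "tags", "tags": ["support-chat"]},
    },
)
create_resp.raise_for_status()
evaluation_id = create_resp.json()["id"]

# 2. Run the evaluation against a baseline model for comparison.
run_resp = requests.post(
    f"{API_BASE}/custom-evaluations/{evaluation_id}/run",  # hypothetical endpoint
    headers=headers,
    json={
        "evaluated_model": "my-fine-tuned-model",
        "comparison_model": "base-model",
    },
)
run_resp.raise_for_status()
print("Evaluation started:", run_resp.json())
```

Once the run completes, the results appear under Evaluation Results in the UI, or can be fetched from the corresponding results endpoint if your deployment provides one.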
Interpreting results

Refer to the section below for guidance on interpreting your model’s evaluation results.


Interpreting Custom Evaluation Results

After running a custom evaluation, results are displayed in a comparison table that summarizes how your fine-tuned model performed relative to a baseline model. Understanding each metric will help you evaluate the impact of your fine-tuning:

  • Evaluated Model: The fine-tuned model being tested.
  • Comparison Model: The baseline model selected for side-by-side evaluation.
  • Improvement %: The percentage of evaluation prompts where your fine-tuned model outperformed the baseline. A higher value suggests a meaningful improvement from fine-tuning (see the sketch after this list).
  • Win / Draw / Loss:
    • Win: Number of prompts where the fine-tuned model was judged better than the baseline.
    • Draw: Cases where both models produced equivalent outputs.
    • Loss: Prompts where the baseline outperformed the fine-tuned model.
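As a quick illustration of how these counts relate, the sketch below derives the improvement percentage from Win / Draw / Loss totals, assuming it is simply the share of evaluated prompts won by the fine-tuned model (which matches the definition above). The platform's exact rounding or tie handling may differ.

```python
def improvement_percentage(wins: int, draws: int, losses: int) -> float:
    """Percentage of evaluation prompts where the fine-tuned model
    outperformed the baseline (assumed formula: wins / total prompts)."""
    total_prompts = wins + draws + losses
    if total_prompts == 0:
        return 0.0
    return 100.0 * wins / total_prompts

# Example: 14 wins, 4 draws, 2 losses over 20 evaluation prompts -> 70.0%
print(improvement_percentage(14, 4, 2))
```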

Custom evaluations are especially useful for assessing how well your model performs on specific tasks or datasets relevant to your use case. A high win rate and improvement percentage generally indicate strong task alignment, while a high number of draws may suggest minimal difference between models under the given criteria.

If results show little to no improvement, consider reviewing the evaluation criteria, log selection, or training data to ensure they align with your desired model behavior.
