
Benchmark Evaluations

This guide explains how to evaluate fine-tuned models in Hyperstack AI Studio using benchmark tests. These standardized evaluations compare your model’s performance against selected base models using metrics that assess its ability to solve math problems, reason through scenarios, retrieve facts, and handle other domain-specific tasks. The guide also includes step-by-step instructions for running benchmark evaluations in the UI and an overview of all supported benchmark datasets.

Evaluate Your Model with Benchmarks Using the UI

Follow these steps to create and run a benchmark evaluation on your fine-tuned model:

  1. Access the Model

    • Navigate to the Models section in Hyperstack AI Studio.
    • Select the model you want to evaluate.
  2. Start a Benchmark Evaluation

    • Scroll to the Model Evaluations section.
    • Click Benchmark Evaluation.
  3. Configure Benchmark Evaluation

    • In the Evaluations Suite section, select one or more of the benchmark datasets that align with your model’s intended use case. Benchmarks cover domains including mathematics, problem-solving, reasoning, general AI, and factual knowledge.

    • Set the percentage of each benchmark dataset to use in the evaluation. The default is 10%, which provides a representative sample while minimizing evaluation time and cost.

    Dataset Usage

    You can increase the percentage of each benchmark dataset used for a more thorough evaluation, but doing so also increases evaluation time and cost. For a rough sense of how the percentage translates into the number of evaluated samples, see the sketch after these steps.

  4. Start the Evaluation

    • Click Start Evaluation to begin.
    • The evaluation will now appear under Benchmark Evaluations, where you can monitor the status of each pending test.
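
As a rough, illustrative example of how the dataset-usage percentage maps to the number of evaluated samples, the sketch below computes sample counts for a few percentages. The split sizes are approximate public test-split sizes and are assumptions for illustration only; the exact counts AI Studio uses may differ.

```python
# Illustrative sketch: how the dataset-usage percentage affects the number of
# evaluated samples. The split sizes below are approximate public test-split
# sizes (assumptions for illustration); AI Studio's actual counts may differ.

approx_test_split_sizes = {
    "GSM8K": 1319,   # approximate public test split
    "MATH": 5000,    # approximate public test split
}

def samples_evaluated(dataset_size: int, percentage: float) -> int:
    """Number of benchmark samples evaluated at a given usage percentage."""
    return max(1, round(dataset_size * percentage / 100))

for name, size in approx_test_split_sizes.items():
    for pct in (10, 50, 100):
        print(f"{name} at {pct}%: ~{samples_evaluated(size, pct)} samples")
```

At the default 10%, only a fraction of each benchmark is scored, which keeps evaluation time and cost low while still giving a representative signal.
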
Interpreting Results

Refer to the section below for guidance on interpreting your model’s benchmark results.


Interpreting Benchmark Results

Each benchmark dataset returns a score between 0 and 1, where 1.0 represents perfect performance. These scores reflect how accurately your fine-tuned model completes tasks such as solving equations, reasoning through context, or recalling factual information.

General guidelines for interpreting scores:

  • Above 0.80: Indicates strong and reliable performance.
  • 0.60 to 0.80: Suggests moderate proficiency with room for improvement.
  • Below 0.60: May highlight areas where the model is underperforming or needs additional training.

You can use these results to compare different model versions, identify strengths and weaknesses across domains, and inform decisions about retraining, prompt design, or dataset refinement. Always consider the complexity of each benchmark and your model's specific use case when interpreting scores.
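
As a quick illustration of the guidelines above, the minimal sketch below bins per-benchmark scores into the three bands and flags benchmarks that may need attention. The benchmark names and scores are placeholder values, not output from AI Studio.

```python
# Minimal sketch: bin benchmark scores (0.0-1.0) into the bands described
# above. The example scores are placeholders, not real AI Studio output.

def interpret_score(score: float) -> str:
    if score > 0.80:
        return "strong"          # strong and reliable performance
    if score >= 0.60:
        return "moderate"        # moderate proficiency, room for improvement
    return "needs attention"     # underperforming or needs more training

example_results = {  # placeholder scores for illustration only
    "GSM8K": 0.84,
    "MMLU": 0.71,
    "TriviaQA": 0.55,
}

for benchmark, score in example_results.items():
    print(f"{benchmark}: {score:.2f} ({interpret_score(score)})")
```
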


Benchmark Datasets

Hyperstack AI Studio includes a curated set of benchmark datasets to evaluate the capabilities of your fine-tuned models across a range of domains. Each benchmark targets a specific skill area, helping you assess performance on real-world and academic tasks.

Mathematics

Evaluate your model’s ability to solve mathematical problems.

  • MATH – Tests complex mathematical problem-solving skills.
  • GSM8K – Grade-school math word problems requiring multi-step arithmetic reasoning.

Problem-Solving

Assess general reasoning and programming challenge performance.

  • BIG-Bench – A broad set of challenging tasks from the BIG-Bench suite.
  • MMLU (Massive Multitask Language Understanding) – Multidisciplinary tasks across 57 academic subjects.
  • PIQA (Physical Interaction Question Answering) – Tests reasoning about physical actions and effects.
  • ARC (AI2 Reasoning Challenge) – Complex science questions requiring multi-step reasoning.
  • SIQA (Social IQA) – Evaluates situational understanding and social reasoning.
  • BoolQ – Binary (yes/no) questions requiring passage-level comprehension.
  • Needle in a Haystack – Assesses the model’s ability to retrieve specific information from long-form content.

Reasoning

Measure skills in logical, commonsense, and reading-based reasoning.

  • DROP (Discrete Reasoning Over Paragraphs) – Numerical and symbolic reasoning in context.
  • HellaSwag – Commonsense reasoning through sentence completion.
  • CommonsenseQA – Everyday knowledge and reasoning for question answering.

General AI

Test broad, general-purpose problem-solving abilities.

  • AGIEval – Evaluates diverse reasoning skills and general knowledge across domains.

Knowledge

Evaluate factual recall and domain-specific understanding.

  • TriviaQA – Tests general world knowledge through trivia-style questions.
  • OpenBookQA – Requires combining known facts with reasoning to answer open-book science questions.
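
For reference, the sketch below captures the catalog above as a simple lookup and shows one illustrative way to assemble a suite for a given use case. The domain groupings mirror the lists above; the selection helper is an assumption for illustration, not an AI Studio feature.

```python
# Sketch: the benchmark catalog above as a simple lookup, plus an illustrative
# helper for choosing a suite by domain. The groupings mirror the lists above;
# the selection logic itself is an assumption, not an AI Studio feature.

BENCHMARKS_BY_DOMAIN = {
    "Mathematics": ["MATH", "GSM8K"],
    "Problem-Solving": ["BIG-Bench", "MMLU", "PIQA", "ARC", "SIQA", "BoolQ",
                        "Needle in a Haystack"],
    "Reasoning": ["DROP", "HellaSwag", "CommonsenseQA"],
    "General AI": ["AGIEval"],
    "Knowledge": ["TriviaQA", "OpenBookQA"],
}

def suggest_suite(domains: list[str]) -> list[str]:
    """Return the benchmarks for the requested domains (illustrative only)."""
    return [b for d in domains for b in BENCHMARKS_BY_DOMAIN.get(d, [])]

# e.g. a math-tutoring model might be evaluated on Mathematics and Reasoning:
print(suggest_suite(["Mathematics", "Reasoning"]))
```
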
