Evaluations

Evaluations help you systematically measure and improve answer quality by running experiments against curated datasets and comparing results. Navigate to Tools → Evaluations to access the two sub-tabs: Datasets and Experiments.

Upload a dataset.

A dataset is a collection of question-answer pairs that represent the expected behavior of your chatbot. Each pair contains a question and an expected answer.

1. On the Datasets tab, click Upload dataset.
2. Provide a name, an optional description, and select a JSON file containing your question-answer pairs.
3. After upload, the dataset appears in the list showing its name, example count, creation date, and creator.
4. Click a dataset to view all its examples in a table with each question and expected answer. You can also delete datasets you no longer need.
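
As a rough illustration of preparing the JSON file before upload, the Python sketch below builds a list of question-answer pairs, checks that no pair has an empty field, and serializes the list. The key names question and expected_answer, the top-level list layout, and the sample pairs are assumptions made for this example; this page does not specify the exact schema the uploader expects, so confirm it before relying on this.

```python
import json

# Assumed field names; the uploader's actual schema may differ.
QUESTION_KEY = "question"
ANSWER_KEY = "expected_answer"


def validate_pairs(pairs):
    """Return a list of human-readable problems found in the pairs."""
    problems = []
    for i, pair in enumerate(pairs):
        for key in (QUESTION_KEY, ANSWER_KEY):
            value = pair.get(key) if isinstance(pair, dict) else None
            # A usable field is a non-empty, non-whitespace string.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"pair {i}: missing or empty '{key}'")
    return problems


# Placeholder pairs for illustration only.
pairs = [
    {QUESTION_KEY: "How do I reset my password?",
     ANSWER_KEY: "Open Settings, choose Security, and click Reset password."},
    {QUESTION_KEY: "Which file types can I upload?",
     ANSWER_KEY: ""},  # deliberately empty to show a validation problem
]

problems = validate_pairs(pairs)
for problem in problems:
    print(problem)

# Serialize the pairs; write this string to a .json file once problems is empty.
dataset_json = json.dumps(pairs, indent=2, ensure_ascii=False)
```

Running a check like this before upload catches blank questions or answers that would otherwise skew scores.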

Run an experiment.

An experiment runs every question in a dataset through the chatbot using the domain's current configuration (model, prompts, retrieval settings) and scores each generated answer against the expected answer.
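
Conceptually, an experiment is a loop over the dataset. The Python sketch below illustrates that loop under stated assumptions: run_experiment, fake_chatbot, and exact_match_scorer are hypothetical names, the "correctness" dimension is invented for the example, and the exact-match scorer is a simple stand-in for the evaluation LLM that the product uses. It is not the product's implementation.

```python
import time
from typing import Callable, Dict, List


def run_experiment(
    examples: List[dict],
    ask_chatbot: Callable[[str], str],
    score_answer: Callable[[str, str], Dict[str, float]],
) -> List[dict]:
    """Run every question through the chatbot and score each answer."""
    results = []
    for example in examples:
        row = {
            "question": example["question"],
            "expected": example["expected_answer"],
            "actual": None,
            "scores": {},
            "error": None,
        }
        started = time.perf_counter()
        try:
            row["actual"] = ask_chatbot(example["question"])
            row["scores"] = score_answer(row["actual"], row["expected"])
        except Exception as exc:
            # A failed example keeps its error message and has no scores.
            row["error"] = str(exc)
        row["seconds"] = time.perf_counter() - started
        results.append(row)
    return results


def fake_chatbot(question: str) -> str:
    """Hypothetical stand-in for the real chatbot call."""
    if "error" in question:
        raise RuntimeError("no answer generated")
    return "Paris"


def exact_match_scorer(actual: str, expected: str) -> Dict[str, float]:
    """Toy scorer; the real product scores with an evaluation LLM."""
    match = actual.strip().lower() == expected.strip().lower()
    return {"correctness": 1.0 if match else 0.0}


examples = [
    {"question": "What is the capital of France?", "expected_answer": "Paris"},
    {"question": "Trigger an error", "expected_answer": "n/a"},
]
results = run_experiment(examples, fake_chatbot, exact_match_scorer)
```

Each row carries either scores or an error, which mirrors the successful and failed counts reported in the results view.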

1. Switch to the Experiments tab.
2. Select a dataset from the dropdown and enter an experiment name to identify this run.
3. Click Run experiment. A progress indicator shows the experiment status (queued, running, completed, or failed) with elapsed time.
4. When the experiment completes, click View results to open the detailed results view.

Run experiments after changing prompts, model configuration, or indexed content to measure the impact on answer quality. Compare results across experiments to find the best configuration.

Review experiment results.

The results view shows a summary section and a per-example details table.

Summary cards display aggregate metrics for the run:

Total — the number of examples in the dataset.
Successful — examples that received a scored answer.
Failed — examples where the chatbot returned an error or could not generate an answer.
Average scores — per-dimension averages (displayed as percentages) for each scoring dimension returned by the evaluation LLM.
Total tokens — the total token consumption for the experiment run (when available).
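
To make the aggregation concrete, the Python sketch below derives the same kinds of summary values from per-example records. The record layout (scores, error, tokens), the dimension names, and the assumption that raw scores fall between 0 and 1 are illustrative choices for this example, not a description of how the product stores or computes its results.

```python
from collections import defaultdict


def summarize(results):
    """Aggregate per-example records into summary-card style metrics."""
    total = len(results)
    failed = sum(1 for r in results if r.get("error"))
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in results:
        if r.get("error"):
            continue  # failed examples contribute no scores
        for dimension, score in r.get("scores", {}).items():
            sums[dimension] += score
            counts[dimension] += 1
    # Assumes raw scores in the range 0..1; convert averages to percentages.
    averages = {d: round(100 * sums[d] / counts[d], 1) for d in sums}
    # Token counts may be missing for some examples; treat those as zero.
    total_tokens = sum(r.get("tokens") or 0 for r in results)
    return {
        "total": total,
        "successful": total - failed,
        "failed": failed,
        "average_scores": averages,
        "total_tokens": total_tokens,
    }


# Hypothetical records for illustration.
sample_results = [
    {"scores": {"correctness": 1.0, "completeness": 0.5}, "error": None, "tokens": 120},
    {"scores": {"correctness": 0.5, "completeness": 1.0}, "error": None, "tokens": 80},
    {"scores": {}, "error": "timeout", "tokens": None},
]
summary = summarize(sample_results)
print(summary)
```

Note that in this sketch the averages cover only successful examples, so a run with many failures can still show high average scores; read the Failed count alongside them.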

The details table lists every example with its question, expected answer, actual answer, per-dimension scores, any error message, execution time, and token counts. Click a row to expand a detailed view showing the full question, both answers, and per-score reasoning from the evaluation LLM. Use the Download CSV button to export all results for offline analysis.

A/B testing configurations

To A/B test in Guides Knowledge AI, run multiple experiments against the same dataset and vary a single configuration parameter between runs. This isolates the impact of each change so you can pick the setup that produces the best answers.

1. Create a baseline experiment: upload or select a dataset and run an experiment with your current configuration. Note the experiment name (e.g. "baseline-gpt4o").
2. Change one variable: go to Settings → Model configuration and swap the LLM model (e.g. switch from GPT-4o to GPT-4o-mini), adjust a behavioral parameter (temperature, max tokens), or update a prompt in Tools → Prompt manager.
3. Run a second experiment: return to Experiments, select the same dataset, give the run a descriptive name (e.g. "gpt4o-mini-test"), and click Run experiment.
4. Compare results: open both experiment results and compare the summary metrics (average scores, success rate, token usage) as well as per-example scores in the details table. Download the CSV for each run if you want to diff results in a spreadsheet.
5. Iterate: repeat steps 2–4 for each variable you want to test (different model, different temperature, different prompt wording, etc.) until you find the configuration that scores best on your dataset.

Keep your dataset stable across all A/B runs so that score changes reflect configuration differences, not data differences. Name experiments descriptively (include the variable you changed) to make comparisons easier later.
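
If you prefer a script to a spreadsheet, the Python sketch below compares one score column across two downloaded CSV exports and lists the questions that scored lower in the second run. The column names question and correctness and the inline sample data are assumptions for illustration; check the header row of your own export and adjust the names to match.

```python
import csv
import io


def load_scores(csv_text, key_column="question", score_column="correctness"):
    """Map each question to its score, skipping rows with no score."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row[key_column]: float(row[score_column])
        for row in reader
        if row[score_column]
    }


def compare_runs(baseline, candidate):
    """Return the mean score change and the questions that regressed."""
    shared = sorted(set(baseline) & set(candidate))
    deltas = {q: candidate[q] - baseline[q] for q in shared}
    mean_delta = sum(deltas.values()) / len(deltas) if deltas else 0.0
    regressions = [q for q, delta in deltas.items() if delta < 0]
    return mean_delta, regressions


# Inline stand-ins for two exported files (hypothetical columns and values).
baseline_csv = "question,correctness\nq1,1.0\nq2,0.5\nq3,0.5\nq4,0.0\n"
candidate_csv = "question,correctness\nq1,0.5\nq2,1.0\nq3,1.0\nq4,0.5\n"

baseline = load_scores(baseline_csv)
candidate = load_scores(candidate_csv)
mean_delta, regressions = compare_runs(baseline, candidate)
print(f"mean change: {mean_delta:+.2f}, regressions: {regressions}")
```

Matching rows by question works only because both runs use the same dataset, which is one more reason to keep the dataset stable between runs. To read real exports, replace the inline strings with the contents of each downloaded file.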

FAQ

What are Evaluations and where do I find them?

Evaluations help you systematically measure and improve answer quality by running experiments against curated datasets and comparing results. To access them, go to Tools → Evaluations, which contains two sub-tabs: Datasets and Experiments.

How do I upload a dataset for Evaluations?

On the Datasets tab, click Upload dataset, then provide a name, an optional description, and select a JSON file containing your question-answer pairs. After upload, the dataset appears in the list with its name, example count, creation date, and creator. You can click a dataset to view its examples (questions and expected answers) and delete datasets you no longer need.

How do I run an experiment using a dataset?

Go to the Experiments tab, select a dataset from the dropdown, and enter an experiment name to identify the run. Click Run experiment and watch the progress indicator for status (queued, running, completed, or failed) and elapsed time. When it completes, click View results to open the detailed results view.

What information is shown in the experiment results summary and details table?

The results view includes summary cards and a per-example details table. Summary cards show Total examples, Successful, Failed, Average scores (per scoring dimension as percentages), and Total tokens (when available). The details table lists each example’s question, expected answer, actual answer, per-dimension scores, any error message, execution time, and token counts, and you can download all results using Download CSV.

How do I A/B test configurations using Evaluations?

Run multiple experiments against the same dataset while changing only one configuration variable between runs. Create a baseline experiment, change one variable (such as model, temperature/max tokens, or a prompt), then run a second experiment with the same dataset and a descriptive name. Compare summary metrics (average scores, success rate, token usage) and per-example scores, and iterate while keeping the dataset stable across runs.
