Evaluations
Evaluations help you systematically measure and improve answer quality by running experiments against curated datasets and comparing results. Navigate to Tools → Evaluations to access the two sub-tabs: Datasets and Experiments.
Upload a dataset.
A dataset is a collection of question-answer pairs that represent the expected behavior of your chatbot. Each pair contains a question and an expected answer.
- On the Datasets tab, click Upload dataset.
- Provide a name, an optional description, and select a JSON file containing your question-answer pairs.
- After upload, the dataset appears in the list showing its name, example count, creation date, and creator.
- Click a dataset to view all its examples in a table with each question and expected answer. You can also delete datasets you no longer need.
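The JSON schema that the uploader expects is not specified on this page, so the following is only a sketch. It assumes a top-level array of objects with the hypothetical keys `question` and `expected_answer`, and the sample pairs are made up for illustration. Confirm the field names your deployment requires before uploading.

```python
import json
from pathlib import Path

# Hypothetical field names: the schema the uploader really expects may differ.
REQUIRED_KEYS = ("question", "expected_answer")

# Made-up sample pairs, for illustration only.
examples = [
    {
        "question": "How do I reset my password?",
        "expected_answer": "Open Settings, choose Security, then click Reset password.",
    },
    {
        "question": "Which file types can I upload?",
        "expected_answer": "PDF, DOCX, and plain text files are supported.",
    },
]


def validate(pairs):
    """Return a list of human-readable problems (empty if the pairs look fine)."""
    problems = []
    for index, pair in enumerate(pairs):
        for key in REQUIRED_KEYS:
            value = pair.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"example {index}: missing or empty {key!r}")
    return problems


def write_dataset(pairs, path):
    """Validate the pairs, then write them to `path` as UTF-8 JSON."""
    problems = validate(pairs)
    if problems:
        raise ValueError("; ".join(problems))
    text = json.dumps(pairs, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")
```

Calling `write_dataset(examples, "support-faq.json")` would produce a file you can select in the upload dialog. Validating before upload catches empty questions or answers early, which would otherwise show up later as confusing scores.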
Run an experiment.
An experiment runs every question in a dataset through the chatbot using the domain's current configuration (model, prompts, retrieval settings) and scores each generated answer against the expected answer.
- Switch to the Experiments tab.
- Select a dataset from the dropdown and enter an experiment name to identify this run.
- Click Run experiment. A progress indicator shows the experiment status (queued, running, completed, or failed) with elapsed time.
- When the experiment completes, click View results to open the detailed results view.
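Conceptually, an experiment is a loop over the dataset: generate an answer, score it against the expected answer, and record any error. The sketch below illustrates that loop only. It is not the product's implementation, and `answer_fn`, `score_fn`, and `ExampleResult` are hypothetical names that stand in for the chatbot, the evaluation LLM, and a result row.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ExampleResult:
    """One row of results (hypothetical structure, for illustration)."""
    question: str
    expected_answer: str
    actual_answer: Optional[str] = None
    scores: Dict[str, float] = field(default_factory=dict)
    error: Optional[str] = None


def run_experiment(
    dataset: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
    score_fn: Callable[[str, str, str], Dict[str, float]],
) -> List[ExampleResult]:
    """Run every question through answer_fn and score the generated answer."""
    results = []
    for pair in dataset:
        result = ExampleResult(pair["question"], pair["expected_answer"])
        try:
            result.actual_answer = answer_fn(pair["question"])
            result.scores = score_fn(
                pair["question"], pair["expected_answer"], result.actual_answer
            )
        except Exception as exc:
            # A failure on one example is recorded; the run continues.
            result.error = str(exc)
        results.append(result)
    return results
```

Recording the error per example instead of aborting mirrors the way the results view reports failed examples alongside successful ones.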
Review experiment results.
The results view shows a summary section and a per-example details table.
Summary cards display aggregate metrics for the run:
- Total — the number of examples in the dataset.
- Successful — examples that received a scored answer.
- Failed — examples where the chatbot returned an error or could not generate an answer.
- Average scores — per-dimension averages (displayed as percentages) for each scoring dimension returned by the evaluation LLM.
- Total tokens — the total token consumption for the experiment run (when available).
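To make the summary cards concrete, the sketch below recomputes them from a list of per-example records. The record keys (`scores`, `error`, `tokens`) are hypothetical, and it assumes raw scores lie between 0 and 1 so that multiplying by 100 yields a percentage. The product may count and round differently.

```python
from collections import defaultdict


def summarize(results):
    """Aggregate per-example records into summary-card style metrics."""
    total = len(results)
    # Assumption: a record with an error counts as failed, all others as successful.
    successful = [r for r in results if not r.get("error")]

    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in successful:
        for dimension, score in record.get("scores", {}).items():
            sums[dimension] += score
            counts[dimension] += 1

    # Assumption: scores are in the range 0..1, shown here as percentages.
    averages_pct = {
        dimension: round(100 * sums[dimension] / counts[dimension], 1)
        for dimension in sums
    }

    token_values = [r["tokens"] for r in results if r.get("tokens") is not None]
    return {
        "total": total,
        "successful": len(successful),
        "failed": total - len(successful),
        "average_scores_pct": averages_pct,
        # None when no record carries token counts ("when available").
        "total_tokens": sum(token_values) if token_values else None,
    }
```

Note that the averages here are taken over successful examples only, so a run with many failures can still show high average scores. Read the Failed card together with the averages.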
The details table lists every example with its question, expected answer, actual answer, per-dimension scores, any error message, execution time, and token counts. Click a row to expand a detailed view showing the full question, both answers, and per-score reasoning from the evaluation LLM. Use the Download CSV button to export all results for offline analysis.
To A/B test in Guides Knowledge AI, run multiple experiments against the same dataset and vary a single configuration parameter between runs. Changing only one thing at a time isolates the impact of each change, so you can pick the setup that produces the best answers.
1. Create a baseline experiment: upload or select a dataset and run an experiment with your current configuration. Note the experiment name (e.g., "baseline-gpt4o").
2. Change one variable: go to Settings → Model configuration and swap the model (e.g., switch from GPT-4o to GPT-4o-mini), adjust a behavioral parameter (temperature, max tokens), or update a prompt in Tools → Prompt manager.
3. Run a second experiment: return to Experiments, select the same dataset, give the run a descriptive name (e.g., "gpt4o-mini-test"), and click Run experiment.
4. Compare results: open both experiment results and compare the summary metrics (average scores, success rate, token usage) as well as the per-example scores in the details table. Download the CSV for each run if you want to diff the results in a spreadsheet.
5. Iterate: repeat steps 2–4 for each variable you want to test (a different model, temperature, prompt wording, and so on) until you find the configuration that scores best on your dataset.
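If you prefer a script to a spreadsheet, the comparison can be sketched as below. The CSV column layout is not documented on this page, so the code assumes that score columns share a hypothetical `score_` prefix (for example `score_correctness`). Adjust `SCORE_PREFIX` to match the headers in your own export.

```python
import csv
from statistics import mean

# Hypothetical naming convention: check the header row of your downloaded CSV.
SCORE_PREFIX = "score_"


def load_scores(path):
    """Map each score column to the list of numeric values found in the CSV."""
    columns = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            for name, value in row.items():
                # Skip non-score columns and empty cells (e.g. failed examples).
                if name and name.startswith(SCORE_PREFIX) and value not in (None, ""):
                    columns.setdefault(name, []).append(float(value))
    return columns


def compare_runs(baseline_path, candidate_path):
    """Return candidate-minus-baseline mean score for each shared dimension."""
    baseline = load_scores(baseline_path)
    candidate = load_scores(candidate_path)
    return {
        name: mean(candidate[name]) - mean(baseline[name])
        for name in sorted(baseline.keys() & candidate.keys())
    }
```

A positive value means the candidate run scored higher on that dimension. For example, `compare_runs("baseline-gpt4o.csv", "gpt4o-mini-test.csv")` would compare the two runs named in the steps above, assuming you saved the exports under those file names. Small differences on a small dataset may be noise, so treat them with caution.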