Compare feedback

The Compare feedback tab shows a side-by-side analysis of human ratings and AI Judge scores on the same answers. It requires both feedback sources to have rated at least some of the same answers. Key metrics include:

Matched turns: the number of answers that have been rated by both a human and the AI Judge.
Verdict agreement: how often the human and the AI Judge agree on whether an answer is good, needs improvement, or is poor.
Overall quality: a Human vs. AI Judge score comparison showing each side's average quality rating and evaluation count.
Per-dimension comparison: a table comparing Human and AI Judge averages for each dimension (Relevance, Completeness, Faithfulness, Coherence), along with the MAE (Mean Absolute Error), which is the average absolute difference between the human and AI Judge scores. An MAE near 0 means strong agreement; an MAE of 1 or more indicates significant disagreement.
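
To make the MAE column concrete, here is a minimal sketch of how a per-dimension MAE could be computed from paired scores. The score values, the rating scale, and the names human_scores, ai_scores, and mean_absolute_error are hypothetical and chosen for illustration; the product's actual calculation and data format are not documented here and may differ.

```python
from statistics import mean

# Hypothetical scores for three matched turns. The rating scale and the
# data layout are assumptions, not the product's export format.
human_scores = {
    "Relevance":    [5, 4, 3],
    "Completeness": [4, 4, 2],
    "Faithfulness": [5, 5, 4],
    "Coherence":    [4, 3, 5],
}
ai_scores = {
    "Relevance":    [5, 4, 4],
    "Completeness": [2, 5, 4],
    "Faithfulness": [5, 5, 4],
    "Coherence":    [4, 4, 5],
}

def mean_absolute_error(human, ai):
    """Average of |human - ai| over scores paired by matched turn."""
    if len(human) != len(ai):
        raise ValueError("scores must be paired per matched turn")
    return mean(abs(h - a) for h, a in zip(human, ai))

mae_by_dimension = {
    dim: mean_absolute_error(human_scores[dim], ai_scores[dim])
    for dim in human_scores
}

# Sort so the dimensions with the most disagreement come first.
ranked = sorted(mae_by_dimension.items(), key=lambda kv: kv[1], reverse=True)
for dim, mae in ranked:
    print(f"{dim}: MAE = {mae:.2f}")
```

With this sample data, Completeness has the highest MAE (about 1.67) and Faithfulness the lowest (0), so Completeness would be the first dimension to investigate.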

Focus on dimensions with high MAE values. These are the areas where the AI Judge and your users disagree most, and they represent your highest-priority improvement opportunities.

FAQ

What is the Compare feedback tab?

The Compare feedback tab shows a side-by-side analysis of human ratings and AI Judge scores on the same answers. It requires both feedback sources to have rated at least some of the same answers so there is overlap to compare.

What does Matched turns mean in Compare feedback?

Matched turns is the number of answers that have been rated by both a human and the AI Judge. It reflects how many overlapping rated answers are available for comparison.
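
As an illustration, the matched-turn count can be thought of as the size of the overlap between the two sets of rated answers. The answer IDs and variable names below are hypothetical, and this is only a sketch of the idea, not the product's implementation.

```python
# Hypothetical answer IDs rated by each feedback source.
human_rated = {"a1", "a2", "a3", "a5"}
ai_rated = {"a2", "a3", "a4", "a5", "a6"}

# Matched turns are the answers present in both sets.
matched = human_rated & ai_rated
matched_turns = len(matched)

print(sorted(matched), matched_turns)
```

Here three answers (a2, a3, a5) were rated by both sources, so only those three would feed the comparison metrics.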

What is Verdict agreement in Compare feedback?

Verdict agreement measures how often the human and AI agree on whether an answer is good, needs improvement, or is poor. It summarizes alignment between the two feedback sources at the verdict level.
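
One plausible way to read this metric is as the share of matched answers on which both sources give the same verdict. The sketch below assumes that interpretation; the verdict labels, answer IDs, and variable names are hypothetical, and the product may compute the figure differently.

```python
# Hypothetical verdicts keyed by answer ID (labels are illustrative).
human_verdicts = {
    "a2": "good",
    "a3": "poor",
    "a5": "needs improvement",
    "a7": "good",
}
ai_verdicts = {
    "a2": "good",
    "a3": "needs improvement",
    "a5": "needs improvement",
    "a9": "poor",
}

# Only answers rated by both sources can be compared.
matched_ids = human_verdicts.keys() & ai_verdicts.keys()

# Count the matched answers where the two verdicts are identical.
agreements = sum(human_verdicts[i] == ai_verdicts[i] for i in matched_ids)

# Leave the rate undefined (None) when there is no overlap to compare.
agreement_rate = agreements / len(matched_ids) if matched_ids else None

print(f"{agreements} of {len(matched_ids)} matched answers agree")
```

In this sample, two of the three matched answers share a verdict, so the agreement rate is about 67%.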

What does the Overall quality metric show in Compare feedback?

Overall quality is a Human vs. AI Judge score comparison that shows each side's average quality rating and evaluation count. It helps you see how the two sources compare in aggregate.

How should I interpret the per-dimension comparison MAE in Compare feedback?

The per-dimension comparison table shows Human and AI Judge averages for each dimension (Relevance, Completeness, Faithfulness, Coherence) along with the MAE (Mean Absolute Error). An MAE near 0 means strong agreement, while an MAE of 1 or more indicates significant disagreement. Focus on dimensions with high MAE values: they show where the AI Judge and your users disagree most, and they are the highest-priority improvement opportunities.
