For stakeholders
Clear, non-technical overview

Quality & Testing

How we measure answer quality, track retrieval performance, and improve over time.

Retrieval metrics (what we track)

Precision@k

Of the top results we showed, how many were truly relevant? Higher = fewer off-topic sources.
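
For the curious, here is a minimal sketch of the arithmetic behind Precision@k; the function name and sample IDs are illustrative, not our production code:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved sources that are truly relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)

# Example: 3 of the top 5 results are relevant -> precision@5 = 0.6
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e"}, k=5))
```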

Recall@k

How much of the right information did we find? Higher = more complete answers.
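
The matching sketch for Recall@k, again with illustrative names and data:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant sources that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    found = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)

# Example: 2 of the 4 relevant sources appear in the top 5 -> recall@5 = 0.5
print(recall_at_k(["a", "b", "c", "d", "e"], {"a", "c", "x", "y"}, k=5))
```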

Ranking quality (Mean Reciprocal Rank, MRR)

Checks whether the best source appears near the top. Higher = less scrolling to the good stuff.
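
A minimal sketch of how Mean Reciprocal Rank can be computed across a batch of test questions (illustrative only):

```python
def mean_reciprocal_rank(test_queries):
    """Average 1/rank of the first relevant source across a batch of test questions.

    `test_queries` is a list of (retrieved_ids, relevant_ids) pairs.
    """
    reciprocal_ranks = []
    for retrieved_ids, relevant_ids in test_queries:
        rr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Best source ranked 1st for one question and 3rd for another -> MRR ≈ 0.67
print(mean_reciprocal_rank([(["a", "b", "c"], {"a"}), (["x", "y", "z"], {"z"})]))
```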

Coverage & timing

Did we find enough recent content when the question asked for "recent" or a time period?
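
A rough sketch of the idea, assuming each retrieved source carries a published_at timestamp; the field name and the 90-day default are illustrative, not our exact rules:

```python
from datetime import datetime, timedelta, timezone

def recent_coverage(retrieved_docs, window_days=90, now=None):
    """Share of retrieved sources published inside the requested time window."""
    if not retrieved_docs:
        return 0.0
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    in_window = sum(1 for doc in retrieved_docs if doc["published_at"] >= cutoff)
    return in_window / len(retrieved_docs)
```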

We calculate these per brand and for specific test sets so we can compare apples-to-apples over time.

Answer quality (automatic checks)

We also run an automated reviewer ("LLM judge") in batches. It reads only the same sources the answer was built from and scores each answer for hallucination, completeness, relevance, and temporal fit. Results are stored so we can track trends over time.
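
A minimal sketch of how batch judge scores could be recorded for trend tracking; the schema, field names, and JSON-lines format are assumptions rather than our exact pipeline:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class JudgeScore:
    """One automated review of a single answer (fields are illustrative)."""
    question_id: str
    hallucination: float   # 0 = fully grounded in the cited sources
    completeness: float
    relevance: float
    temporal_fit: float

def store_judge_scores(scores, path):
    """Append a batch of judge results to a JSON-lines file for trend tracking."""
    with open(path, "a", encoding="utf-8") as f:
        for score in scores:
            f.write(json.dumps(asdict(score)) + "\n")
```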

How we use these metrics

If precision is low

  • Tune ranking weights and boosts
  • Improve query understanding
  • Add brand-specific synonyms

If recall is low

  • Expand content coverage
  • Adjust time-window rules
  • Refine entity/keyword extraction

If ranking is off

  • Rebalance vector vs. keyword signals (see the sketch after this list)
  • Boost recency and section cues
  • Strengthen temporal matching
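
A minimal sketch of what "rebalancing" means in practice: the retrieval signals are blended with adjustable weights, and those weights are tuned against the test sets. The weights and names below are illustrative, not our production values:

```python
def hybrid_score(vector_score, keyword_score, recency_boost,
                 w_vector=0.6, w_keyword=0.3, w_recency=0.1):
    """Blend retrieval signals into a single ranking score.

    Rebalancing means adjusting these weights and re-running the test sets
    until precision and ranking quality (MRR) improve together.
    """
    return (w_vector * vector_score
            + w_keyword * keyword_score
            + w_recency * recency_boost)
```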

If quality drops

  • Review judge scores and comments
  • Address hallucination/coverage hot spots
  • Tweak prompts and citations

Targets we optimize for

Rather than publishing brand-by-brand score dumps on this page, we share the goals we optimize toward and our progress against them.

Ground truth test sets

We keep curated test questions per brand. Results (precision, recall, ranking quality, and judge quality scores) are compared over time to validate improvements. Editorial “gold” examples help us lock in high-quality answers.
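
An illustrative shape for one ground-truth entry; the field names and values are assumptions, not our exact schema:

```python
# One illustrative ground-truth entry used by the retrieval and quality checks.
ground_truth_example = {
    "brand": "example-brand",
    "question": "What did the brand announce recently?",
    "relevant_source_ids": ["doc-123", "doc-456"],  # sources a correct answer should draw on
    "time_window_days": 90,                         # expected recency for "recent" questions
    "gold_answer": "A short editorial answer used as the quality benchmark.",
}
```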