New here? First see Context & Event schema to understand the input shapes Seer consumes.
Overview
For each logged retrieval event, Seer:
- Uses an evaluator model to enumerate the minimal disjoint requirements needed to answer the query (we call this number K).
- Judges which requirements are supported by the retrieved `context` you sent.
- Produces Recall, a Precision proxy, and derived scores (F1, nDCG).
- Model-agnostic (works with any retriever/search stack).
- Label-free (computed without ground-truth annotations).
- Ops-friendly (good default thresholds, alertable, SLO-able).
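As a rough mental model, each logged event ends up with a small bundle of scores along these lines. The field names below are illustrative assumptions, not Seer’s actual schema; see Context & Event schema for the real shapes.

```ts
// Illustrative only: hypothetical shape of the per-event scores described above.
interface RetrievalScores {
  k: number;              // number of requirements the evaluator enumerated
  supported: number;      // how many of those requirements the context supported
  recall: number;         // supported / k
  precisionProxy: number; // supporting passages / total passages
  f1: number;             // harmonic mean of recall and the precision proxy
  ndcg?: number;          // present only when ranks/scores were provided
}
```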
Evaluator-defined Recall
What it measures: “Did my context include everything needed to answer?”

Definition:
- The evaluator lists requirements `r1..rK`.
- For each requirement `ri`, it marks it as present if at least one passage in your `context` supports it.
- Recall = (number of supported requirements) / K.
- `1.0` ⇒ every requirement was supported by your retrieved context.
- `< 1.0` ⇒ at least one needed requirement was missing (Seer flags these).
- K varies by query. Single-hop questions tend to have smaller K; multi-hop and compositional queries have larger K.
- Your `context` can be either `string[]` or `{ text, ... }[]`. If the object shape is used, the evaluator relies on the `text` field (see the sketch below).
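A minimal sketch of the accepted shapes and the recall computation, assuming a hypothetical evaluator output of one boolean support flag per requirement; the `ContextItem` and `recall` names are ours for illustration, not a Seer API.

```ts
// `context` can be plain strings or objects; the evaluator reads `text` on objects.
type ContextItem = string | { text: string; rank?: number; score?: number };

// Hypothetical evaluator output: one support flag per requirement r1..rK.
function recall(supported: boolean[]): number {
  const k = supported.length;
  if (k === 0) return 0; // assumption: no requirements means nothing to measure
  return supported.filter(Boolean).length / k;
}

recall([true, true]);  // 1.0, every requirement supported
recall([true, false]); // 0.5, one needed requirement missing
```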
Precision (proxy for context bloat)
What it measures: “How much of my context actually helped?”

Because there’s no gold set of “all relevant documents,” we use a proxy precision: the share of context items the evaluator cites as supporting at least one requirement.
- If your `context` is large but the evaluator only cites a couple of items, precision will be low. This signals context bloat.
- If you provide ranks or scores on each context item (e.g., `rank`, `score`), they help downstream metrics like nDCG.
We will introduce citation precision (based on downstream answer citation) later; the above is retrieval-only.
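Under the same assumptions, the proxy is just the share of context items the evaluator cites; `precisionProxy` and `supportingIdx` below are illustrative names, not a Seer API.

```ts
// Precision proxy: fraction of the context the evaluator actually cited as supporting.
function precisionProxy(totalItems: number, supportingIdx: number[]): number {
  if (totalItems === 0) return 0;
  return new Set(supportingIdx).size / totalItems; // count each cited item once
}

precisionProxy(3, [0, 1]);  // ≈ 0.67, as in the worked example below
precisionProxy(10, [0, 3]); // 0.2: a likely sign of context bloat
```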
F1 (derived)
We derive F1 from Recall and the Precision proxy:
- `F1 = 2 * (Precision * Recall) / (Precision + Recall)`
- Behaves as expected: it penalizes if either recall or precision is poor.
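A minimal helper under that definition (the `f1` name is just for illustration):

```ts
// F1 = 2PR / (P + R); returns 0 when both inputs are 0 to avoid division by zero.
const f1 = (precision: number, recall: number): number =>
  precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);

f1(0.67, 1.0); // ≈ 0.80, matching the worked example below
```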
nDCG (optional ranks/scores)
If your `context` items include `rank` (1 = top) or a `score`, Seer computes an nDCG-style score using the evaluator’s “supporting/not” signal as graded relevance.
High-level:
- Relevance: 1 for “supporting”, 0 for “not supporting” (or a small fractional value if we add soft support later).
- DCG is computed over your ranked list; IDCG is the ideal ordering (all supporting first).
- `nDCG = DCG / IDCG`, which lies in `[0, 1]`.
If you don’t provide ranks/scores, we still compute recall/precision/F1. nDCG appears when ranking data is present.
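A sketch of the computation described above, assuming binary relevance (1 if the evaluator marked the passage supporting, 0 otherwise) and a list already sorted by your `rank` or `score`; the `dcg`/`ndcg` helpers are illustrative, not a Seer API.

```ts
// rel[i] is the graded relevance at rank position i (0-based), in your ranked order.
function dcg(rel: number[]): number {
  return rel.reduce((sum, r, i) => sum + r / Math.log2(i + 2), 0);
}

function ndcg(rel: number[]): number {
  const ideal = dcg([...rel].sort((a, b) => b - a)); // ideal order: supporting items first
  return ideal === 0 ? 0 : dcg(rel) / ideal;
}

ndcg([1, 1, 0]); // 1.0: supporting passages already ranked at the top
ndcg([0, 1, 1]); // ≈ 0.69: same passages, but the supporting ones ranked last
```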
Worked example
Input (your log): a query asking for the nationality of the director of Inception, with three retrieved passages `p1`, `p2`, `p3`.

- Requirements (K = 2):
  - r1) The director of Inception
  - r2) The nationality of that director
- Present:
  - r1 supported by `{p1}`
  - r2 supported by `{p2}`
- Missing: none
- Recall = 2/2 = 1.0
- Precision proxy = supporting documents `{p1, p2}` / total documents `3` ≈ 0.67
- F1 = `2 * (1.0 * 0.67) / (1.0 + 0.67)` ≈ 0.80
- nDCG (using scores for ordering): supporting documents at positions 1 and 2 ⇒ nDCG ≈ 1.0 (near-ideal order)
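The same numbers, reproduced as a standalone snippet (the passage labels `p1..p3` and variable names are just for illustration):

```ts
// Worked example: K = 2 requirements, 3 retrieved passages, p1 and p2 supporting.
const K = 2;
const supportedRequirements = 2;      // r1 via p1, r2 via p2
const totalPassages = 3;              // p1, p2, p3
const supportingPassages = 2;         // {p1, p2}

const recall = supportedRequirements / K;                    // 1.0
const precision = supportingPassages / totalPassages;        // ≈ 0.67
const f1 = (2 * precision * recall) / (precision + recall);  // ≈ 0.80

// nDCG with the supporting passages at rank positions 1 and 2:
const dcg = (rel: number[]) =>
  rel.reduce((sum, r, i) => sum + r / Math.log2(i + 2), 0);
const relevance = [1, 1, 0];                                 // p1, p2 support; p3 does not
const ndcg = dcg(relevance) / dcg([...relevance].sort((a, b) => b - a)); // 1.0

console.log({ recall, precision, f1, ndcg });
```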
Thresholds & alerting (recommended defaults)
- Recall: alert at `< 1.0` for high-priority surfaces, or at `< 0.8–0.9` for lower priority.
- Precision (proxy): watch long tails where precision `< 0.3–0.4` ⇒ context bloat.
- F1: use as a simple roll-up; alert if it drops by Δ 0.15–0.25 release-over-release.
- nDCG: if you use ranking, alert on material drops (e.g., Δ 0.1).
You can slice these alerts by `env`, `feature_flag`, or any metadata field (e.g., tenant, product area). See Alerting for setup instructions.
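One way to wire these defaults into a simple alert check; the `shouldAlert` function and `ScoreSnapshot` shape below are hypothetical, not a Seer API.

```ts
// Hypothetical roll-up of the recommended defaults; tune per surface.
interface ScoreSnapshot { recall: number; precision: number; f1: number; ndcg?: number }

function shouldAlert(current: ScoreSnapshot, previousRelease?: ScoreSnapshot): boolean {
  if (current.recall < 0.8) return true;                        // missing requirements
  if (current.precision < 0.35) return true;                    // likely context bloat
  if (previousRelease) {
    if (previousRelease.f1 - current.f1 > 0.2) return true;     // F1 regression
    if (current.ndcg !== undefined && previousRelease.ndcg !== undefined &&
        previousRelease.ndcg - current.ndcg > 0.1) return true; // ranking regression
  }
  return false;
}
```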
FAQ
Do I need labeled data? No. The evaluator determines requirements (K) and support directly from your retrieval outputs.

Does recall depend on my final LLM’s answer? No. These are retrieval-stage metrics. (Answer-stage citation metrics are planned separately.)

How do I improve precision without hurting recall? Tune rankers/rerankers, filter boilerplate, and trim redundant passages. nDCG helps validate ordering changes.

What if my `context` is `string[]` and not objects? Totally fine. If you later add `{ id, rank, score }`, you’ll unlock deeper analytics like nDCG and per-passage views.