Seer turns unlabeled retrieval traffic into actionable quality signals. This page defines each metric, how we compute it, and how to read it in the app.
New here? First see Context & Event schema to understand the input shapes Seer consumes.
Overview
For each logged retrieval event, Seer:
- Uses an evaluator model to enumerate the minimal disjoint requirements needed to answer the query (we call the number of requirements K).
- Judges which requirements are supported by the retrieved context you sent.
- Produces Recall, a Precision proxy, and derived scores (F1, nDCG).
The metrics are:
- Model-agnostic (they work with any retriever/search stack).
- Label-free (computed without ground-truth annotations).
- Ops-friendly (good default thresholds, alertable, SLO-able).
Evaluator-defined Recall
What it measures: “Did my context include everything needed to answer?”
Definition
- The evaluator lists requirements: r1..rK.
- For each requirement ri, it marks it present if at least one passage in your context supports it.
- Recall is the fraction of requirements marked present: Recall = (requirements marked present) / K.
- 1.0 ⇒ every requirement was supported by your retrieved context.
- < 1.0 ⇒ at least one needed requirement was missing (Seer flags these).
- K varies by query. Single-hop questions tend to have smaller K; multi-hop and compositional queries have larger K.
- Your context can be either string[] or { text, ... }[]. If the object shape is used, the evaluator relies on the text field.
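The recall definition above can be sketched in a few lines. This is a minimal illustration, not Seer's implementation: the function names `passage_text` and `evaluator_recall` are hypothetical, and the per-requirement present/missing flags are assumed to come from the evaluator. Recall is taken as the number of requirements marked present divided by K, which matches the worked example on this page.

```python
from typing import Mapping, Sequence, Union

# A context item is either a plain string or an object with a "text" field.
ContextItem = Union[str, Mapping[str, object]]


def passage_text(item: ContextItem) -> str:
    """Return the text the evaluator would read for one context item."""
    if isinstance(item, str):
        return item
    return str(item["text"])


def evaluator_recall(present: Sequence[bool]) -> float:
    """Fraction of the K requirements that were marked present."""
    k = len(present)
    if k == 0:
        # Assumption: the evaluator always lists at least one requirement.
        raise ValueError("K must be at least 1")
    return sum(present) / k
```

For instance, `evaluator_recall([True, False])` describes a query with K = 2 where one requirement was missing, so the event would be flagged.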
Precision (proxy for context bloat)
What it measures: “How much of my context actually helped?”
Because there’s no gold set of “all relevant documents,” we use a proxy precision: the share of your context items that the evaluator cites as supporting at least one requirement.
Precision proxy = (supporting context items) / (total context items)
- If your context is large but the evaluator only cites a couple of items, precision will be low. This signals context bloat.
- If you provide ranks or scores on each context item (e.g., rank, score), they help downstream metrics like nDCG.
We will introduce citation precision (based on downstream answer citation) later; the above is retrieval-only.
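As a rough sketch of the proxy, assuming it is the count of context items the evaluator cites divided by the total number of context items (the ratio used in the worked example on this page). The function name `precision_proxy` and the choice to return 0.0 for an empty context are assumptions made for illustration.

```python
from typing import Iterable, Sequence


def precision_proxy(supporting_ids: Iterable[str], context_ids: Sequence[str]) -> float:
    """Share of context items cited as supporting at least one requirement."""
    if not context_ids:
        # Assumption: an empty context scores 0.0 rather than raising.
        return 0.0
    cited = set(supporting_ids)
    hits = sum(1 for cid in context_ids if cid in cited)
    return hits / len(context_ids)
```

A context of 20 passages where only 2 are cited gives 0.1, which is the kind of value that points to context bloat.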
F1 (derived)
We derive F1 from Recall and the Precision proxy as their harmonic mean:
F1 = 2 * (Recall * Precision) / (Recall + Precision)
- Behaves as expected: F1 is penalized if either recall or precision is poor.
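A minimal sketch of the derivation, assuming the standard harmonic-mean form of F1 (the same arithmetic the worked example on this page uses). The name `f1_score` and the zero-division guard are illustrative choices, not confirmed Seer behavior.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and the precision proxy."""
    if recall + precision == 0:
        # Assumption: define F1 as 0.0 when both inputs are 0.
        return 0.0
    return 2 * recall * precision / (recall + precision)
```

Because the harmonic mean is pulled toward the smaller input, perfect recall with a precision proxy of 0.1 still yields an F1 of only about 0.18.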
nDCG (optional ranks/scores)
If your context items include rank (1 = top) or a score, Seer computes an nDCG-style score using the evaluator’s “supporting/not” signal as graded relevance.
High-level:
- Relevance: 1 for “supporting”, 0 for “not supporting” (or a small fractional value if we add soft support later).
- DCG is computed over your ranked list; IDCG is the ideal ordering (all supporting first).
- nDCG = DCG / IDCG, in [0, 1].
If you don’t provide ranks/scores, we still compute recall/precision/F1. nDCG appears when ranking data is present.
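The high-level steps can be sketched as follows. This assumes binary relevance (1 for supporting, 0 otherwise) and the conventional log2 position discount; the page only says "nDCG-style", so the exact discount Seer uses is an assumption here, as are the helper names `relevance_by_rank` and `ndcg`.

```python
import math
from typing import Iterable, List, Mapping, Sequence


def relevance_by_rank(items: Iterable[Mapping], supporting_ids: set) -> List[int]:
    """Order items by rank (1 = top) and map each to 1 (supporting) or 0."""
    ordered = sorted(items, key=lambda item: item["rank"])
    return [1 if item["id"] in supporting_ids else 0 for item in ordered]


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg(sorted(relevances, reverse=True))
    if ideal == 0:
        # Assumption: no supporting items at all scores 0.0.
        return 0.0
    return dcg(relevances) / ideal
```

With these definitions, placing both supporting passages ahead of a non-supporting one scores 1.0, while pushing them to positions 2 and 3 lowers the score to roughly 0.69.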
Worked example
Input (your log): a single retrieval event with three scored context passages, among them p1 and p2.
- Requirements (K=2):
  - f1) The director of Inception
  - f2) The nationality of that director
- Present:
  - f1 supported by {p1}
  - f2 supported by {p2}
- Missing: none
- Recall = 2/2 = 1.0
- Precision proxy = supporting documents {p1, p2} / total documents 3 = 0.67
- F1 = 2*(1.0*0.67)/(1.0+0.67) ≈ 0.80
- nDCG (using scores for ordering): supporting documents at positions 1 and 2 ⇒ nDCG = 1.0 (all supporting passages ranked first, which is the ideal order)
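The example can be reproduced end to end with a short script. The query string, passage texts, the id p3, and the scores below are hypothetical stand-ins chosen to match the counts in the example; the support mapping mirrors the evaluator result listed above, and the log2 discount in nDCG is an assumption.

```python
import math

# Hypothetical log entry: query, texts, and scores are illustrative only.
query = "What is the nationality of the director of Inception?"
context = [
    {"id": "p1", "text": "Inception was directed by Christopher Nolan.", "score": 0.92},
    {"id": "p2", "text": "Christopher Nolan is a British-American filmmaker.", "score": 0.85},
    {"id": "p3", "text": "Inception was released in 2010.", "score": 0.40},
]

# Evaluator result from the example: each requirement and its supporting passages.
support = {"f1": {"p1"}, "f2": {"p2"}}

# Recall: requirements with at least one supporting passage, divided by K.
recall = sum(1 for ids in support.values() if ids) / len(support)

# Precision proxy: supporting passages divided by all passages sent.
supporting = set().union(*support.values())
precision = sum(1 for item in context if item["id"] in supporting) / len(context)

# F1: harmonic mean of the two.
f1 = 2 * recall * precision / (recall + precision)


def dcg(values):
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(v / math.log2(pos + 1) for pos, v in enumerate(values, start=1))


# nDCG: order by score (highest first), then compare against the ideal order.
ranked = sorted(context, key=lambda item: item["score"], reverse=True)
rels = [1 if item["id"] in supporting else 0 for item in ranked]
ndcg = dcg(rels) / dcg(sorted(rels, reverse=True))

print(round(recall, 2), round(precision, 2), round(f1, 2), round(ndcg, 2))
```

Dropping p3 from the context would raise the precision proxy to 1.0 without changing recall, which is the kind of trimming the Precision section describes.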
Thresholds & alerting (recommended defaults)
- Recall: alert at < 1.0 for high-priority surfaces, or at < 0.8–0.9 for lower-priority ones.
- Precision (proxy): watch long tails where precision < 0.3–0.4 ⇒ context bloat.
- F1: use as a simple roll-up; alert if it drops by Δ 0.15–0.25 release-over-release.
- nDCG: if you use ranking, alert on material drops (e.g., Δ 0.1).
You can scope these thresholds by env, feature_flag, or any metadata field (e.g., tenant, product area). See Alerting for setup instructions.
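To make the defaults concrete, here is a hypothetical check that compares one release's aggregate metrics against the previous release. It is not Seer's alerting API: the function name, the dictionary keys, and the specific cut-offs (the low end of each recommended range) are all assumptions picked for illustration.

```python
from typing import Dict, List

# Assumed cut-offs, taken from the low end of the recommended ranges.
LOW_PRIORITY_RECALL_FLOOR = 0.8
PRECISION_FLOOR = 0.3
F1_MAX_DROP = 0.15
NDCG_MAX_DROP = 0.1


def check_alerts(current: Dict[str, float], previous: Dict[str, float],
                 high_priority: bool) -> List[str]:
    """Return the names of metrics that breach the default thresholds."""
    alerts = []
    recall_floor = 1.0 if high_priority else LOW_PRIORITY_RECALL_FLOOR
    if current["recall"] < recall_floor:
        alerts.append("recall")
    if current["precision"] < PRECISION_FLOOR:
        alerts.append("precision")
    if previous["f1"] - current["f1"] >= F1_MAX_DROP:
        alerts.append("f1")
    # nDCG is only present when ranking data was logged.
    if "ndcg" in current and "ndcg" in previous:
        if previous["ndcg"] - current["ndcg"] >= NDCG_MAX_DROP:
            alerts.append("ndcg")
    return alerts
```

The same recall of 0.9 triggers an alert on a high-priority surface but not on a lower-priority one, which is the distinction the Recall default draws.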
FAQ
Do I need labeled data?
No. The evaluator determines requirements (K) and support directly from your retrieval outputs.
Does recall depend on my final LLM’s answer?
No. These are retrieval-stage metrics. (Answer-stage citation metrics are planned separately.)
How do I improve precision without hurting recall?
Tune rankers/rerankers, filter boilerplate, and trim redundant passages. nDCG helps validate ordering changes.
What if my context is string[] and not objects?
Totally fine. If you later add {id, rank, score}, you’ll unlock deeper analytics like nDCG and per-passage views.
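If you want to move from string[] to the object shape, a small conversion helper is usually enough. The sketch below assumes the passages are already in retrieval order, so the list position becomes the rank (1 = top); the helper name and the "p1", "p2", ... id scheme are hypothetical, and only the field names id, text, rank, and score come from this page.

```python
from typing import Dict, List, Optional, Sequence


def to_context_objects(passages: Sequence[str],
                       scores: Optional[Sequence[float]] = None) -> List[Dict]:
    """Wrap plain passage strings as objects with id, text, rank, and optional score."""
    if scores is not None and len(scores) != len(passages):
        raise ValueError("scores must have one entry per passage")
    items = []
    for position, text in enumerate(passages, start=1):
        item = {"id": f"p{position}", "text": text, "rank": position}
        if scores is not None:
            item["score"] = scores[position - 1]
        items.append(item)
    return items
```

Stable ids from your own document store are preferable to positional ones if you have them, since they let per-passage views line up across events.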