Evaluation and quality judgment

Runs the ARC Raiders RAG stack against the labeled dataset, logs each variant to Braintrust, and reports retrieval metrics separately from generation judgment.

Evaluation harness

Combines labeled ARC Raiders questions, retrieval variants, deterministic retrieval scoring, and an LLM judge that rates correctness, relevance, and hallucination risk.
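
As an illustration of what deterministic retrieval scoring can look like, here is a minimal sketch using recall@k and reciprocal rank. The metric choice, the function names, and the document IDs are assumptions made for this example; the harness may score retrieval differently.

```python
from typing import Sequence, Set


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the labeled relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def reciprocal_rank(retrieved_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical IDs, for illustration only (not taken from the real dataset).
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4"]
relevant = {"doc_2", "doc_4"}

recall = recall_at_k(retrieved, relevant, k=3)  # one of two relevant docs is in the top 3
rr = reciprocal_rank(retrieved, relevant)       # first relevant doc sits at rank 2
```

Both functions depend only on the ranked IDs and the labels, so the same inputs always give the same score. That keeps retrieval numbers comparable across runs, independent of the LLM judge.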

Uses the same retrieval and prompt stack as the current RAG demo. Each run is a live multi-run evaluation, not a replay of stored fixtures.

The harness is ready.

Run the dataset to generate the first retrieval and generation scorecards, then use the same framework to compare prompt, retrieval, rerank, or model changes.
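
One way to compare a change against a baseline is to average each metric per variant and look at the difference. The sketch below assumes per-question rows stored as plain dictionaries. The metric names, variant names, and all numbers are made-up placeholders, not results from the harness.

```python
from statistics import mean
from typing import Dict, List


def scorecard(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over the per-question rows of one variant."""
    metrics = rows[0].keys()
    return {m: mean(r[m] for r in rows) for m in metrics}


def compare(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Per-metric difference, candidate minus baseline (positive means the candidate scored higher)."""
    return {m: candidate[m] - baseline[m] for m in baseline}


# Placeholder per-question scores for two hypothetical variants.
baseline_rows = [
    {"recall_at_5": 0.5, "judge_correctness": 1.0},
    {"recall_at_5": 1.0, "judge_correctness": 0.0},
]
rerank_rows = [
    {"recall_at_5": 1.0, "judge_correctness": 1.0},
    {"recall_at_5": 1.0, "judge_correctness": 0.5},
]

baseline_card = scorecard(baseline_rows)
rerank_card = scorecard(rerank_rows)
delta = compare(baseline_card, rerank_card)
```

Keeping retrieval and judge metrics as separate keys makes it easier to see whether a change helped retrieval, generation, or both.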