Guide · 12 min · Scoring

Reading a RoastIQ verdict end to end.

An ad scored 62. Three reviewers disagreed. Here is how the decision trace settled it, and what every line of the verdict actually means.

5 KPIs · 1 composite · 1 verdict · 1 decision trace
Oussama Nakhil, Founder, SaliencyLab
Ex-NielsenIQ, ex-L'Oréal Groupe. Writing from Casablanca + Paris.

01, The frozen 5 KPIs

RoastIQ scores every ad on the same five KPIs, in the same order, with the same weights. The names are frozen. The weights are frozen. This is not a stylistic choice; it is the only way a score from March can be compared to a score from October, or one client's ad compared to another's at the same benchmark percentile.

  • 01 Beat the Skip: 25%
  • 02 Get Noticed: 20%
  • 03 Brand Impact: 20%
  • 04 Sell Proposition: 20%
  • 05 Build Brand: 15%

Beat the Skip carries the heaviest weight because in feed environments (TikTok, Reels, Shorts) nothing else matters if the viewer is gone in two seconds. Get Noticed, Brand Impact, and Sell Proposition share the middle weight band at 20% each. Build Brand sits at 15%, not because long-term equity is unimportant, but because for a single ad it is the slowest signal to move.

02, How the composite is computed

The composite is a weighted sum, not a simple average: each KPI counts in proportion to its weight, and the weights add up to 100%. The difference matters when one KPI is dragging.

composite = (BeatTheSkip × 0.25) + (GetNoticed × 0.20) + (BrandImpact × 0.20) + (SellProp × 0.20) + (BuildBrand × 0.15)

For our worked example (Beat the Skip 70, Get Noticed 65, Brand Impact 45, Sell Proposition 68, Build Brand 60), the math is:

= (70 × 0.25) + (65 × 0.20) + (45 × 0.20) + (68 × 0.20) + (60 × 0.15)
= 17.5 + 13.0 + 9.0 + 13.6 + 9.0
= 62.1, which rounds to 62.
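
The same arithmetic can be sketched in TypeScript. This is an illustration, not RoastIQ's actual implementation: the names `WEIGHTS_PCT`, `rawComposite`, and `composite` are hypothetical, and rounding to the nearest integer is an assumption based on 62.1 being reported as 62.

```typescript
// KPI names and percentage weights as stated in section 01.
const WEIGHTS_PCT = {
  beat_the_skip: 25,
  get_noticed: 20,
  brand_impact: 20,
  sell_proposition: 20,
  build_brand: 15,
} as const;

type KpiName = keyof typeof WEIGHTS_PCT;
type KpiScores = Record<KpiName, number>;

// Weighted sum on integer percentages, divided by 100 once at the end.
// This sidesteps binary floating-point drift on products like 68 × 0.20.
function rawComposite(scores: KpiScores): number {
  let total = 0;
  for (const name of Object.keys(WEIGHTS_PCT) as KpiName[]) {
    total += scores[name] * WEIGHTS_PCT[name];
  }
  return total / 100;
}

// Assumption: the reported composite is the raw value rounded to the nearest integer.
function composite(scores: KpiScores): number {
  return Math.round(rawComposite(scores));
}

// Scores from the worked example.
const example: KpiScores = {
  beat_the_skip: 70,
  get_noticed: 65,
  brand_impact: 45,
  sell_proposition: 68,
  build_brand: 60,
};

console.log(rawComposite(example), composite(example)); // 62.1 62
```

Keeping the weights as integer percentages also makes it easy to check that they sum to 100 before any score is computed.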
"The composite is a weighted sum, not an average. A 45 on Brand Impact does not disappear just because the rest is fine, it bleeds nine points out of a possible twenty."

03, The three verdicts

Three states. One ladder. No half-verdicts.

Scale

composite ≥ 70 AND no KPI < 55

Strong skip resilience, strong impact, no hidden weakness. Push spend, build variants, do not over-engineer.

Sharpen

composite 55–69

Real strengths, one or two weak spots. The decision trace tells you which KPI to fix and roughly where in the creative.

Rebuild

composite < 55 OR two+ KPIs < 45

Structural problems. Editing won't save it. Start over from the brief, not from the cut.

The double rule on Scale (composite ≥ 70 and no KPI below 55) exists to stop a single weak KPI from being averaged away by four strong ones. The double rule on Rebuild does the same on the other side: two or more KPIs under 45 is a Rebuild even if the composite scrapes 56.
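
The ladder can be written as a small decision function. This is a minimal sketch under stated assumptions, not the product's code: `verdictFor` is a hypothetical name, Rebuild is checked before Scale, and any ad that meets neither rule is treated as Sharpen. That last assumption covers the case the rules above leave open, a composite of 70 or more with one KPI below 55.

```typescript
type Verdict = "scale" | "sharpen" | "rebuild";

// composite: the rounded composite score; kpiScores: the five KPI scores.
function verdictFor(composite: number, kpiScores: number[]): Verdict {
  const under45 = kpiScores.filter((s) => s < 45).length;
  const anyUnder55 = kpiScores.some((s) => s < 55);

  // Rebuild: composite < 55 OR two or more KPIs < 45.
  if (composite < 55 || under45 >= 2) return "rebuild";

  // Scale: composite >= 70 AND no KPI < 55.
  if (composite >= 70 && !anyUnder55) return "scale";

  // Assumption: everything else is Sharpen, including a composite >= 70
  // that is held back by a single KPI under 55.
  return "sharpen";
}

// Worked example: composite 62, Brand Impact exactly 45 (not under 45).
console.log(verdictFor(62, [70, 65, 45, 68, 60])); // sharpen

// Composite scrapes 56, but two KPIs sit under 45.
console.log(verdictFor(56, [80, 80, 40, 40, 60])); // rebuild
```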

04, Confidence labels

Every KPI score comes with a confidence label. It is not a number you can multiply by anything; it is a flag that tells you how much to trust the verdict before you act on it.

Three inputs shape confidence:

  • Benchmark density. The score is a percentile against the benchmark pool. If your ad's category has 500+ ads in the pool, confidence is high. If it has 40, the label says so. We never silently downgrade: the trace exposes the cohort size.
  • Replicate agreement. The model is run multiple times. If the KPI score drifts more than a small band across replicates, confidence drops.
  • Attribute detection certainty. Whether the model is sure the brand logo appeared at 1.2s or is guessing. Detection is roughly 85% accurate; the formal LLM-as-judge inter-rater reliability paper is in the pipeline.

05, Reading the decision trace

Every analysis stores a Zod-validated record. This is what makes a RoastIQ verdict auditable instead of vibes-based. A trimmed example:

{
  "run_id": "rq_2026_05_18_a4f1",
  "model_version": "gemini-2.5-flash@2026-04-22",
  "benchmark_pool_version": "pool_v7_2026_05_05",
  "composite": 62,
  "verdict": "sharpen",
  "matrix_label": "missed_opportunity",
  "kpis": {
    "beat_the_skip": { "score": 70, "confidence": "high", "evidence": ["frame_0.4s", "frame_1.2s"] },
    "get_noticed": { "score": 65, "confidence": "high", "evidence": ["frame_2.1s"] },
    "brand_impact": { "score": 45, "confidence": "medium", "evidence": ["logo_first_seen_at_6.8s"] },
    "sell_proposition": { "score": 68, "confidence": "high", "evidence": ["transcript_3.2s"] },
    "build_brand": { "score": 60, "confidence": "medium", "evidence": ["closing_card_8.4s"] }
  },
  "recommendations": [
    "Move first distinctive brand asset before 2.0s",
    "Extend hero shot of product by ~0.4s",
    "Consider second logo appearance at end card"
  ],
  "confidence_labels": { "overall": "high" }
}

The fields you cannot skip: model_version (so you can reproduce the run), benchmark_pool_version (so the percentile makes sense), evidence (frame and transcript pointers, not vibes), and recommendations (specific enough to act on).
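
One way to enforce those four fields before trusting a trace is a small audit check. The sketch below is dependency-free TypeScript rather than the Zod schema the product uses, which is not shown in this guide. The `DecisionTrace` shape is inferred from the trimmed example above, and `missingAuditFields` is a hypothetical helper.

```typescript
// Shape inferred from the trimmed example; the real schema may carry more fields.
interface KpiEntry {
  score: number;
  confidence: string;
  evidence: string[];
}

interface DecisionTrace {
  run_id: string;
  model_version: string;
  benchmark_pool_version: string;
  composite: number;
  verdict: string;
  kpis: Record<string, KpiEntry>;
  recommendations: string[];
}

// Returns the names of the "cannot skip" fields that are absent or empty.
function missingAuditFields(trace: Partial<DecisionTrace>): string[] {
  const missing: string[] = [];
  if (!trace.model_version) missing.push("model_version");
  if (!trace.benchmark_pool_version) missing.push("benchmark_pool_version");

  // Evidence counts as missing if there are no KPIs or any KPI has no pointers.
  const kpis = trace.kpis ?? {};
  const names = Object.keys(kpis);
  if (names.length === 0 || names.some((n) => kpis[n].evidence.length === 0)) {
    missing.push("evidence");
  }

  if (!trace.recommendations || trace.recommendations.length === 0) {
    missing.push("recommendations");
  }
  return missing;
}
```

An empty result means the trace carries everything needed to reproduce the run and follow the evidence. A non-empty result names exactly what is absent.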

06, Worked example: composite 62

A 9:16 brand ad for a mid-market beauty client. Eight seconds. Product-led, with a brand logo arriving late at 6.8s. Here is the full KPI table the report returned.

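| KPI | Weight | Score | Contribution | Confidence | Band |
|---|---|---|---|---|---|
| Beat the Skip | 25% | 70 | 17.5 | high | Strong |
| Get Noticed | 20% | 65 | 13.0 | high | Moderate |
| Brand Impact | 20% | 45 | 9.0 | medium | Moderate |
| Sell Proposition | 20% | 68 | 13.6 | high | Moderate |
| Build Brand | 15% | 60 | 9.0 | medium | Moderate |
| Composite | 100% | 62 | 62.1 | high | Sharpen |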

The verdict is Sharpen. The composite is in the 55–69 band and no KPI is under 45, so it does not trip the Rebuild double rule. The weak spot is unambiguous: Brand Impact at 45. The trace pinpoints why: the first distinctive brand asset does not appear until 6.8s of an 8-second cut. The recommendation is mechanical: move the asset before 2.0s.

"A score of 62 is not a grade. It is an address. It tells you which floor to walk to."

07, Three reviewers, three reads

Before the trace existed, this exact ad went through an internal review. Three people. Three verdicts. Same eight seconds.

Account Director
"Ship it."

Saw a clean hook, a clear product shot, no obvious craft issue. Pattern-matched to ads that performed last quarter. Did not notice the logo timing.

Creative Director
"Rebuild."

Bothered by the late brand entry and the flat closing card. Felt the whole spot lacked a distinctive idea. Wanted to go back to brief.

Strategist
"Sharpen."

Argued the hook and proposition were earning their weight. Suspected the brand asset placement was the issue, but had no way to prove it inside the meeting.

The decision trace did not pick a winner by personality. It picked one by evidence: KPI scores, benchmark percentiles, a frame-level pointer to the 6.8s logo entry, and a recommendation that any editor could execute in twenty minutes. The Strategist was right. The other two were not wrong about what they noticed; they were guessing about what to do next.

08, The honest caveat

RoastIQ scores are model predictions, not measurements. We validate them against public engagement and click-intent signals only: TikTok engagement, TikTok CTR, and YouTube view counts. As of 2026-05-05, held-out, out-of-sample Spearman ρ is between +0.30 and +0.32 across those outcomes. Pool-wide quintile lift is 6.5×: ads in the top predicted quintile outperform the bottom quintile by that factor on the public outcome signal.
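
For readers who want to see what a quintile lift measures, here is one plausible way to compute it. The guide does not specify the exact method, so treat this as a sketch: `quintileLift` is a hypothetical helper, and the use of a ratio of mean outcomes (rather than medians or another statistic) is an assumption.

```typescript
// Hypothetical helper: mean outcome of the top predicted quintile divided by
// the mean outcome of the bottom predicted quintile.
// predicted[i] and outcome[i] must refer to the same ad.
function quintileLift(predicted: number[], outcome: number[]): number {
  if (predicted.length !== outcome.length || predicted.length < 5) {
    throw new Error("need two equal-length arrays covering at least 5 ads");
  }

  // Ad indices sorted by predicted score, lowest first.
  const order = predicted
    .map((_, i) => i)
    .sort((a, b) => predicted[a] - predicted[b]);

  // Quintile size, rounded down; ads in the middle are ignored.
  const size = Math.floor(order.length / 5);

  const meanOutcome = (indices: number[]): number =>
    indices.reduce((sum, i) => sum + outcome[i], 0) / indices.length;

  const bottom = meanOutcome(order.slice(0, size));
  const top = meanOutcome(order.slice(order.length - size));

  // Note: a bottom-quintile mean of zero yields Infinity.
  return top / bottom;
}
```

A lift above 1 means the ads the model ranked highest also did better on the outcome signal. It says nothing about outcomes the signal does not capture.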

What this guide does not claim:

  • That a 62 predicts ROAS, attributed sales, or in-market business outcomes.
  • That a 62 predicts brand recall the way a survey would. Different construct.
  • That Meta-Feed scores have the same validation depth. For brand ads on Meta, outcome data is sparse and the scores are directional defaults.

What it does claim, plainly: a Sharpen verdict with a Brand Impact score of 45 and a frame-level pointer to a late logo is more decision-useful than three smart people disagreeing in a meeting.

Frequently asked

What does a composite score of 62 actually mean?

Composite 62 falls inside the Sharpen band (55–69). It means the ad has real strengths but one or two KPIs are pulling the verdict down. The decision trace tells you which KPI and where in the creative to act.

Can the KPI weights ever change?

No. The 5 KPI names and their weights (Beat the Skip 25, Get Noticed 20, Brand Impact 20, Sell Proposition 20, Build Brand 15) are frozen by product rule. Changing them would invalidate historical comparisons and benchmark percentiles.

Why are scores predictions and not measurements?

RoastIQ is a model. We validate predictions against public engagement and click-intent signals from TikTok and YouTube, with a held-out, out-of-sample Spearman ρ of +0.30 to +0.32 as of 2026-05-05. We never claim to predict ROAS, attributed sales, or brand recall.

How do I know the verdict is trustworthy?

Every analysis stores a decision trace: model_version, benchmark_pool_version, KPI scores, evidence pointers, recommendations. Confidence labels reflect benchmark density, replicate agreement, and attribute detection certainty. You can audit any verdict against its evidence.