What "benchmarked" really means

01, Foundations

A score without a peer is just a number.

An ad scoring 62 means nothing until someone in the room asks the only question that matters: 62 vs what? Sixty-two against a pool of 30 random ads from your category is a coin flip wearing a uniform. Sixty-two against 500 ads on the same platform, in the same category and the same media format, is a position. A position is a decision input. A number is decoration.
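
To make "position" concrete, here is a minimal sketch of a percentile position inside a cohort. The function name, the tie-handling rule, and the ten cohort scores are illustrative assumptions, not the scoring pipeline itself.

```python
from bisect import bisect_left, bisect_right


def percentile_position(score, cohort_scores):
    """Share of the cohort (0-100) scoring below `score`; ties count as half."""
    ranked = sorted(cohort_scores)
    if not ranked:
        raise ValueError("empty cohort: a score has no position without a pool")
    below = bisect_left(ranked, score)           # ads strictly below the score
    ties = bisect_right(ranked, score) - below   # ads with exactly this score
    return 100.0 * (below + 0.5 * ties) / len(ranked)


# Hypothetical cohort scores, for illustration only.
cohort = [40, 45, 50, 55, 58, 60, 61, 64, 70, 75]
position = percentile_position(62, cohort)  # 7 of 10 ads score lower -> 70.0
```

The same 62 lands at a different percentile in a different cohort, which is why the cohort definition has to travel with the score.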

This is the part of creative testing that gets quietly skipped in most vendor decks. They show you the score. They do not show you the pool. They do not show you the cohort definition. They do not show you how often the pool is refreshed, who curates it, or how a new ad gets in. The pool is the product. The model is just a function that ranks against it.

"The score is half the answer. The pool is the other half. If a vendor won't show you their pool, they're not selling you a benchmark, they're selling you a vibe."

02, Framing

The credibility pitch line: depth beats volume.

The line we use in rooms with skeptical buyers is this: our pool is 2,047 ads × 43 attributes, which is roughly 88,000 structured data points. Survey-volume competitors will tell you they have "10 million completed interviews." That is a real number. It is also the wrong unit. An interview is a single binary response to a question. A scored attribute is a continuous signal connected to 42 other signals on the same creative.

The original framing we used in our deck was "500 ads × 43 attributes = 21,500 data points." That was true in Q1. As of this guide it understates the pool by a factor of four. Depth (attribute count per creative) is the dimension that matters when you are trying to diagnose a creative rather than count opinions about it.
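
The arithmetic behind both framings, as a quick check. The constant and function names are illustrative; the figures are the ones quoted above.

```python
ATTRIBUTES_PER_AD = 43  # attributes scored per creative, as stated above


def data_points(ads, attributes=ATTRIBUTES_PER_AD):
    """Structured data points = ads in the pool x attributes per ad."""
    return ads * attributes


current = data_points(2047)   # 88,021, i.e. "roughly 88,000"
original = data_points(500)   # 21,500, the Q1 deck figure
growth = current / original   # about 4.09, the "factor of four"
```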

"Survey vendors count interviews. We count structured attributes per ad. Different unit. Different product. Different decision speed."

03, Sizing

How big does a benchmark actually need to be?

There is no universal answer. There is a use-case answer. If you want a holistic creative diagnosis (is this ad in the top third of its category?), you can get a defensible read on a smaller cohort. If you want KPI-level precision, such as a Beat the Skip vs Get Noticed split, you need more. If you want calibration-quality out-of-sample validation that survives a Codex audit, you need real volume.

Use case	Minimum n per cohort	What it lets you say	Confidence
Holistic creative diagnosis	≥ 50	"Top third / middle / bottom third of category"	Directional
KPI-level precision	≥ 200	"Beat the Skip is a 71, strong against category"	Working
OOS calibration	≥ 400	"Held-out Spearman ρ +0.30, p < 0.001"	Defensible
Sub-segment claims (geo, format, brand size)	≥ 100 each	"In Meta-Feed brand ads under €500k spend, this scores…"	Defensible
Cross-platform transfer claims	Don't make them yet	"Our TikTok model predicts YouTube": no, it doesn't; calibrate each platform separately	Off-limits
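
The thresholds in the table can be applied mechanically before a meeting. A minimal sketch: the dictionary and function names are illustrative assumptions, the minimums come from the table, and the three cohort sizes are hypothetical.

```python
# Minimum n per cohort, taken from the table above (ordered smallest first).
MIN_N_PER_COHORT = {
    "holistic creative diagnosis": 50,
    "sub-segment claims": 100,       # per sub-segment
    "KPI-level precision": 200,
    "OOS calibration": 400,
}


def supported_use_cases(cohort_n):
    """Use cases whose minimum cohort size is met by `cohort_n`."""
    return [name for name, min_n in MIN_N_PER_COHORT.items() if cohort_n >= min_n]


# Hypothetical cohort sizes, for illustration only.
thin = supported_use_cases(30)     # [] -> nothing defensible to say yet
medium = supported_use_cases(120)  # diagnosis and sub-segment claims only
large = supported_use_cases(450)   # all four use cases
```

Cross-platform transfer is left out on purpose: the table marks it off-limits at any n.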

A benchmark that brags about size without telling you how the size breaks down by cohort is misleading by omission. Ten million interviews across 200 countries and 60 categories collapse to ~800 interviews per cell, and that is before you slice by media format, brand size, or year. Size at the top is not size where the decision happens.
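
The per-cell figure is simple division, sketched here under the assumption that interviews are spread evenly across cells (real pools are usually lumpier, so many cells are thinner). The function name and the extra four-format slice are hypothetical.

```python
def interviews_per_cell(total, *dimension_sizes):
    """Average records per cell when `total` is split evenly across all dimensions."""
    cells = 1
    for size in dimension_sizes:
        cells *= size
    return total / cells


# 10 million interviews across 200 countries x 60 categories.
per_cell = interviews_per_cell(10_000_000, 200, 60)  # about 833

# Hypothetical further slice by 4 media formats.
per_cell_by_format = interviews_per_cell(10_000_000, 200, 60, 4)  # about 208
```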

04, Provenance

Where the ads come from.

Every ad in the SaliencyLab pool comes from an official public transparency source. We do not scrape. We have never scraped. The legal defensibility of a benchmark does not come from anonymization after the fact; it comes from being able to point at the source, the API, and the terms of service the data was published under.

Meta Ad Library API: official Meta endpoint. Returns live and archived political and commercial ads with media URLs, advertiser identity, spend bands, and impression bands.
TikTok Creative Center: TikTok's own "top ads" portal. Returns ranked-by-performance creatives with CTR percentile, view counts, and engagement.
TikTok Ad Library: separate from Creative Center. Returns the full set of running ads, not just the top-ranked ones.
Google Ads Transparency Center: official Google transparency endpoint. Returns YouTube, Display, and Search ads with advertiser verification.
Manual curation: for categories where APIs are thin (luxury, Moroccan FMCG, regional banks), the curator pulls public ads into Drive and the pipeline scores them. Sourced publicly. Logged. Auditable.

This matters in two rooms. The first is the legal review at an enterprise prospect: they will ask, on contract, where the comparator data was sourced. The second is the journalist who eventually writes the "AI ad scoring is just scraped data" article. We have an answer to both, and the answer is: official APIs, public transparency tools, no T&C violation, no PII, no advertiser proprietary data.

05, Validation

How we validate, and what the numbers actually say.

Validation is the part that separates a benchmark from a claim. Our held-out out-of-sample numbers as of 2026-05-05:

Outcome	Platform	Held-out n	Spearman ρ	Status
Engagement (likes + shares + comments)	TikTok	700	+0.31	Validated
Click intent (CTR percentile band)	TikTok	691 · 5-fold CV	+0.30	Validated
View counts	YouTube	403 · 5-fold CV	+0.32	Validated
Engagement	Meta-Feed	n/a	n/a	Directional only
Pool-wide quintile lift (top vs bottom 20%)	All	1,200+	6.5×	Validated

What that means in plain language: an ad in the top quintile of our predicted score outperforms a bottom-quintile ad on real public engagement by 6.5× on average. The Spearman correlation sits in the +0.30 to +0.32 band, modest in absolute terms, but in the range where Kantar Link AI's own published concurrent-validity chart lives, and meaningfully above zero with held-out evaluation. The scaling curves in docs/assets/validation/ show ρ growing 2-4× as the cohort expanded. We are not at saturation. The next 1,000 ads should push these numbers further before they plateau.
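
For readers who want to see how the two headline statistics are computed, here is a minimal pure-Python sketch. The function names are illustrative, the p-value uses a normal approximation to the t-distribution (an assumption that is reasonable at cohort sizes in the hundreds), and the ten data pairs are toy values rather than benchmark data, so the toy ρ is far higher than the +0.30 band reported above.

```python
import math


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(predicted, observed):
    """Spearman rho = Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(observed))


def quintile_lift(predicted, observed):
    """Mean outcome of the top 20% by predicted score over that of the bottom 20%."""
    paired = sorted(zip(predicted, observed), key=lambda pair: pair[0])
    k = max(1, len(paired) // 5)
    bottom = [outcome for _, outcome in paired[:k]]
    top = [outcome for _, outcome in paired[-k:]]
    return (sum(top) / len(top)) / (sum(bottom) / len(bottom))


def approx_p_value(rho, n):
    """Two-sided p-value from the t-statistic, using a normal approximation."""
    t = rho * math.sqrt((n - 2) / (1 - rho ** 2))
    return math.erfc(abs(t) / math.sqrt(2))


# Toy data, for illustration only (not benchmark data).
predicted = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
observed = [5, 3, 8, 6, 12, 9, 15, 20, 18, 30]

rho = spearman_rho(predicted, observed)
lift = quintile_lift(predicted, observed)   # top-2 mean 24 / bottom-2 mean 4 = 6.0
p = approx_p_value(0.30, 400)               # well below 0.001
```

The last line shows why a modest ρ can still be defensible: at ρ = +0.30 and n = 400 the t-statistic is a little above 6, so the result is very unlikely to be noise even though the correlation itself is moderate.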

06, Defense

Defending a benchmark in a room of buyers.

Four questions show up in every serious meeting. Have the answer ready in one sentence each.

"Is this enough ads?" "Two thousand and forty-seven across ten categories, with 1,200+ carrying public outcome labels for validation. For your category specifically, n is X; let me show you the cohort." Never give the top-line number alone.
"Is my category represented?" Pull up the category breakdown. If the cell is thin, say so. "Food & Beverage is 88 ads, strong. Luxury is 22, directional only. We're actively curating Luxury through Q3." Honest beats impressive.
"What's your validation?" "Held-out Spearman ρ +0.30 to +0.32 across TikTok engagement, TikTok CTR, and YouTube views. Meta is directional. We do not claim to predict sales or ROAS; that is a different construct."
"What about my market?" If they're asking about a market you have not validated, the answer is this: "We have not validated against your market specifically. We can run a pilot. Or you can decide based on the engagement and click-intent signals we have validated globally." Do not improvise a number.

"Buyers don't punish you for honest gaps. They punish you for confident gaps. Say what the number is. Say what cohort it's based on. Say where it stops."

07, Honest caveats

What we don't claim, and won't.

Meta-Feed scores are directional defaults. The outcome data for brand ads on Meta-Feed is sparse in our current cohort. We have no held-out validation for Meta-Feed. Honor this in any pitch: say it before they ask.
We do not predict sales, ROAS, attributed conversion, or brand recall. Engagement and click intent are leading indicators. They are correlated with downstream outcomes. They are not the same construct.
Attribute detection is ~85% accurate, not 100%. A formal inter-rater reliability paper (LLM-as-judge methodology) is in our pipeline; it is not yet published.
Scaling curves are still growing. The 2-4× ρ improvement we have seen as the cohort expanded suggests we are not at saturation. The next batch should improve, not collapse, the validation numbers.
Synthetic Users is scenario simulation, not a panel. The benchmark is for the score. Synthetic Users explains buyer resistance against that score. They are different surfaces.

The construct difference matters most. Live-survey methodologies measure stated recall, stated preference, and stated intent; they have decades of methodology behind them and they are the gold standard for what they measure. SaliencyLab measures behavioral proxies: likes, shares, comments, view counts, click-through percentile bands. Different construct, different unit, different decision tempo. Not a replacement. A complement, used earlier in the creative process, when the cost of being wrong is still low.

Frequently asked.

How many ads should be in a benchmark before I trust it?

Use-case dependent. At least 50 per cohort for directional creative diagnosis, 200+ for KPI-level precision, 400+ for out-of-sample calibration claims. Our active pool is 2,047 ads across ten categories.

Do you scrape Meta and TikTok to build the benchmark?

No. We use the Meta Ad Library API, TikTok Creative Center, TikTok Ad Library, and Google Ads Transparency Center, all of them official public transparency tools. Legal defensibility comes from sourcing, not anonymization.

What does your validation actually show?

Held-out Spearman ρ +0.31 on TikTok engagement (n=700), +0.30 on TikTok CTR (n=691, 5-fold CV), +0.32 on YouTube view counts (n=403, 5-fold CV). Pool-wide quintile lift 6.5×. Meta-Feed is directional defaults only; outcome data is sparse for brand ads.

Can SaliencyLab predict sales or ROAS?

No. We predict engagement and click intent, validated against public outcomes only. Sales, ROAS, attributed conversion, and brand recall are different constructs, measured by different instruments, on different timescales.