Methodology · updated 18 May 2026 · 9 min read

What "benchmarked" really means.

A score with no peer is just a number. This is how the SaliencyLab benchmark is actually built: 2,047 ads, 22 metadata fields per ad, five public sources, immutable pool versions, zero scraping. It is also the line we draw against any vendor with a bigger, flatter pool.

2,047 ads in pool · 22 metadata fields · 43 attributes per ad · 5 public sources · v.2026-05 pinned


Oussama Nakhil · Founder & CEO

Multiple years buyer-side: NielsenIQ insights, then L'Oréal Groupe in global marketing insights, where I managed Kantar as a vendor and learned the unit economics of legacy testing from the inside. Every claim on this page is auditable.

21,500

Data points across the 500-ad attribute core

6.5×

Top quintile vs bottom, pool-wide lift

0

Scraped ads in the benchmark

01 · The argument

A score with no peer is just a number.

Every creative-testing vendor in the market sells you a number between 0 and 100. The question almost no buyer thinks to ask, and the one that decides whether the number is worth paying for, is: compared to what?

"Your ad scored 67" is meaningless unless you know that the median ad in the same category, on the same platform, in the same language, scored 51 and the top decile scored 78. The peer set is what turns the number from a feeling into a decision. The peer set is the product. Everything else (the model, the LLM, the saliency map) is infrastructure that produces a number the benchmark can then make legible.

This page is the answer to the procurement question that comes after the demo: "how is the benchmark actually built, and why should I believe it represents my problem?" The honest answer is detailed and slightly tedious; the dishonest answer is a marketing slide. We have chosen the first one.

Operating principle

"The score is infrastructure. The benchmark is the product. If you cannot defend the peer set in a room, you cannot defend the verdict."

02 · Depth vs width

Depth beats width. Always.

There is a temptation, especially with VC pitches, to brag about raw pool size. "We have 100,000 ads in our benchmark." This sounds impressive until you ask the next question: how many attributes per ad are tagged at the level needed to discriminate between Sharpen and Rebuild?

The decision a buyer needs is not "is this ad good in general." It is "which specific KPI is the one I should re-edit before I spend €40,000 in media." That requires per-ad attribute density: at minimum, hook timing class, brand cue placement, on-screen text density, distinctive-asset cadence, sound-off comprehension, pacing band, dominant emotion, ritual-resonance markers, and CTA placement. Sparse pools collapse to a coarser decision surface.

Our core attribute set is 43 tags per ad on the 500-ad sub-cohort we use for Paper 04's IRR analysis. That gives us 500 × 43 = 21,500 datapoints across the attribute core, more than enough to discriminate between adjacent verdict bands. A 5,000-ad pool with 8 attributes per ad has 40,000 datapoints, but they cannot tell you which KPI to fix.

Depth vs width (illustrative)

Datapoint density (ads × attributes) versus decision granularity. More datapoints do not equal better decisions if attributes are too coarse to discriminate.

Cohort	Ads × attributes	Datapoints
SaliencyLab	500 × 43	21,500
Legacy vendor A	5,000 × 8	40,000*
Legacy vendor B	15,000 × 4	60,000*
Survey panel	200 × 12	2,400

*Asterisked figures are illustrative of public vendor disclosures. They are marked because the headline datapoint count does not translate into per-cut decision specificity at the level RoastIQ delivers.
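The arithmetic behind these datapoint counts can be reproduced in a few lines. This is only a sketch of the multiplication quoted above; the dictionary name and the helper function are illustrative, and the two legacy-vendor rows are the page's own illustrative figures, not measured values.

```python
# Cohort shapes (ads, attributes per ad) as quoted in this section.
# The "legacy vendor" rows are the page's illustrative figures.
cohorts = {
    "SaliencyLab core": (500, 43),
    "Legacy vendor A": (5_000, 8),
    "Legacy vendor B": (15_000, 4),
    "Survey panel": (200, 12),
}


def datapoints(ads: int, attributes: int) -> int:
    """Raw datapoint count: one value per (ad, attribute) pair."""
    return ads * attributes


density = {name: datapoints(*shape) for name, shape in cohorts.items()}

for name, (ads, attrs) in cohorts.items():
    print(f"{name:18s} {ads:>6,} ads x {attrs:>2} attrs = {density[name]:>6,}")
```

The count says nothing about how many distinct attributes describe each ad, which is the quantity the argument above turns on.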

03 · Sources

Five sources. All public.

Every ad in the benchmark comes from one of five sources. Four are official transparency tools or public APIs run by the platforms themselves. The fifth is editorial curation against documented provenance. We do not scrape.

Source	Type	What we ingest	Ads
Meta Ad Library API	Official public API	Brand-ad cohort, advertiser metadata, ad format	~440
TikTok Creative Center	Public top-ads surface	CTR percentile bands, view counts, engagement signals	~700
TikTok Ad Library	Transparency surface	All-ads cohort, regional + language coverage	~310
Google Ads Transparency Center	Public disclosure	YouTube In-Stream + Shorts, advertiser disclosure	~403
Manual editorial curation	Documented provenance	Real cuts, scored through the live pipeline	~194

The line "we do not scrape" is the line that matters in any procurement or legal conversation. Scraping breaks platform terms of service and creates downstream risk for the customer the benchmark serves. Transparency tools exist precisely because the platforms have decided what is legal to access programmatically. We use what is allowed; we do not use what is not. The per-ad audit log (source URL, capture date, licence class) is detailed in the companion data-sourcing defense.
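As a quick consistency check, the per-source counts in the table can be summed and compared with the pool size. The sketch below treats the approximate counts ("~440" and so on) as exact integers, which is an assumption; the variable names are illustrative.

```python
# Per-source ad counts from the table above.
# The page gives them as approximate ("~"); they are treated as exact here.
source_counts = {
    "Meta Ad Library API": 440,
    "TikTok Creative Center": 700,
    "TikTok Ad Library": 310,
    "Google Ads Transparency Center": 403,
    "Manual editorial curation": 194,
}

total = sum(source_counts.values())

# Share of the pool contributed by each source.
shares = {name: count / total for name, count in source_counts.items()}

for name, share in shares.items():
    print(f"{name:32s} {source_counts[name]:>4} ads  {share:6.1%}")
print(f"{'Total':32s} {total:>4} ads")
```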

04 · Schema

Twenty-two fields, per ad.

Every ad in the pool carries 22 metadata fields. The schema is what makes the benchmark queryable in the right way (sliced by category, language, region, format, or sampling band) and gives every report its own peer set rather than the pool's grand average.

Platform · market · language · category (20-class taxonomy) · subcategory · brand · advertiser: the slicing fields that let a beauty ad in French be benchmarked against French beauty ads, not against US auto ads.
Ad format · duration · dominant attribute set · distinctive-asset density: format-level structure used to control for cohort drift when reporting percentiles.
Public engagement signal · CTR percentile band · capture date · source URL: the outcome signals that ground OOS validation, plus the provenance trail.
model_version · benchmark_pool_version · confidence_label · scoring_run_id: the four pins that make every report fully reproducible months after it was generated.
Attribute-detection accuracy class · sampling band · opt-in flag: the honesty layer. Every ad knows how confident the pipeline was in tagging it, where in the sampling distribution it sat, and whether it entered via opt-in or public surface.

The 22-field schema is the reason a Pro user gets a benchmark band that says "78th percentile in Beauty · MENA · TikTok · 9–15s" instead of "78th percentile in the pool." Pool-wide percentile is a vanity number; cohort-controlled percentile is a decision.
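The gap between a pool-wide and a cohort-controlled percentile can be illustrated with a toy pool. Everything in this sketch is hypothetical: the `Ad` record keeps only three of the slicing fields, the scores are made up, and "percentile" is assumed to mean the share of peers scoring strictly below the ad, which may differ from the definition used in production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ad:
    # Hypothetical subset of the 22-field schema, for illustration only.
    score: float
    category: str
    market: str
    platform: str


def percentile_rank(score: float, peers: list[float]) -> float:
    """Percentage of peer scores strictly below `score` (assumed definition)."""
    if not peers:
        raise ValueError("empty peer set")
    below = sum(1 for p in peers if p < score)
    return 100.0 * below / len(peers)


def cohort_percentile(score: float, pool: list[Ad], **filters: str) -> float:
    """Percentile against only the ads that match every slicing filter."""
    peers = [
        ad.score
        for ad in pool
        if all(getattr(ad, field) == value for field, value in filters.items())
    ]
    return percentile_rank(score, peers)


# Made-up pool: four beauty ads and four higher-scoring auto ads.
pool = [
    Ad(40, "Beauty", "MENA", "TikTok"),
    Ad(50, "Beauty", "MENA", "TikTok"),
    Ad(60, "Beauty", "MENA", "TikTok"),
    Ad(70, "Beauty", "MENA", "TikTok"),
    Ad(80, "Auto", "US", "YouTube"),
    Ad(85, "Auto", "US", "YouTube"),
    Ad(90, "Auto", "US", "YouTube"),
    Ad(95, "Auto", "US", "YouTube"),
]

pool_wide = cohort_percentile(67, pool)  # no filters: whole pool
in_cohort = cohort_percentile(
    67, pool, category="Beauty", market="MENA", platform="TikTok"
)
print(pool_wide, in_cohort)  # 37.5 75.0
```

In this toy pool the same score of 67 lands at the 37.5th percentile pool-wide and the 75th within its own cohort, because the unrelated auto ads drag the pool-wide figure down.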

05 · Versioning

Immutable pools, auditable drift.

Every benchmark pool is an immutable snapshot, pinned by version. The current version is v.2026-05. New ads enter the next version, never the current one. Every report stores the benchmark_pool_version it was scored against.

This is the single most underrated piece of infrastructure in the whole product. Without immutable versioning, a "12-point percentile improvement" between a January report and a May report might just be the pool shifting underneath you: your ad did not get better, the peer set got worse (or better, or different). With immutable versioning, the percentile delta is real, because the comparison is against a fixed snapshot.
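One way to picture the pinning rule is a report record that carries its pins and a comparison helper that refuses to subtract percentiles scored against different pools. This is a minimal sketch, not the production data model: the four pin names come from the schema above, while `ad_id`, `percentile`, the helper, and all field values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored report cannot be mutated later
class Report:
    ad_id: str                   # hypothetical identifier
    percentile: float            # hypothetical stored result
    model_version: str           # the four pins named in the schema
    benchmark_pool_version: str
    confidence_label: str
    scoring_run_id: str


def percentile_delta(earlier: Report, later: Report) -> float:
    """Percentile change, defined only when both reports pin the same pool."""
    if earlier.benchmark_pool_version != later.benchmark_pool_version:
        raise ValueError(
            "reports pin different pools: "
            f"{earlier.benchmark_pool_version} vs {later.benchmark_pool_version}"
        )
    return later.percentile - earlier.percentile


# Illustrative values only.
first_cut = Report("ad-001", 55.0, "m-1", "v.2026-05", "high", "run-a")
re_edit = Report("ad-001", 67.0, "m-1", "v.2026-05", "high", "run-b")
old_pool = Report("ad-001", 60.0, "m-1", "v.2026-04", "high", "run-c")

print(percentile_delta(first_cut, re_edit))  # 12.0
```

Raising on a version mismatch is a design choice made for this sketch; a real system might instead re-score the earlier cut against the later pool before comparing.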

Pool version	Frozen on	Ads	Held-out ρ	Quintile lift
v.2025-10	2025-10-15	200	+0.09	2.1×
v.2025-12	2025-12-12	500	+0.16	3.0×
v.2026-02	2026-02-04	900	+0.23	4.2×
v.2026-03	2026-03-08	1,400	+0.27	5.1×
v.2026-04	2026-04-09	1,800	+0.30	6.0×
v.2026-05	2026-05-05	2,047	+0.31	6.5×

The table above is what an "auditable scaling curve" looks like in practice: every version pinned, every headline metric recomputed and published, no silent re-versioning of live reports. A vendor who cannot show you this table is a vendor whose percentile numbers cannot be trusted across time.
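The version table lends itself to a simple check of whether the curve is still climbing. The sketch below copies the published figures and computes the change in held-out ρ per 100 added ads between consecutive versions; the "per 100 ads" normalisation and the function name are illustrative choices, not a metric defined on this page.

```python
# (pool version, ads, held-out rho, quintile lift), copied from the table above.
curve = [
    ("v.2025-10", 200, 0.09, 2.1),
    ("v.2025-12", 500, 0.16, 3.0),
    ("v.2026-02", 900, 0.23, 4.2),
    ("v.2026-03", 1400, 0.27, 5.1),
    ("v.2026-04", 1800, 0.30, 6.0),
    ("v.2026-05", 2047, 0.31, 6.5),
]


def marginal_gains(rows):
    """Change in rho per 100 added ads between consecutive pool versions."""
    gains = []
    for (_, n0, r0, _), (version, n1, r1, _) in zip(rows, rows[1:]):
        gains.append((version, (r1 - r0) / (n1 - n0) * 100))
    return gains


gains = marginal_gains(curve)
for version, gain in gains:
    print(f"{version}: {gain:+.4f} rho per 100 added ads")

# Overall cohort expansion from first to last version.
expansion = curve[-1][1] / curve[0][1]
print(f"cohort expansion: {expansion:.1f}x")
```

A flat curve in the sense used on this page would show these per-step values at or near zero.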

House rule

"Pools are immutable. Reports pin their pool. Drift is auditable, never silent. If the score moves between reports, the reason is in the model_version or the benchmark_pool_version, never in unannounced re-tagging."

06 · Competitive defense

Defending against a bigger pool.

A meaningful share of procurement conversations include a sentence like "another vendor has 50,000 ads in their benchmark." Here is the three-question response that ends that conversation.

Question one: publish your scaling curve. If their cohort grew from 1,000 to 50,000 ads but held-out ρ went from +0.28 to +0.29, the model is memorising the cohort, not learning the construct. Our curve climbed from +0.09 to +0.31 across a 10× cohort expansion. Slope is honesty; flat is not.

Question two: how many attributes per ad are tagged at decision-discriminating depth? If the answer is "we tag 6 attributes" or "we score on a single composite," the pool cannot tell you which KPI to fix. It can only tell you the ad is bad. Our 43-attribute core on the 500-ad sub-cohort exists precisely to support per-KPI diagnosis.

Question three: is the pool immutable per version, or does it shift under live reports? If the vendor cannot point to a pinned benchmark_pool_version on every report, percentile movement between reports is not interpretable. Our pool versions are pinned and the version table is on this page.

The line

"Volume is a marketing slide. The scaling curve, the attribute density, and the immutable versioning are the only three numbers that matter."

07 · Open data

CC-BY-NC. On purpose.

Paper 05 in the SaliencyLab research pipeline (target: Scientific Data) is the dataset paper. The benchmark will be released under Creative Commons BY-NC: attribution is required, and commercial use is permitted only under a separate licence. The validation code lives on GitHub.

The choice of CC-BY-NC over a fully open licence is deliberate. We want academic researchers, marketing-science PhD students, and journal reviewers to be able to re-run our headline numbers without our cooperation. We do not want a competitor wholesale-licensing the dataset to back-fill their own benchmark and rebranding it. Non-commercial is the line that lets us be honest with academia and rigorous with commerce in the same gesture.

The paper publishes the dataset, the schema, the sampling methodology, the licence terms, and the validation code in one citable artifact. After that release, "show me your benchmark" stops being a marketing question for SaliencyLab and becomes a citation.

Dataset. 2,047 ads, 22 metadata fields, 43 attributes on the 500-ad core, scoring outputs per pool version, all OOS validation splits.
Code. GitHub repository with the validation pipeline (Spearman ρ computation, quintile-lift calculation, scaling-curve regeneration) and reproducible notebooks.
Documentation. Per-ad provenance log, the 22-field schema, the licence text, the citation format, the change log between pool versions.

A skeptical researcher should be able to read the paper, clone the repo, point it at the dataset, and reproduce the +0.31 OOS ρ without sending us a single email. That is what "open methodology" actually means.
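For readers who want a feel for the two headline metrics before the repository is available, here is a dependency-free sketch. It is not the repository's code: Spearman ρ is computed the standard way (Pearson correlation of average ranks), while the quintile-lift definition (mean outcome in the top score quintile divided by mean outcome in the bottom quintile) is an assumption, since this page does not spell it out. All function names are illustrative.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def quintile_lift(scores, outcomes):
    """Assumed definition: mean outcome of the top score quintile
    divided by mean outcome of the bottom score quintile."""
    paired = sorted(zip(scores, outcomes))
    k = len(paired) // 5
    if k == 0:
        raise ValueError("need at least 5 ads to form quintiles")
    bottom = mean(o for _, o in paired[:k])
    top = mean(o for _, o in paired[-k:])
    return top / bottom


# Made-up held-out data: model scores and an outcome signal for ten ads.
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
outcomes = [1.0, 1.2, 0.9, 1.5, 1.4, 2.0, 1.8, 2.5, 3.0, 2.8]

print(f"rho  = {spearman_rho(scores, outcomes):+.3f}")
print(f"lift = {quintile_lift(scores, outcomes):.2f}x")
```

Running the same two functions on each pool version's held-out split would regenerate one row of the version table, under the assumed lift definition.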

Questions we keep getting

Asked, plainly.

How big does a creative-testing benchmark actually need to be?

Big enough that your scaling curve is not flat. The honest answer is not "10,000" or "50,000"; it is "enough that adding 200 more ads still moves your held-out Spearman ρ by a measurable amount." Our cohort grew from 200 to 2,047 and ρ climbed from +0.09 to +0.31. We are not at saturation. A vendor whose curve is flat at a higher absolute number is memorising; a vendor whose curve is still climbing is learning.

Why is depth (43 attributes per ad) more important than width (more ads)?

Because the decision a buyer needs is not "is this ad good in general"; it is "which of the five KPIs is the one I should re-edit." That requires per-ad attribute tagging dense enough to discriminate between adjacent verdicts. 500 ads × 43 attributes = 21,500 datapoints. A 5,000-ad pool with only 8 attributes per ad has 40,000 datapoints, but they collapse to a coarser decision surface.

What counts as a "public source"? Where do the 2,047 ads actually come from?

Five sources, all public. Meta Ad Library API (~440 ads). TikTok Creative Center (~700, public top-ads data). TikTok Ad Library (~310, transparency surface). Google Ads Transparency Center (~403, YouTube). Manual editorial curation (~194, real cuts scored through the live pipeline with documented provenance). Customer uploads enter only with explicit opt-in via the Score-My-Ad route.

What is benchmark_pool_version and why does it matter?

Every benchmark pool is an immutable snapshot, pinned by version (current: v.2026-05). New ads enter the next version, never the current one. Every report stores the benchmark_pool_version it was scored against, so a January report and a May report can be compared against the same fixed peer set. Without immutable versioning, a "12-point percentile improvement" might just be the pool shifting underneath you.

You plan to release the dataset under CC-BY-NC. What does that mean?

Paper 05 in our research pipeline (target: Scientific Data) is the dataset paper. CC-BY-NC means anyone can use the benchmark for research and educational purposes with attribution; commercial use requires a licence. The validation code will be on GitHub. The point is that a skeptical researcher should be able to re-run our headline numbers without our cooperation. That is what "reproducible" means in a discipline that has been allergic to reproducibility for forty years.

How do I respond when a vendor says they have 100,000 ads in their benchmark?

Ask three questions. One: what does your scaling curve look like? Does ρ still climb with cohort size, or did it plateau? Two: how many attributes per ad are tagged at the level that lets you discriminate Sharpen from Rebuild on a single cut? Three: is your pool immutable per version, or does it shift under live reports? A vendor who cannot answer those three has volume, not credibility. We can answer all three on this page.
