A score with no peer is just a number.
Every creative-testing vendor in the market sells you a number between 0 and 100. The question almost no buyer thinks to ask, and the one that decides whether the number is worth paying for, is: compared to what?
"Your ad scored 67" is meaningless unless you know that the median ad in the same category, on the same platform, in the same language, scored 51 and the top decile scored 78. The peer set is what turns the number from a feeling into a decision. The peer set is the product. Everything else, the model, the LLM, the saliency map, is infrastructure that produces a number the benchmark can then make legible.
This page is the answer to the procurement question that comes after the demo: "how is the benchmark actually built, and why should I believe it represents my problem?" The honest answer is detailed and slightly tedious; the dishonest answer is a marketing slide. We have chosen the first one.
"The score is infrastructure. The benchmark is the product. If you cannot defend the peer set in a room, you cannot defend the verdict."
Depth beats width. Always.
There is a temptation, especially in VC pitches, to brag about raw pool size. "We have 100,000 ads in our benchmark." This sounds impressive until you ask the next question: how many attributes per ad are tagged at the level needed to discriminate between a Sharpen verdict and a Rebuild verdict?
The decision a buyer needs is not "is this ad good in general." It is "which specific KPI should I fix before I spend €40,000 on media." That requires per-ad attribute density: hook timing class, brand cue placement, on-screen text density, distinctive-asset cadence, sound-off comprehension, pacing band, dominant emotion, ritual-resonance markers, and CTA placement, at minimum. Sparse pools collapse to a coarser decision surface.
Our core attribute set is 43 tags per ad on the 500-ad sub-cohort we use for Paper 04's IRR analysis. That gives us 21,500 datapoints across the attribute core, more than enough to discriminate between adjacent verdict bands. A 5,000-ad pool with 8 attributes per ad has 40,000 datapoints, nearly twice as many, but with only 8 dimensions per ad they cannot tell you which KPI to fix.
Figure: Depth vs width, illustrative.
*Asterisked figures are illustrative of public vendor disclosures; the orange bars are dimmed to mark that the headline datapoint count does not translate into per-cut decision specificity at the level RoastIQ delivers.
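The depth-versus-width arithmetic can be checked in a few lines. This is only a sketch of the counting; the function name `datapoints` is ours for illustration, and the inputs are the figures quoted in the text.

```python
# Illustrative arithmetic only: pool sizes and tag counts are the figures quoted in the text.

def datapoints(ads: int, attributes_per_ad: int) -> int:
    """Total tagged datapoints in a pool: one value per ad per attribute."""
    return ads * attributes_per_ad


deep_core = datapoints(ads=500, attributes_per_ad=43)
wide_pool = datapoints(ads=5_000, attributes_per_ad=8)

print(deep_core)              # 21500
print(wide_pool)              # 40000
print(wide_pool > deep_core)  # True: the wide pool wins on raw count
print(43 / 8)                 # 5.375: the deep core carries over five times the dimensions per ad
```

The raw count favours the wide pool; the per-ad dimension count, which is what a per-KPI diagnosis draws on, favours the deep core.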
Five sources. All public.
Every ad in the benchmark comes from one of five sources. Four are official transparency tools or public APIs run by the platforms themselves. The fifth is editorial curation against documented provenance. We do not scrape.
| Source | Type | What we ingest | Ads |
|---|---|---|---|
| Meta Ad Library API | Official public API | Brand-ad cohort, advertiser metadata, ad format | ~440 |
| TikTok Creative Center | Public top-ads surface | CTR percentile bands, view counts, engagement signals | ~700 |
| TikTok Ad Library | Transparency surface | All-ads cohort, regional + language coverage | ~310 |
| Google Ads Transparency Center | Public disclosure | YouTube In-Stream + Shorts, advertiser disclosure | ~403 |
| Manual editorial curation | Documented provenance | Real cuts, scored through the live pipeline | ~194 |
The line "we do not scrape" is the line that matters in any procurement or legal conversation. Scraping breaks platform terms of service and creates downstream risk for the customer the benchmark serves. Transparency tools exist precisely because the platforms have decided what is legal to access programmatically. We use what is allowed; we do not use what is not. The per-ad audit log (source URL, capture date, licence class) is detailed in the companion data-sourcing defense.
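As a rough illustration of what one audit-log entry might look like, here is a minimal sketch. The class name `ProvenanceEntry`, the field names, the example URL, and the licence-class string are all hypothetical; only the three logged items and the five source names come from the text.

```python
from dataclasses import dataclass
from datetime import date

# The five sources named in the table; anything else is rejected.
ALLOWED_SOURCES = frozenset({
    "Meta Ad Library API",
    "TikTok Creative Center",
    "TikTok Ad Library",
    "Google Ads Transparency Center",
    "Manual editorial curation",
})


@dataclass(frozen=True)
class ProvenanceEntry:
    """One hypothetical audit-log row: where an ad came from and when."""
    source: str
    source_url: str
    capture_date: date
    licence_class: str  # placeholder values; the real classes are not listed here

    def __post_init__(self) -> None:
        # Validation only; a frozen dataclass cannot be mutated after creation.
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")
        if not self.source_url.startswith("https://"):
            raise ValueError("source_url must be an https URL")


entry = ProvenanceEntry(
    source="Meta Ad Library API",
    source_url="https://example.com/ad/123",  # hypothetical URL
    capture_date=date(2026, 5, 1),            # hypothetical date
    licence_class="example-class",            # hypothetical value
)
```

Rejecting unknown sources at construction time is one way to make "we do not scrape" a property of the data rather than a policy statement.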
Twenty-two fields, per ad.
Every ad in the pool carries 22 metadata fields. The schema is what makes the benchmark queryable in the right way: it can be sliced by category, language, region, format, or sampling band, so every report gets its own peer set rather than the pool's grand average.
- Platform · market · language · category (20-class taxonomy) · subcategory · brand · advertiser: the slicing fields that let a beauty ad in French be benchmarked against French beauty ads, not against US auto ads.
- Ad format · duration · dominant attribute set · distinctive-asset density: format-level structure used to control for cohort drift when reporting percentiles.
- Public engagement signal · CTR percentile band · capture date · source URL: the outcome signals that ground OOS validation, plus the provenance trail.
- model_version · benchmark_pool_version · confidence_label · scoring_run_id: the four pins that make every report fully reproducible months after it was generated.
- Attribute-detection accuracy class · sampling band · opt-in flag: the honesty layer. Every ad knows how confident the pipeline was in tagging it, where in the sampling distribution it sat, and whether it entered via opt-in or public surface.
The 22-field schema is the reason a Pro user gets a benchmark band that says "78th percentile in Beauty · MENA · TikTok · 9–15s" instead of "78th percentile in the pool." Pool-wide percentile is a vanity number; cohort-controlled percentile is a decision.
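The difference between a pool-wide and a cohort-controlled percentile can be sketched as follows. This is a toy illustration, not the production scorer: the `Ad` class, its field names (including `duration_band`), the percentile definition (share of peers at or below the score), and all scores are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Ad:
    """Hypothetical slice of the schema: a score plus four slicing fields."""
    score: float
    category: str
    market: str
    platform: str
    duration_band: str


def percentile_rank(score: float, peers: Sequence[float]) -> float:
    """Percent of peer scores at or below `score` (one common definition)."""
    if not peers:
        raise ValueError("empty peer set: no percentile can be reported")
    return 100.0 * sum(s <= score for s in peers) / len(peers)


def cohort_percentile(score: float, pool: List[Ad], **filters: str) -> float:
    """Percentile against only the ads matching every filter field."""
    peers = [
        ad.score for ad in pool
        if all(getattr(ad, field) == value for field, value in filters.items())
    ]
    return percentile_rank(score, peers)


# Toy pool: five beauty ads and five auto ads (invented scores).
pool = (
    [Ad(s, "Beauty", "MENA", "TikTok", "9-15s") for s in (40, 50, 55, 60, 70)]
    + [Ad(s, "Auto", "US", "YouTube", "16-30s") for s in (75, 80, 85, 90, 95)]
)

pool_wide = cohort_percentile(67, pool)  # no filters: the whole pool
in_cohort = cohort_percentile(
    67, pool, category="Beauty", market="MENA",
    platform="TikTok", duration_band="9-15s",
)
print(pool_wide, in_cohort)  # 40.0 80.0
```

The same score of 67 lands at the 40th percentile pool-wide and the 80th within its cohort, which is the gap the schema exists to expose.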
Immutable pools, auditable drift.
Every benchmark pool is an immutable snapshot, pinned by version. The current version is v.2026-05. New ads enter the next version, never the current one. Every report stores the benchmark_pool_version it was scored against.
This is the single most underrated piece of infrastructure in the whole product. Without immutable versioning, a "12-point percentile improvement" between a January report and a May report might just be the pool shifting underneath you: your ad did not get better, the peer set got worse (or better, or different). With immutable versioning, the percentile delta is real, because the comparison is against a fixed snapshot.
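A toy sketch of why the pin matters: the same ad score yields different percentiles against two different snapshots, so a report has to record which one it used. The class names, the version labels `v.toy-1` and `v.toy-2`, and all scores are invented for illustration and do not describe the real pools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSnapshot:
    """An immutable pool: frozen dataclass holding a tuple of peer scores."""
    version: str
    scores: tuple


@dataclass(frozen=True)
class Report:
    """A report pins the pool version it was scored against."""
    ad_score: float
    benchmark_pool_version: str


# Toy registry of frozen snapshots (invented data).
POOLS = {
    "v.toy-1": PoolSnapshot("v.toy-1", (30, 40, 50, 70, 80)),
    "v.toy-2": PoolSnapshot("v.toy-2", (50, 55, 65, 70, 80)),
}


def percentile_for(report: Report) -> float:
    """Percent of the pinned snapshot's scores at or below the ad's score."""
    peers = POOLS[report.benchmark_pool_version].scores
    return 100.0 * sum(s <= report.ad_score for s in peers) / len(peers)


# Identical ad score, different pinned pools.
early = Report(ad_score=60, benchmark_pool_version="v.toy-1")
later = Report(ad_score=60, benchmark_pool_version="v.toy-2")
print(percentile_for(early), percentile_for(later))  # 60.0 40.0
```

The 20-point gap here comes entirely from the pool, not the ad. Because each report carries its `benchmark_pool_version`, that cause is visible, and the ad can be re-scored against a single snapshot when a like-for-like comparison is needed.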
| Pool version | Frozen on | Ads | Held-out ρ | Quintile lift |
|---|---|---|---|---|
| v.2025-10 | 2025-10-15 | 200 | +0.09 | 2.1× |
| v.2025-12 | 2025-12-12 | 500 | +0.16 | 3.0× |
| v.2026-02 | 2026-02-04 | 900 | +0.23 | 4.2× |
| v.2026-03 | 2026-03-08 | 1,400 | +0.27 | 5.1× |
| v.2026-04 | 2026-04-09 | 1,800 | +0.30 | 6.0× |
| v.2026-05 | 2026-05-05 | 2,047 | +0.31 | 6.5× |
The table above is what an "auditable scaling curve" looks like in practice: every version pinned, every headline metric recomputed and published, no silent re-versioning of live reports. A vendor who cannot show you this table is a vendor whose percentile numbers cannot be trusted across time.
"Pools are immutable. Reports pin their pool. Drift is auditable, never silent. If the score moves between reports, the reason is in the model_version or the benchmark_pool_version, never in unannounced re-tagging."
Defending against a bigger pool.
A meaningful share of procurement conversations includes a sentence like "another vendor has 50,000 ads in their benchmark." Here is the three-question response that ends that conversation.
Question one: publish your scaling curve. If their cohort grew from 1,000 to 50,000 ads but held-out ρ went from +0.28 to +0.29, the model is memorising the cohort, not learning the construct. Our curve climbed from +0.09 to +0.31 across a 10× cohort expansion. Slope is honesty; flat is not.
Question two: how many attributes per ad are tagged at decision-discriminating depth? If the answer is "we tag 6 attributes" or "we score on a single composite," the pool cannot tell you which KPI to fix. It can only tell you the ad is bad. Our 43-attribute core on the 500-ad sub-cohort exists precisely to support per-KPI diagnosis.
Question three: is the pool immutable per version, or does it shift under live reports? If the vendor cannot point to a pinned benchmark_pool_version on every report, percentile movement between reports is not interpretable. Our pool versions are pinned and the version table is on this page.
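The slope comparison in question one can be made concrete with a small calculation. Expressing the curve as "gain in held-out ρ per doubling of the cohort" is our choice of summary for this sketch, not a metric defined in the text; the inputs are the first and last rows of the version table and the hypothetical flat vendor from question one.

```python
import math


def rho_gain_per_doubling(ads_start: int, ads_end: int,
                          rho_start: float, rho_end: float) -> float:
    """Change in held-out rho divided by the number of cohort doublings."""
    doublings = math.log2(ads_end / ads_start)
    return (rho_end - rho_start) / doublings


# First and last rows of the version table: 200 -> 2,047 ads, +0.09 -> +0.31.
ours = rho_gain_per_doubling(200, 2047, 0.09, 0.31)

# The flat curve from question one: 1,000 -> 50,000 ads, +0.28 -> +0.29.
flat = rho_gain_per_doubling(1_000, 50_000, 0.28, 0.29)

print(round(ours, 3), round(flat, 3))  # roughly 0.066 vs 0.002 per doubling
```

On this summary the rising curve gains more than thirty times as much per doubling as the flat one. A per-doubling figure is a coarse summary, so the full version table remains the thing to ask for.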
"Volume is a marketing slide. The scaling curve, the attribute density, and the immutable versioning are the only three numbers that matter."
CC-BY-NC. On purpose.
Paper 05 in the SaliencyLab research pipeline (target: Scientific Data) is the dataset paper. The benchmark will be released under Creative Commons BY-NC: attribution is required, and use is non-commercial only unless separately licensed. The validation code lives on GitHub.
The choice of CC-BY-NC over a fully open licence is deliberate. We want academic researchers, marketing-science PhD students, and journal reviewers to be able to re-run our headline numbers without our cooperation. We do not want a competitor wholesale-licensing the dataset to back-fill their own benchmark and rebranding it. Non-commercial is the line that lets us be honest with academia and rigorous with commerce in the same gesture.
The paper publishes the dataset, the schema, the sampling methodology, the licence terms, and the validation code in one citable artifact. After that release, "show me your benchmark" stops being a marketing question for SaliencyLab; it becomes a citation.
- Dataset. 2,047 ads, 22 metadata fields, 43 attributes on the 500-ad core, scoring outputs per pool version, all OOS validation splits.
- Code. GitHub repository with the validation pipeline (Spearman ρ computation, quintile-lift calculation, scaling-curve regeneration) and reproducible notebooks.
- Documentation. Per-ad provenance log, the 22-field schema, the licence text, the citation format, the change log between pool versions.
A skeptical researcher should be able to read the paper, clone the repo, point it at the dataset, and reproduce the +0.31 OOS ρ without sending us a single email. That is what "open methodology" actually means.
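For readers who want a feel for the two headline metrics before the repository is available, here is a dependency-free sketch. It is not the released validation code. Spearman ρ is computed the standard way (Pearson correlation on average ranks); the quintile-lift definition used here, mean outcome in the top score quintile divided by mean outcome in the bottom quintile, is an assumption, and the scores and outcomes are invented.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def quintile_lift(scores: Sequence[float], outcomes: Sequence[float]) -> float:
    """ASSUMED definition: mean outcome of top score quintile / bottom quintile."""
    paired = sorted(zip(scores, outcomes), key=lambda p: p[0])
    k = len(paired) // 5
    if k == 0:
        raise ValueError("need at least five ads to form quintiles")
    bottom = sum(o for _, o in paired[:k]) / k
    top = sum(o for _, o in paired[-k:]) / k
    return top / bottom  # raises ZeroDivisionError if the bottom mean is zero


# Invented held-out data: model scores and an outcome signal for ten ads.
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
outcomes = [1.0, 2.0, 1.5, 3.0, 2.5, 4.0, 3.5, 5.0, 6.0, 8.0]

rho = spearman_rho(scores, outcomes)
lift = quintile_lift(scores, outcomes)
print(round(rho, 3), round(lift, 2))
```

In practice a library routine such as SciPy's `spearmanr` would replace the hand-rolled version; it is written out here only so the sketch runs without installing anything.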