A notebook, not a brochure.
Every number on this page is reproducible. Every claim is bounded. The point of this document is not to convince you SaliencyLab is good; it is to give you enough scaffolding to decide for yourself.
The market for creative testing is full of vendors who quote a single accuracy figure and refuse to show the cohort behind it. We do the opposite. The pages below set out, in order: what construct we measure, how we measure it, against what outcomes we validate, how the benchmark is built, how the score scales as the pool grows, and where the model is still weak.
If you are reading this as a buyer, jump to chapter 03. If you are reading as a researcher, the methodology starts at chapter 04 and the open-data plan at chapter 10. If you are reading as a competitor, welcome: the scaling curve is the part you want.
"Publish the ceiling, not the wins. Buyers can forgive a known limit; they will not forgive a hidden one."
What we predict. And what we don't.
SaliencyLab predicts pre-spend creative quality as a signal of two specific downstream behaviours: engagement (likes, shares, comments, view-through) and click intent (CTR percentile). We do not predict the third leg, in-market business outcomes, and we will not pretend otherwise.
| Construct | Measured by | Validated against | Status |
|---|---|---|---|
| Engagement | 5 KPI composite | TikTok likes/shares/comments, YouTube view counts | Public OOS |
| Click intent | CTR sub-model | TikTok CTR percentile bands | Public OOS |
| Visual attention | Saliency heatmap | TranSalNet predictions (not measured eye-tracking) | Directional |
| Sales lift / ROAS | Out of scope | | |
| Brand recall (survey) | Out of scope | | |
| Meta-Feed performance | 5 KPI composite | Brand-ad cohort sparse | Directional |
The split matters because the cost of confusing it is real. A creative-quality model that quietly takes credit for media-spend outcomes ships overclaims; a brand-recall benchmark that quietly takes credit for engagement performance fails the next post-mortem. The boundary above is the boundary we hold.
Held-out, out of sample, publicly reproducible.
The validation numbers below are from the 2026-05-05 run. Held-out means the ads were not in the training distribution; out-of-sample means the outcome data was unseen at scoring time; reproducible means the public outcome source is in the methodology appendix and you can pull it yourself.
// Held-out OOS Spearman ρ, May 2026
For context: a senior creative strategist asked to rank 50 ads against public outcomes lands around ρ +0.25–0.35 in our internal comparison study, with a turnaround time measured in days. The model lands in the same band in 90 seconds at €0.005 per ad. The point of automation here is not to beat a human; it is to deliver the human's ceiling at machine cost and speed.
Methodology note. Spearman ρ (rather than Pearson r) because we care about rank order, not absolute scale; the downstream decision is bucketed (Scale / Sharpen / Rebuild), and rank stability is what makes a verdict robust to small absolute-score drift between model versions.
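To make the rank-order argument concrete, here is a minimal sketch in plain Python. The ad scores and engagement counts are invented for illustration and are not taken from the benchmark. The second model version applies a monotone, non-linear rescaling to the first as a stand-in for score drift. Spearman ρ is computed as Pearson r on average ranks, which is the standard definition.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson r: covariance divided by the product of standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho is Pearson r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative data only: five ads, heavy-tailed public engagement.
outcomes = [120, 340, 560, 900, 15000]
scores_v1 = [41.0, 48.0, 55.0, 63.0, 70.0]
# Hypothetical "next model version": same ordering, different absolute scale.
scores_v2 = [s ** 2 / 50 for s in scores_v1]

rho_v1 = spearman(scores_v1, outcomes)
rho_v2 = spearman(scores_v2, outcomes)
r_v1 = pearson(scores_v1, outcomes)
r_v2 = pearson(scores_v2, outcomes)
```

In this toy case ρ is identical for both versions because the ordering of the ads did not change, while Pearson r moves with the rescaling and is also pulled around by the single viral outcome. A bucketed verdict depends on the ordering, which is why the rank statistic is the one to track.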
Five extractors, one composite.
The score is not a single LLM call. It is a five-stage pipeline where each stage's output is a typed, Zod-validated record stored against a pinned model version and benchmark pool version. Every report is reproducible from those two pins alone.
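As an illustration of what pinning means in practice, the sketch below models a stage record in Python rather than Zod. The class name, the stage names, the version strings and the `payload` field are all hypothetical; only `model_version` and `benchmark_pool_version` come from the text above. The sketch rejects unpinned records and checks that every record in a report shares one pair of pins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageRecord:
    """One stage's output, pinned to the versions that produced it (hypothetical shape)."""
    stage: str
    model_version: str
    benchmark_pool_version: str
    payload: dict

    def __post_init__(self):
        # An unpinned record cannot be reproduced, so refuse to construct it.
        for name in ("stage", "model_version", "benchmark_pool_version"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")
        if not isinstance(self.payload, dict):
            raise TypeError("payload must be a dict")

    def pins(self):
        """The (model, pool) pair that identifies how this record was produced."""
        return (self.model_version, self.benchmark_pool_version)


def same_pins(records):
    """True when every record in a report shares one (model, pool) pin pair."""
    return len({record.pins() for record in records}) == 1


# Hypothetical stage names and version strings, for illustration only.
report = [
    StageRecord("stage_1", "model-v1", "pool-v1", {"score": 61.0}),
    StageRecord("stage_2", "model-v1", "pool-v1", {"score": 48.5}),
]
report_is_consistent = same_pins(report)
```

A report whose stages were scored under mixed pins would fail `same_pins` and could not be reproduced from two pins alone.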
The composite weights, and why they are not opinions.
Composite = Beat the Skip 25% · Get Noticed 20% · Brand Impact 20% · Sell Proposition 20% · Build Brand 15%. The weights were calibrated against held-out outcome data, not chosen by committee. Beat the Skip carries the most weight because on TikTok and Reels an ad that loses the first 2 seconds loses the entire impression, and every downstream KPI is wasted compute. Build Brand carries the least because its payoff is multi-exposure and harder to attribute at the single-ad level; in solo-ad mode we flag it as directional.
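The weighting above is a plain weighted sum. The sketch below assumes each KPI is already scored on a common 0–100 scale, which the text does not state, and uses invented KPI values; the dictionary keys are illustrative names, not the pipeline's actual identifiers.

```python
# Weights as stated in the text; the snake_case keys are hypothetical names.
WEIGHTS = {
    "beat_the_skip": 0.25,
    "get_noticed": 0.20,
    "brand_impact": 0.20,
    "sell_proposition": 0.20,
    "build_brand": 0.15,
}


def composite(kpis):
    """Weighted sum of the five KPI scores (assumed to share a 0-100 scale)."""
    missing = set(WEIGHTS) - set(kpis)
    if missing:
        raise KeyError(f"missing KPI scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * kpis[name] for name in WEIGHTS)


# Invented scores for one ad.
example = {
    "beat_the_skip": 80,
    "get_noticed": 60,
    "brand_impact": 70,
    "sell_proposition": 50,
    "build_brand": 40,
}
score = composite(example)  # 20 + 12 + 14 + 10 + 6 = 62

# The same 20-point drop costs more on Beat the Skip than on Build Brand.
weaker_hook = composite({**example, "beat_the_skip": 60})   # 5 points lower
weaker_brand = composite({**example, "build_brand": 20})    # 3 points lower
```

The last two lines show the practical effect of the weighting: losing 20 points on the opening seconds moves the composite by 5, while losing 20 points on Build Brand moves it by 3.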
// Verdict mapping (frozen)
A benchmark you can defend in a room.
A score with no peer is just a number. The benchmark is the part that makes the score interpretable, and the part we get challenged on most. Here is exactly how it is built.
| Source | License | How we use it | Ads |
|---|---|---|---|
| Meta Ad Library API | Official public API | Brand-ad cohort, attribute metadata | ~440 |
| TikTok Creative Center | Public top-ads data | CTR percentile, views, engagement | ~700 |
| TikTok Ad Library | Public transparency | All-ads cohort, regional coverage | ~310 |
| Google Ads Transparency | Public transparency | YouTube In-Stream + Shorts, advertiser metadata | ~403 |
| Manual curation | Editorial | Real cuts scored through the live pipeline | ~194 |
We do not scrape. Every source above is an official transparency tool or public API; the licence terms are documented in the data-sourcing appendix. No customer-uploaded ad enters the benchmark without explicit opt-in.
Each ad in the pool carries 22 metadata fields: platform, market, language, category (20), subcategory, brand, advertiser, ad format, duration, dominant attribute set, distinctive-asset density, public engagement signal, CTR band, capture date, source URL, model_version, benchmark_pool_version, confidence_label, scoring_run_id, attribute-detection accuracy class, sampling band, opt-in flag. 500 ads × 43 attributes = 21,500 data points, the line we use when a vendor with a bigger but flatter pool tries to argue volume.
The curve is not flat.
The single most useful chart we own is the one nobody publishes: the scaling curve. As the benchmark cohort grew from 200 to 2,000 ads, the held-out ρ on each platform grew 2–4×. We are not at saturation.
// Held-out OOS ρ over time
The shape says two things at once. First, the model is not memorising the cohort; if it were, ρ would plateau by 800 ads. Second, there is more performance to extract; the next 2,000 ads should land us north of ρ +0.40, which is the upper end of human-strategist agreement in our internal comparison.
Top quintile, 6.5× the bottom.
Spearman ρ is the right metric for a methods audience. For a buyer, this chart is the one that lands: rank the entire pool by predicted score, split into quintiles, and measure the actual public engagement of each bucket. The top 20% outperform the bottom 20% by 6.5×.
Read this as the answer to the procurement question "if I only listened to your highest-scored 20% of cuts, how much better would my engagement be?" The answer is 6.5× the engagement of the cuts you would have shipped if you only listened to your lowest 20%. The number that matters in a creative testing budget defence is rarely a correlation coefficient; it is this multiplier.
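The quintile procedure can be sketched in a few lines. This version assumes the per-bucket figure is the mean engagement; the text does not say whether mean, median or total is used. The ten data points are invented, so the lift it produces is not the benchmark's 6.5×.

```python
from statistics import mean


def quintile_lift(predicted, actual):
    """Mean actual engagement of the top-scored 20% divided by that of the bottom 20%."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    n = len(predicted)
    if n < 5:
        raise ValueError("need at least five ads to form quintiles")
    # Rank ad indices from highest to lowest predicted score.
    order = sorted(range(n), key=lambda i: predicted[i], reverse=True)
    size = n // 5
    top = [actual[i] for i in order[:size]]
    bottom = [actual[i] for i in order[-size:]]
    bottom_mean = mean(bottom)
    if bottom_mean == 0:
        raise ZeroDivisionError("bottom quintile has zero engagement")
    return mean(top) / bottom_mean


# Invented pool of ten ads: predicted score and observed public engagement.
predicted = [90, 80, 70, 60, 50, 40, 30, 20, 10, 5]
actual = [900, 700, 500, 450, 400, 300, 250, 220, 150, 250]

lift = quintile_lift(predicted, actual)  # (900 + 700) / 2 divided by (150 + 250) / 2
```

With these toy numbers the top two ads average 800 and the bottom two average 200, so the lift is 4.0. The ratio uses only the two extreme buckets, which is why it reads more directly than a correlation in a budget discussion.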
The second tool, never the first.
RoastIQ scores the ad. Synthetic Users tells you which buyer is rejecting it. The two belong in a ladder, not a menu, and we enforce that at the data layer: every synthetic_panel_runs record carries a from_roastiq_run_id. No score, no panel.
- The panel runs against 36 personas, a calibrated set with cultural, demographic and category-purchase priors stored in the directory.
- Personas are scenario-grounded synthetic buyers, not a real consumer panel. We do not claim concurrent validity with survey panels; we claim better-than-strategist diagnostic specificity at machine cost.
- The deliverable is a buyer-resistance breakdown: which persona rejected the cut, which KPI drove the rejection, what edit they would accept.
- The panel inherits the decision trace from the RoastIQ run, so the diagnosis is specific to this ad, not generic to the category.
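The "no score, no panel" rule is a referential constraint. The sketch below is an in-memory stand-in, not the production data layer: only `synthetic_panel_runs` and `from_roastiq_run_id` come from the text, and every class name, method name and ID is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoastIQRun:
    run_id: str
    composite: float


@dataclass(frozen=True)
class SyntheticPanelRun:
    panel_run_id: str
    from_roastiq_run_id: str  # the from-link back to the scoring run


class PanelStore:
    """In-memory stand-in for a store holding scores and synthetic_panel_runs."""

    def __init__(self):
        self._scores = {}
        self._panels = {}

    def save_score(self, run):
        self._scores[run.run_id] = run

    def create_panel(self, panel_run_id, from_roastiq_run_id):
        # Refuse to create a panel whose from-link points at no stored score.
        if from_roastiq_run_id not in self._scores:
            raise LookupError(f"no RoastIQ run {from_roastiq_run_id!r}: no score, no panel")
        panel = SyntheticPanelRun(panel_run_id, from_roastiq_run_id)
        self._panels[panel_run_id] = panel
        return panel


store = PanelStore()
store.save_score(RoastIQRun(run_id="run-001", composite=62.0))
panel = store.create_panel("panel-001", from_roastiq_run_id="run-001")
```

In a relational database the same guarantee would usually be a non-null foreign key from the panel table to the scoring-run table, so the check lives in the schema and does not depend on application code.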
"Synthetic Users never runs without a RoastIQ result. The from-link is what makes the panel's diagnosis a diagnosis."
Five papers in flight.
The benchmark is also a research instrument. The publication pipeline is sequenced for credibility, not vanity. We will publish the misses alongside the hits.
Open data, open methodology, open ceiling.
Three commitments. We will publish the benchmark dataset under CC-BY-NC. We will publish the validation code on GitHub. We will publish the model versions, the benchmark pool versions, and the confidence labels on every report we hand to a buyer.
- Reproducible per report. Every report stores its model_version, benchmark_pool_version, confidence_labels, decision_trace and recommendations. You can re-score the same asset months later and compare drift.
- Reproducible at the pool level. The benchmark pool versions are immutable snapshots; new ads enter the next version, never the current one. Quintile lift and ρ are recomputed and published per version.
- Reproducible by anyone. The data sources are public, the code is on GitHub, the dataset is CC-BY-NC. A skeptical researcher can re-run the validation without our cooperation.
"This is not a benchmark with proprietary outcome data we refuse to show. The whole point is the opposite, the credibility of the model lives in its reproducibility, not in a vendor's promise."