Lab notebook · updated 18 May 2026 · 16 min read

The lab notebook behind RoastIQ.

What we measure, how we validate it, where the ceiling sits, and what is still wrong. The same document we hand to a procurement team that wants to compare us to Kantar Link AI, and the same document the PhD co-author we publish with reads before signing her name.

2,047 ads in pipeline · 1,200+ with public outcomes · ρ +0.31 held-out OOS · 6.5× quintile lift · 5 papers in pipeline
Oussama Nakhil · Founder & CEO
Multiple years buyer-side: NielsenIQ insights, then L'Oréal Groupe in global marketing insights, where I managed Kantar as a vendor and learned the unit economics of legacy testing from the inside.
+0.32 · YouTube view-count Spearman ρ (5-fold CV, n=403)
+0.31 · TikTok engagement ρ (n=700)
+0.30 · TikTok CTR ρ (5-fold CV, n=691)

A notebook, not a brochure.

Every number on this page is reproducible. Every claim is bounded. The point of this document is not to convince you SaliencyLab is good; it is to give you enough scaffolding to decide for yourself.

The market for creative testing is full of vendors who quote a single accuracy figure and refuse to show the cohort behind it. We do the opposite. The pages below set out, in order: what construct we measure, how we measure it, against what outcomes we validate, how the benchmark is built, how the score scales as the pool grows, and where the model is still weak.

If you are reading this as a buyer, jump to chapter 03. If you are reading as a researcher, the methodology starts at chapter 04 and the open-data plan at chapter 10. If you are reading as a competitor, welcome: the scaling curve is the part you want.

Operating principle

"Publish the ceiling, not the wins. Buyers can forgive a known limit; they will not forgive a hidden one."

What we predict. And what we don't.

SaliencyLab predicts pre-spend creative quality as a signal of two specific downstream behaviours: engagement (likes, shares, comments, view-through) and click intent (CTR percentile). We do not predict the third leg, in-market business outcomes, and we will not pretend otherwise.

| Construct | Measured by | Validated against | Status |
|---|---|---|---|
| Engagement | 5 KPI composite | TikTok likes/shares/comments, YouTube view counts | Public OOS |
| Click intent | CTR sub-model | TikTok CTR percentile bands | Public OOS |
| Visual attention | Saliency heatmap | TranSalNet predictions (not measured eye-tracking) | Directional |
| Sales lift / ROAS | - | - | Out of scope |
| Brand recall (survey) | - | - | Out of scope |
| Meta-Feed performance | 5 KPI composite | Brand-ad cohort sparse | Directional |

The split matters because the cost of confusing it is real. A creative-quality model that quietly takes credit for media-spend outcomes ships overclaims; a brand-recall benchmark that quietly takes credit for engagement performance fails the next post-mortem. The boundary above is the boundary we hold.

Held-out, out of sample, publicly reproducible.

The validation numbers below are from the 2026-05-05 run. Held-out means the ads were not in the training distribution; out-of-sample means the outcome data was unseen at scoring time; reproducible means the public outcome source is in the methodology appendix and you can pull it yourself.

// Held-out OOS Spearman ρ, May 2026

Higher = our score rank-orders ads the same way real-world public outcomes do. Ceiling is fundamentally bounded by reach × format × audience noise, not by model capacity.

  • TikTok engagement: +0.31
  • TikTok CTR (5-fold CV): +0.30
  • YouTube view counts (5-fold CV): +0.32
  • Meta-Feed (brand): n/a

For context: a senior creative strategist asked to rank 50 ads against public outcomes lands around ρ +0.25–0.35 in our internal comparison study, with a turnaround time measured in days. The model lands in the same band in 90 seconds at €0.005 per ad. The point of automation here is not to beat a human; it is to deliver the human's ceiling at machine cost and speed.

Methodology note. Spearman ρ (rather than Pearson r) because we care about rank order, not absolute scale; the downstream decision is bucketed (Scale / Sharpen / Rebuild), and rank stability is what makes a verdict robust to small absolute-score drift between model versions.
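
A minimal sketch of that distinction, using only the Python standard library. The scores and view counts are made-up illustration values, not benchmark data, and the helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical. It assumes ties get the mean of their rank positions, which is the usual convention but is not stated on this page.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson r: sensitive to absolute scale and to outliers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustration only: predicted scores and heavy-tailed public view counts.
scores = [42, 55, 61, 70, 78, 90]
views = [1_000, 1_200, 5_000, 9_000, 400_000, 2_000_000]

print(spearman(scores, views))        # 1.0: the rank order agrees perfectly
print(pearson(scores, views) < 1.0)   # True: the relationship is not linear

# A constant shift in absolute score (for example between model versions)
# leaves the rank order, and therefore rho, unchanged.
drifted = [s + 3 for s in scores]
print(spearman(drifted, views))       # 1.0
```

The heavy tail in the view counts is what drags Pearson r down here; the ranks ignore it, which is the property a bucketed verdict relies on.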

Five extractors, one composite.

The score is not a single LLM call. It is a five-stage pipeline where each stage's output is a typed, Zod-validated record stored against a pinned model version and benchmark pool version. Every report is reproducible from those two pins alone.

01 · Ingest. Image or video staged in Supabase Storage; signed URL; metadata extracted.
02 · Multimodal. Vertex AI Gemini 2.5: composition, attention map, brand cues, on-screen text, pacing.
03 · Video AI. Google Video Intelligence: shot detection + labels. Speech-to-Text: transcription.
04 · Saliency. TranSalNet-class model: predicted visual attention heatmap, per frame.
05 · Score. Structured Zod call → 5 KPIs, composite, verdict, decision trace, recommendations.
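
To make the "typed record stored against two pins" idea concrete, here is a rough Python analogue. It is not the Zod schema itself: the class name `ScoreRecord`, the `asset_id` field, the KPI key spellings and the 0–100 KPI range are all assumptions for illustration. Only `model_version`, `benchmark_pool_version` and the three verdict labels come from this page.

```python
from dataclasses import dataclass

# Hypothetical key spellings for the five KPIs named on this page.
KPI_NAMES = (
    "beat_the_skip",
    "get_noticed",
    "brand_impact",
    "sell_proposition",
    "build_brand",
)
VERDICTS = ("SCALE", "SHARPEN", "REBUILD")


@dataclass(frozen=True)  # frozen: a stored record cannot be edited in place
class ScoreRecord:
    asset_id: str                # assumption: some identifier for the ad
    model_version: str           # pin 1
    benchmark_pool_version: str  # pin 2
    kpis: dict
    composite: float
    verdict: str

    def __post_init__(self):
        # Reject records that could not be reproduced later.
        if not self.model_version or not self.benchmark_pool_version:
            raise ValueError("both version pins are required")
        if set(self.kpis) != set(KPI_NAMES):
            raise ValueError("record must carry exactly the five KPIs")
        # Assumed scale: each KPI is a 0-100 score.
        if any(not 0 <= v <= 100 for v in self.kpis.values()):
            raise ValueError("KPI values must be in the range 0-100")
        if self.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict!r}")
```

Validating at construction time mirrors what a schema check at the pipeline boundary does: a record with a missing pin never reaches storage.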

The composite weights, and why they are not opinions.

Composite = Beat the Skip 25% · Get Noticed 20% · Brand Impact 20% · Sell Proposition 20% · Build Brand 15%. The weights were calibrated against held-out outcome data, not chosen by committee. Beat the Skip carries the most weight because on TikTok and Reels an ad that loses the first 2 seconds loses the entire impression, and every downstream KPI is wasted compute. Build Brand carries the least because its payoff is multi-exposure and harder to attribute at the single-ad level; in solo-ad mode we flag it as directional.
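
As a worked illustration of the weighting, the sketch below treats the composite as a plain weighted average of five KPI scores on a 0–100 scale. That combination rule, the key spellings and the example KPI values are assumptions; only the five weights are taken from the line above.

```python
# Weights from the composite definition above; key names are hypothetical.
WEIGHTS = {
    "beat_the_skip": 0.25,
    "get_noticed": 0.20,
    "brand_impact": 0.20,
    "sell_proposition": 0.20,
    "build_brand": 0.15,
}


def composite(kpis):
    """Weighted average of the five KPI scores (assumed 0-100 each)."""
    missing = WEIGHTS.keys() - kpis.keys()
    if missing:
        raise ValueError(f"missing KPIs: {sorted(missing)}")
    return sum(WEIGHTS[name] * kpis[name] for name in WEIGHTS)


# Illustration only: a cut with a strong hook and weak brand building.
example = {
    "beat_the_skip": 80,
    "get_noticed": 70,
    "brand_impact": 60,
    "sell_proposition": 65,
    "build_brand": 50,
}
print(round(composite(example), 1))  # 66.5
```

Because Beat the Skip carries a quarter of the weight, ten points gained there move the composite by 2.5, against 1.5 for the same gain on Build Brand.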

// Verdict mapping (frozen)

  • ≥70 · SCALE → Launch. Composite ≥ 70 AND no individual KPI < 55. Buy media.
  • 55-69 · SHARPEN → Iterate. Composite mid-band. Re-edit the specific weak KPI; do not start over.
  • <55 · REBUILD → Restart. Composite < 55 OR two or more KPIs < 45. Brief is wrong, not the cut.
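
The mapping can be read as a small decision function. The sketch below is one possible reading, not the production rule set: the function name `verdict` is hypothetical, and it assumes that a composite of 70 or more with a single KPI under 55 falls back to SHARPEN, a case the bands above do not spell out.

```python
def verdict(composite, kpis):
    """Map a composite score and its KPI scores to SCALE / SHARPEN / REBUILD."""
    values = list(kpis.values())

    # REBUILD: composite below 55, or two or more KPIs below 45.
    if composite < 55 or sum(v < 45 for v in values) >= 2:
        return "REBUILD"

    # SCALE: composite at 70 or above and no individual KPI below 55.
    if composite >= 70 and all(v >= 55 for v in values):
        return "SCALE"

    # Everything else is treated as the mid-band (assumption for the
    # "composite >= 70 but one weak KPI" case).
    return "SHARPEN"


# Illustration only; KPI names and values are made up.
print(verdict(75, {"a": 80, "b": 70, "c": 72, "d": 76, "e": 77}))  # SCALE
print(verdict(72, {"a": 90, "b": 80, "c": 75, "d": 70, "e": 50}))  # SHARPEN
print(verdict(60, {"a": 85, "b": 80, "c": 75, "d": 40, "e": 42}))  # REBUILD
```

The SCALE and REBUILD conditions cannot both hold: two KPIs under 45 already violate "no individual KPI < 55", so the order of the two checks does not change the outcome.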

A benchmark you can defend in a room.

A score with no peer is just a number. The benchmark is the part that makes the score interpretable, and the part we get challenged on most. Here is exactly how it is built.

| Source | License | How we use it | Ads |
|---|---|---|---|
| Meta Ad Library API | Official public API | Brand-ad cohort, attribute metadata | ~440 |
| TikTok Creative Center | Public top-ads data | CTR percentile, views, engagement | ~700 |
| TikTok Ad Library | Public transparency | All-ads cohort, regional coverage | ~310 |
| Google Ads Transparency | Public transparency | YouTube In-Stream + Shorts, advertiser metadata | ~403 |
| Manual curation | Editorial | Real cuts scored through the live pipeline | ~194 |

We do not scrape. Every source above is an official transparency tool or public API; the licence terms are documented in the data-sourcing appendix. No customer-uploaded ad enters the benchmark without explicit opt-in.

Each ad in the pool carries 22 metadata fields: platform, market, language, category (20), subcategory, brand, advertiser, ad format, duration, dominant attribute set, distinctive-asset density, public engagement signal, CTR band, capture date, source URL, model_version, benchmark_pool_version, confidence_label, scoring_run_id, attribute-detection accuracy class, sampling band, opt-in flag. 500 ads × 43 attributes = 21,500 data points, the line we use when a vendor with a bigger but flatter pool tries to argue volume.

The curve is not flat.

The single most useful chart we own is the one nobody publishes: the scaling curve. As the benchmark cohort grew from 200 to 2,000 ads, the held-out ρ on each platform grew 2–4×. We are not at saturation.

// Held-out OOS ρ over time

Each row is a checkpoint of the benchmark pool at the stated cohort size. The slope, not the level, is the headline.

  • n = 200 · 2025-10: +0.09
  • n = 500 · 2025-12: +0.16
  • n = 900 · 2026-02: +0.23
  • n = 1,400 · 2026-03: +0.27
  • n = 1,800 · 2026-04: +0.30
  • n = 2,047 · 2026-05: +0.31

The shape says two things at once. First, the model is not memorising the cohort; if it were, ρ would plateau by 800 ads. Second, there is more performance to extract; the next 2,000 ads should land us north of ρ +0.40, which is the upper end of human-strategist agreement in our internal comparison.
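
For readers who want to read the slope rather than the level, here is a short sketch that works only from the six checkpoints listed above. The function names are hypothetical, and the "gain per 100 ads" normalisation is our choice of yardstick, not a metric this page defines.

```python
# (cohort size, held-out rho) pairs copied from the checkpoint list above.
CHECKPOINTS = [
    (200, 0.09),
    (500, 0.16),
    (900, 0.23),
    (1400, 0.27),
    (1800, 0.30),
    (2047, 0.31),
]


def overall_multiple(points):
    """Ratio of the last checkpoint's rho to the first."""
    return points[-1][1] / points[0][1]


def gain_per_100_ads(points):
    """Change in rho per 100 added ads between consecutive checkpoints."""
    return [
        (r1 - r0) / (n1 - n0) * 100
        for (n0, r0), (n1, r1) in zip(points, points[1:])
    ]


print(round(overall_multiple(CHECKPOINTS), 1))  # 3.4, inside the 2-4x range
for (n0, _), (n1, _), gain in zip(
    CHECKPOINTS, CHECKPOINTS[1:], gain_per_100_ads(CHECKPOINTS)
):
    print(f"{n0} -> {n1} ads: {gain:+.4f} rho per 100 ads")
```

Recomputing these per-interval gains on each new pool version is a simple way to track whether the curve is still climbing.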

Top quintile, 6.5× the bottom.

Spearman ρ is the right metric for a methods audience. For a buyer, this chart is the one that lands: rank the entire pool by predicted score, split into quintiles, and measure the actual public engagement of each bucket. The top 20% outperform the bottom 20% by 6.5×.

  • Bottom 20%: 1.0×
  • Q2: 1.7×
  • Q3: 2.8×
  • Q4: 4.1×
  • Top 20%: 6.5×

Read this as the answer to the procurement question "if I only listened to your highest-scored 20% of cuts, how much better would my engagement be?". The answer is 6.5× the engagement of the cuts you would have shipped if you only listened to your lowest 20%. The number that matters in a creative testing budget defence is rarely a correlation coefficient; it is this multiplier.
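
The quintile procedure described above (rank by predicted score, split into five buckets, compare actual engagement) can be sketched as follows. This version assumes the per-bucket statistic is the mean and that lift is expressed relative to the bottom bucket; the page does not say whether mean or median is used. The ten ads are toy values, not benchmark data.

```python
from statistics import mean


def quintile_lift(predicted, outcomes):
    """Mean outcome of each predicted-score quintile, relative to the bottom one."""
    if len(predicted) != len(outcomes):
        raise ValueError("predicted and outcomes must have the same length")
    if len(predicted) < 5:
        raise ValueError("need at least five ads to form quintiles")

    # Indices of the ads, ordered from lowest to highest predicted score.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)

    # Five contiguous buckets of (near-)equal size.
    buckets = [order[q * n // 5:(q + 1) * n // 5] for q in range(5)]
    means = [mean(outcomes[i] for i in bucket) for bucket in buckets]

    base = means[0]  # assumes the bottom bucket's mean is non-zero
    return [m / base for m in means]


# Toy example: ten ads with made-up predicted scores and engagement counts.
predicted = [62, 35, 88, 41, 70, 55, 93, 48, 77, 30]
outcomes = [200, 100, 500, 150, 300, 200, 500, 150, 300, 100]

print(quintile_lift(predicted, outcomes))  # [1.0, 1.5, 2.0, 3.0, 5.0]
```

With heavy-tailed engagement data a handful of viral ads can dominate a bucket mean, so it is worth checking whether the multiplier survives a switch to the median.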

The second tool, never the first.

RoastIQ scores the ad. Synthetic Users tells you which buyer is rejecting it. The two belong in a ladder, not a menu, and we enforce that at the data layer: every synthetic_panel_runs record carries a from_roastiq_run_id. No score, no panel.

  • The panel runs against 36 personas, a calibrated set with cultural, demographic and category-purchase priors stored in the directory.
  • Personas are scenario-grounded synthetic buyers, not a real consumer panel. We do not claim concurrent validity with survey panels; we claim better-than-strategist diagnostic specificity at machine cost.
  • The deliverable is a buyer-resistance breakdown: which persona rejected the cut, which KPI drove the rejection, what edit they would accept.
  • The panel inherits the decision trace from the RoastIQ run, so the diagnosis is specific to this ad, not generic to the category.
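
One way to picture "no score, no panel" at the data layer is a non-null foreign key. The sketch below uses an in-memory SQLite database purely as a stand-in; the real storage engine, the `roastiq_runs` table name, the `id` columns and the `create_panel_run` helper are assumptions. Only `synthetic_panel_runs` and `from_roastiq_run_id` are names taken from this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE roastiq_runs (
        id TEXT PRIMARY KEY
    );
    CREATE TABLE synthetic_panel_runs (
        id TEXT PRIMARY KEY,
        -- NOT NULL + REFERENCES: a panel run must point at an existing score.
        from_roastiq_run_id TEXT NOT NULL REFERENCES roastiq_runs(id)
    );
    """
)


def create_panel_run(conn, panel_id, roastiq_run_id):
    """Insert a panel run; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO synthetic_panel_runs (id, from_roastiq_run_id) "
                "VALUES (?, ?)",
                (panel_id, roastiq_run_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


with conn:
    conn.execute("INSERT INTO roastiq_runs (id) VALUES (?)", ("run-1",))

print(create_panel_run(conn, "panel-1", "run-1"))    # True: score exists
print(create_panel_run(conn, "panel-2", "missing"))  # False: no such score
print(create_panel_run(conn, "panel-3", None))       # False: link is required
```

Putting the rule in the schema rather than in application code means no code path can create an orphan panel run.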
House rule

"Synthetic Users never runs without a RoastIQ result. The from-link is what makes the panel's diagnosis a diagnosis."

Five papers in flight.

The benchmark is also a research instrument. The publication pipeline is sequenced for credibility, not vanity. We will publish the misses alongside the hits.

Paper 01 · Target: JMR
AI-vs-survey concurrent validity for pre-spend creative testing
Head-to-head: SaliencyLab score, Kantar Link AI–style survey methodology, and live in-market outcomes on a held-out 200-ad cohort. The flagship.
Status: data collection · Co-author signed
Paper 02 · Target: Journal of Advertising
Counter-intuitive findings from 2,000 scored ads
Sound-off completion, brand cue timing, the UGC convergence trap, the "Two-Headed Brief" pattern: findings that contradict the conventional wisdom on each platform.
Status: draft outline
Paper 03 · Target: Marketing Science
Methodological critique of legacy creative testing
A formal critique of survey-recall methodology when applied to short-form social formats. The construct mismatch, the cost-speed asymmetry, the implications.
Status: literature review
Paper 04 · Target: ISR
LLM-as-judge inter-rater reliability for ad attribute detection
A formal IRR study against three human coders on a 500-ad cohort. We expect roughly 85% attribute-detection accuracy; the paper will publish the misses.
Status: coder recruitment
Paper 05 · Target: Scientific Data
SaliencyLab Open Benchmark, a dataset paper
The benchmark itself, released CC-BY-NC, with the validation code on GitHub. The infrastructure paper that makes the rest of the program reproducible.
Status: dataset finalisation
Bonus · Working
A press pipeline that does not embarrass anyone
Eight angles in the press pipeline (founder story, open dataset, $2K runway, vs-Kantar, technical depth), each anchored to a defensible number, no inflation, no exclusivity games.
Status: rolling

Open data, open methodology, open ceiling.

Three commitments. We will publish the benchmark dataset under CC-BY-NC. We will publish the validation code on GitHub. We will publish the model versions, the benchmark pool versions, and the confidence labels on every report we hand to a buyer.

  • Reproducible per report. Every report stores its model_version, benchmark_pool_version, confidence_labels, decision_trace and recommendations. You can re-score the same asset months later and compare drift.
  • Reproducible at the pool level. The benchmark pool versions are immutable snapshots; new ads enter the next version, never the current one. Quintile lift and ρ are recomputed and published per version.
  • Reproducible by anyone. The data sources are public, the code is on GitHub, the dataset is CC-BY-NC. A skeptical researcher can re-run the validation without our cooperation.
What this is not

"This is not a benchmark with proprietary outcome data we refuse to show. The whole point is the opposite: the credibility of the model lives in its reproducibility, not in a vendor's promise."

Research, asked plainly.

What does SaliencyLab actually predict?
Engagement and click intent on the platforms where we have public outcome data: TikTok likes/shares/comments, TikTok CTR percentile bands, YouTube view counts. Held-out OOS Spearman ρ +0.30–0.32 (2026-05-05). We do not predict sales lift, ROAS, attributed conversion or brand recall. Those are downstream business outcomes we cannot read pre-spend, and any vendor who claims to is selling you a regression on someone else's spend.
How is the benchmark constructed?
1,200+ ads with public outcome data, ingested from Meta Ad Library, TikTok Creative Center, TikTok Ad Library and Google Ads Transparency Center, with no scraping. Each ad is scored through the same pipeline, tagged with 22 metadata fields, and stored against a pinned benchmark_pool_version. Quintile lift across the pool is 6.5× (top 20% vs bottom 20% by predicted score).
What does Spearman ρ +0.31 actually mean?
It means the rank order our model assigns to ads agrees with the rank order of their public engagement about as well as a senior creative strategist's intuition, at 90 seconds per ad and €0.005 in compute. For a buyer the more useful number is the quintile lift: the top 20% of ads by predicted score engage 6.5× more than the bottom 20% on public outcomes.
Why is Meta-Feed only "directional"?
Because the brand-ad outcome data on Meta is sparse: the Ad Library exposes the asset but not the engagement signal at the granularity we need. We honour that in every Meta-Feed report: the scores are flagged as directional defaults until our brand-ad cohort is dense enough to validate. We will not publish a ρ until the cohort supports it.
How do you defend the benchmark against a vendor with a bigger pool?
Two lines. One: "500 ads × 43 attributes = 21,500 data points"; depth matters more than width when the model has to discriminate between Sharpen and Rebuild on a single cut. Two: ask the vendor to publish their scaling curve. If theirs is flat, the model is memorising the cohort. Ours is not flat; the chart is in chapter 06.
Where does the research go next?
Five papers in pipeline: a JMR flagship on concurrent validity, a Journal of Advertising piece on counter-intuitive findings, a Marketing Science methodological critique, an ISR paper on LLM-as-judge reliability, and a Scientific Data dataset paper. The dataset will be released CC-BY-NC; the validation code will be on GitHub.
Do you use customer ads in the benchmark?
Only with explicit opt-in. Customer uploads are stored in a private workspace, scored, and removed from staging after processing. The Score-My-Ad collaboration is the explicit opt-in route; that is how the benchmark grows with consent.
What is the model version and how do I see it on a report?
Every report carries model_version and benchmark_pool_version in its footer. You can re-score the same asset months later, compare the verdicts, and read the drift. We do not silently re-version live reports; a new model gets a new version pin, full stop.