Lab notebook · updated 18 May 2026 · 16 min read

The lab notebook behind RoastIQ.

What we measure, how we validate it, where the ceiling sits, and what is still wrong. The same document we hand to a procurement team that wants to compare us to Kantar Link AI, and the same document the PhD co-author we publish with reads before signing her name.

2,047 ads in pipeline · 1,200+ with public outcomes · ρ +0.31 held-out OOS · 6.5× quintile lift · 5 papers in pipeline

Oussama Nakhil · Founder & CEO

Multiple years buyer-side: NielsenIQ insights, then L'Oréal Groupe in global marketing insights, where I managed Kantar as a vendor and learned the unit economics of legacy testing from the inside.

+0.32 · YouTube view-count Spearman ρ (5-fold CV, n=403)
+0.31 · TikTok engagement ρ (n=700)
+0.30 · TikTok CTR ρ (5-fold CV, n=691)

01 · Orientation

A notebook, not a brochure.

Every number on this page is reproducible. Every claim is bounded. The point of this document is not to convince you SaliencyLab is good; it is to give you enough scaffolding to decide for yourself.

The market for creative testing is full of vendors who quote a single accuracy figure and refuse to show the cohort behind it. We do the opposite. The pages below set out, in order: what construct we measure, how we measure it, against what outcomes we validate, how the benchmark is built, how the score scales as the pool grows, and where the model is still weak.

If you are reading this as a buyer, jump to chapter 03. If you are reading as a researcher, the methodology starts at chapter 04 and the open-data plan at chapter 10. If you are reading as a competitor, welcome: the scaling curve is the part you want.

Operating principle

"Publish the ceiling, not the wins. Buyers can forgive a known limit; they will not forgive a hidden one."

02 · Construct validity

What we predict. And what we don't.

SaliencyLab predicts pre-spend creative quality as a signal of two specific downstream behaviours: engagement (likes, shares, comments, view-through) and click intent (CTR percentile). We do not predict the third leg, in-market business outcomes, and we will not pretend otherwise.

Construct	Measured by	Validated against	Status
Engagement	5 KPI composite	TikTok likes/shares/comments, YouTube view counts	Public OOS
Click intent	CTR sub-model	TikTok CTR percentile bands	Public OOS
Visual attention	Saliency heatmap	TranSalNet predictions (not measured eye-tracking)	Directional
Sales lift / ROAS			Out of scope
Brand recall (survey)			Out of scope
Meta-Feed performance	5 KPI composite	Brand-ad cohort sparse	Directional

The split matters because the cost of confusing it is real. A creative-quality model that quietly takes credit for media-spend outcomes ships overclaims; a brand-recall benchmark that quietly takes credit for engagement performance fails the next post-mortem. The boundary above is the boundary we hold.

03 · Validation

Held-out, out of sample, publicly reproducible.

The validation numbers below are from the 2026-05-05 run. Held-out means the ads were not in the training distribution; out-of-sample means the outcome data was unseen at scoring time; reproducible means the public outcome source is in the methodology appendix and you can pull it yourself.

// Held-out OOS Spearman ρ, May 2026

Higher = our score rank-orders ads the same way real-world public outcomes do. Ceiling is fundamentally bounded by reach × format × audience noise, not by model capacity.

TikTok engagement: +0.31
TikTok CTR (5-fold CV): +0.30
YouTube view counts (5-fold CV): +0.32
Meta-Feed (brand): n/a

For context: a senior creative strategist asked to rank 50 ads against public outcomes lands around ρ +0.25–0.35 in our internal comparison study, with a turnaround time measured in days. The model lands in the same band in 90 seconds at €0.005 per ad. The point of automation here is not to beat a human; it is to deliver the human's ceiling at machine cost and speed.

Methodology note. Spearman ρ (rather than Pearson r) because we care about rank order, not absolute scale; the downstream decision is bucketed (Scale / Sharpen / Rebuild), and rank stability is what makes a verdict robust to small absolute-score drift between model versions.
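To make the rank-order point concrete, here is a minimal sketch of Spearman ρ in plain Python: rank both lists (ties get the average rank), then take the Pearson correlation of the ranks. The function names and the five-ad toy data are hypothetical illustrations, not the validation code or real benchmark rows.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho = Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy data (hypothetical): predicted scores and public engagement counts.
toy_scores = [62, 71, 48, 80, 55]
toy_engagement = [1200, 900, 300, 150000, 700]

rho = spearman_rho(toy_scores, toy_engagement)  # 0.9 for this toy data
# A monotone rescaling of the outcome leaves the ranks, and so rho, unchanged.
rho_rescaled = spearman_rho(toy_scores, [e ** 0.5 for e in toy_engagement])
```

The outlier of 150,000 views would dominate a Pearson r on raw counts; on ranks it is simply "first place", which is why the metric is stable under the heavy-tailed engagement distributions typical of public ad data.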

04 · Scoring layers

Five extractors, one composite.

The score is not a single LLM call. It is a five-stage pipeline where each stage's output is a typed, Zod-validated record stored against a pinned model version and benchmark pool version. Every report is reproducible from those two pins alone.

Ingest. Image or video staged in Supabase Storage; signed URL; metadata extracted.
Multimodal. Vertex AI Gemini 2.5: composition, attention map, brand cues, on-screen text, pacing.
Video AI. Google Video Intelligence: shot detection + labels. Speech-to-Text: transcription.
Saliency. TranSalNet-class model: predicted visual attention heatmap, per frame.
Score. Structured Zod call → 5 KPIs, composite, verdict, decision trace, recommendations.

The composite weights, and why they are not opinions.

Composite = Beat the Skip 25% · Get Noticed 20% · Brand Impact 20% · Sell Proposition 20% · Build Brand 15%. The weights were calibrated against held-out outcome data, not chosen by committee. Beat the Skip carries the most weight because on TikTok and Reels an ad that loses the first 2 seconds loses the entire impression, and every downstream KPI is wasted compute. Build Brand carries the least because its payoff is multi-exposure and harder to attribute at the single-ad level; in solo-ad mode we flag it as directional.
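As a worked illustration of the weighting, the sketch below computes a composite as a weighted arithmetic mean of the five KPIs. Two things here are assumptions rather than statements from the pipeline: that each KPI is on a 0–100 scale (suggested by the verdict thresholds), and that the aggregation is a plain weighted sum. The dictionary keys and the example KPI values are hypothetical.

```python
# Weights as stated in the text; they sum to 1.0.
WEIGHTS = {
    "beat_the_skip": 0.25,
    "get_noticed": 0.20,
    "brand_impact": 0.20,
    "sell_proposition": 0.20,
    "build_brand": 0.15,
}


def composite_score(kpis):
    """Weighted sum of the five KPIs (assumed 0-100 each)."""
    if set(kpis) != set(WEIGHTS):
        raise ValueError("expected exactly the five KPI names")
    return sum(WEIGHTS[name] * kpis[name] for name in WEIGHTS)


# Hypothetical ad: strong hook, weak long-term brand building.
example_kpis = {
    "beat_the_skip": 80,
    "get_noticed": 70,
    "brand_impact": 60,
    "sell_proposition": 65,
    "build_brand": 50,
}
example_composite = composite_score(example_kpis)
# 0.25*80 + 0.20*70 + 0.20*60 + 0.20*65 + 0.15*50 = 66.5
```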

// Verdict mapping (frozen)

SCALE (≥70) → Launch. Composite ≥ 70 AND no individual KPI < 55. Buy media.
SHARPEN (55–69) → Iterate. Composite mid-band. Re-edit the specific weak KPI; do not start over.
REBUILD (<55) → Restart. Composite < 55 OR two or more KPIs < 45. Brief is wrong, not the cut.
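The verdict mapping can be sketched as a small decision function. The SCALE and REBUILD conditions are taken from the mapping above; the fallback is an assumption, since the mapping does not say what happens when the composite is ≥ 70 but one KPI sits below 55, or when the composite falls between 69 and 70. This sketch sends every such case to SHARPEN. The function name and example values are hypothetical.

```python
def verdict(composite, kpi_values):
    """Map a composite and its five KPI values to SCALE / SHARPEN / REBUILD."""
    kpi_values = list(kpi_values)
    weak_kpis = sum(1 for v in kpi_values if v < 45)
    # REBUILD: composite below 55, or two or more KPIs below 45.
    if composite < 55 or weak_kpis >= 2:
        return "REBUILD"
    # SCALE: composite at least 70 and no individual KPI below 55.
    if composite >= 70 and all(v >= 55 for v in kpi_values):
        return "SCALE"
    # Assumption: everything else is treated as the mid-band verdict.
    return "SHARPEN"


# Hypothetical cases.
clean_winner = verdict(74, [80, 72, 70, 75, 60])         # SCALE
high_but_one_weak = verdict(72, [90, 85, 80, 75, 50])     # SHARPEN (assumed)
two_kpis_collapsed = verdict(60, [85, 80, 70, 40, 40])    # REBUILD
low_composite = verdict(50, [60, 55, 50, 45, 45])         # REBUILD
```

Checking REBUILD before SCALE does not change any outcome, because an ad with two KPIs under 45 can never satisfy the "no KPI below 55" condition.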

05 · Benchmark

A benchmark you can defend in a room.

A score with no peer is just a number. The benchmark is the part that makes the score interpretable, and the part we get challenged on most. Here is exactly how it is built.

Source	License	How we use it	Ads
Meta Ad Library API	Official public API	Brand-ad cohort, attribute metadata	~440
TikTok Creative Center	Public top-ads data	CTR percentile, views, engagement	~700
TikTok Ad Library	Public transparency	All-ads cohort, regional coverage	~310
Google Ads Transparency	Public transparency	YouTube In-Stream + Shorts, advertiser metadata	~403
Manual curation	Editorial	Real cuts scored through the live pipeline	~194

We do not scrape. Every source above is an official transparency tool or public API; the licence terms are documented in the data-sourcing appendix. No customer-uploaded ad enters the benchmark without explicit opt-in.

Each ad in the pool carries 22 metadata fields: platform, market, language, category (20), subcategory, brand, advertiser, ad format, duration, dominant attribute set, distinctive-asset density, public engagement signal, CTR band, capture date, source URL, model_version, benchmark_pool_version, confidence_label, scoring_run_id, attribute-detection accuracy class, sampling band, opt-in flag. 500 ads × 43 attributes = 21,500 data points, the line we use when a vendor with a bigger but flatter pool tries to argue volume.

06 · Scaling

The curve is not flat.

The single most useful chart we own is the one nobody publishes: the scaling curve. As the benchmark cohort grew from 200 to 2,000 ads, the held-out ρ on each platform grew 2–4×. We are not at saturation.

// Held-out OOS ρ over time

Each row is a benchmark-pool checkpoint, from 200 to 2,047 ads. The slope, not the level, is the headline.

n = 200 · 2025-10 · +0.09
n = 500 · 2025-12 · +0.16
n = 900 · 2026-02 · +0.23
n = 1,400 · 2026-03 · +0.27
n = 1,800 · 2026-04 · +0.30
n = 2,047 · 2026-05 · +0.31

The shape says two things at once. First, the model is not memorising the cohort; if it were, ρ would plateau by 800 ads. Second, there is more performance to extract: we project that the next 2,000 ads will land us north of ρ +0.40, above the +0.25–0.35 band that senior strategists reach in our internal comparison.
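Since the slope is the headline, it helps to read it directly off the checkpoints. The sketch below takes the six (pool size, ρ) pairs from the chart and computes the gain in ρ per 100 added ads between consecutive checkpoints, plus the overall multiple from the first checkpoint to the last. The function name is hypothetical; the numbers are the ones listed above.

```python
# (benchmark pool size, held-out OOS rho) checkpoints from the chart.
CHECKPOINTS = [
    (200, 0.09),
    (500, 0.16),
    (900, 0.23),
    (1400, 0.27),
    (1800, 0.30),
    (2047, 0.31),
]


def rho_gain_per_100_ads(points):
    """Return (n_start, n_end, delta_rho per 100 ads) for each consecutive pair."""
    gains = []
    for (n0, r0), (n1, r1) in zip(points, points[1:]):
        gains.append((n0, n1, (r1 - r0) / (n1 - n0) * 100))
    return gains


segment_gains = rho_gain_per_100_ads(CHECKPOINTS)
overall_multiple = CHECKPOINTS[-1][1] / CHECKPOINTS[0][1]  # about 3.4x

for n0, n1, gain in segment_gains:
    print(f"{n0:>5} -> {n1:>5} ads: {gain:+.4f} rho per 100 ads")
```

Every segment shows a positive gain, and the overall multiple of roughly 3.4× sits inside the 2–4× range quoted for the growth from 200 to about 2,000 ads.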

07 · Quintile lift

Top quintile, 6.5× the bottom.

Spearman ρ is the right metric for a methods audience. For a buyer, this chart is the one that lands: rank the entire pool by predicted score, split into quintiles, and measure the actual public engagement of each bucket. The top 20% outperform the bottom 20% by 6.5×.

Bottom 20%: 1.0×
Second quintile: 1.7×
Middle quintile: 2.8×
Fourth quintile: 4.1×
Top 20%: 6.5×

Read this as the answer to the procurement question "if I only listened to your highest-scored 20% of cuts, how much better would my engagement be?" The answer: 6.5× the public engagement you would have seen from shipping only the lowest-scored 20%. The number that matters in a creative-testing budget defence is rarely a correlation coefficient; it is this multiplier.
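The quintile procedure described above (rank by predicted score, split into five buckets, compare actual engagement) can be sketched in a few lines. One assumption is flagged: the text does not say whether each bucket is summarised by its mean or its median engagement, so this sketch uses the mean and normalises every bucket to the bottom one. The function name and the ten-ad toy data are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def quintile_lift(predicted_scores, actual_engagement):
    """Lift of each score quintile's mean engagement over the bottom quintile."""
    if len(predicted_scores) != len(actual_engagement):
        raise ValueError("scores and outcomes must have the same length")
    n = len(predicted_scores)
    if n < 5:
        raise ValueError("need at least 5 ads to form quintiles")
    # Sort ads by predicted score, lowest first.
    ranked = sorted(zip(predicted_scores, actual_engagement), key=lambda p: p[0])
    # Integer boundaries spread any remainder across the buckets.
    buckets = [ranked[i * n // 5:(i + 1) * n // 5] for i in range(5)]
    means = [sum(outcome for _, outcome in b) / len(b) for b in buckets]
    if means[0] == 0:
        raise ValueError("bottom quintile has zero engagement; lift is undefined")
    return [m / means[0] for m in means]


# Toy data (hypothetical): ten ads, engagement rising with predicted score.
toy_scores = [31, 35, 42, 47, 53, 58, 64, 69, 77, 84]
toy_engagement = [10, 10, 20, 20, 30, 30, 40, 40, 50, 50]
toy_lifts = quintile_lift(toy_scores, toy_engagement)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

The last element of the returned list is the top-versus-bottom multiplier, the figure reported as 6.5× for the full pool.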

08 · Synthetic Users

The second tool, never the first.

RoastIQ scores the ad. Synthetic Users tells you which buyer is rejecting it. The two belong in a ladder, not a menu, and we enforce that at the data layer: every synthetic_panel_runs record carries a from_roastiq_run_id. No score, no panel.

The panel runs against 36 personas, a calibrated set with cultural, demographic and category-purchase priors stored in the directory.
Personas are scenario-grounded synthetic buyers, not a real consumer panel. We do not claim concurrent validity with survey panels; we claim better-than-strategist diagnostic specificity at machine cost.
The deliverable is a buyer-resistance breakdown: which persona rejected the cut, which KPI drove the rejection, what edit they would accept.
The panel inherits the decision trace from the RoastIQ run, so the diagnosis is specific to this ad, not generic to the category.
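One way to picture "no score, no panel" at the data layer is a NOT NULL foreign key. The sketch below uses an in-memory SQLite database purely as a stand-in; it is not the production schema. The names synthetic_panel_runs and from_roastiq_run_id come from the text above, while the roastiq_runs table, its id column, and the helper function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE roastiq_runs (
        id TEXT PRIMARY KEY
    );
    CREATE TABLE synthetic_panel_runs (
        id TEXT PRIMARY KEY,
        -- Every panel run must point at an existing RoastIQ run.
        from_roastiq_run_id TEXT NOT NULL REFERENCES roastiq_runs(id)
    );
    """
)


def try_insert_panel(panel_id, roastiq_run_id):
    """Insert a panel run; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO synthetic_panel_runs VALUES (?, ?)",
                (panel_id, roastiq_run_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


with conn:
    conn.execute("INSERT INTO roastiq_runs VALUES ('run-1')")

linked_ok = try_insert_panel("panel-1", "run-1")        # True: score exists
orphan_rejected = try_insert_panel("panel-2", "run-9")  # False: no such run
null_rejected = try_insert_panel("panel-3", None)       # False: link is required
```

Putting the rule in the schema rather than in application code means a panel without a score cannot be written at all, whichever service attempts it.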

House rule

"Synthetic Users never runs without a RoastIQ result. The from-link is what makes the panel's diagnosis a diagnosis."

09 · Research pipeline

Five papers in flight.

The benchmark is also a research instrument. The publication pipeline is sequenced for credibility, not vanity. We will publish the misses alongside the hits.

Paper 01 · Target: JMR

AI-vs-survey concurrent validity for pre-spend creative testing

Head-to-head: SaliencyLab score, Kantar Link AI–style survey methodology, and live in-market outcomes on a held-out 200-ad cohort. The flagship.

Status: data collection · Co-author signed

Paper 02 · Target: Journal of Advertising

Counter-intuitive findings from 2,000 scored ads

Sound-off completion, brand cue timing, the UGC convergence trap, the "Two-Headed Brief" pattern: findings that contradict the conventional wisdom on each platform.

Status: draft outline

Paper 03 · Target: Marketing Science

Methodological critique of legacy creative testing

A formal critique of survey-recall methodology when applied to short-form social formats. The construct mismatch, the cost-speed asymmetry, the implications.

Status: literature review

Paper 04 · Target: ISR

LLM-as-judge inter-rater reliability for ad attribute detection

A formal IRR study against three human coders on a 500-ad cohort. We expect roughly 85% attribute-detection accuracy; the paper will report the measured figure and publish the misses.

Status: coder recruitment

Paper 05 · Target: Scientific Data

SaliencyLab Open Benchmark, a dataset paper

The benchmark itself, released CC-BY-NC, with the validation code on GitHub. The infrastructure paper that makes the rest of the program reproducible.

Status: dataset finalisation

Bonus · Working

A press pipeline that does not embarrass anyone

Eight angles in the press pipeline (among them founder story, open dataset, $2K runway, vs-Kantar, technical depth), each anchored to a defensible number. No inflation, no exclusivity games.

Status: rolling

10 · Open methodology

Open data, open methodology, open ceiling.

Three commitments. We will publish the benchmark dataset under CC-BY-NC. We will publish the validation code on GitHub. We will publish the model versions, the benchmark pool versions, and the confidence labels on every report we hand to a buyer.

Reproducible per report. Every report stores its model_version, benchmark_pool_version, confidence_labels, decision_trace and recommendations. You can re-score the same asset months later and compare drift.
Reproducible at the pool level. The benchmark pool versions are immutable snapshots; new ads enter the next version, never the current one. Quintile lift and ρ are recomputed and published per version.
Reproducible by anyone. The data sources are public, the code is on GitHub, the dataset is CC-BY-NC. A skeptical researcher can re-run the validation without our cooperation.
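To illustrate per-report reproducibility, here is a minimal sketch of a pinned report record and a drift comparison between two scorings of the same asset. The field names model_version and benchmark_pool_version come from the text; the class, the function, the version strings, and the scores are hypothetical and stand in for whatever the real report schema contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored report is never edited in place
class ReportPins:
    model_version: str
    benchmark_pool_version: str
    composite: float
    verdict: str


def describe_drift(earlier, later):
    """Compare two reports for the same asset and summarise what moved."""
    return {
        "model_changed": earlier.model_version != later.model_version,
        "pool_changed": earlier.benchmark_pool_version != later.benchmark_pool_version,
        "composite_delta": round(later.composite - earlier.composite, 2),
        "verdict_changed": earlier.verdict != later.verdict,
    }


# Hypothetical re-score of one asset a few months apart.
first_report = ReportPins("model-a", "pool-1", 66.5, "SHARPEN")
second_report = ReportPins("model-b", "pool-2", 71.0, "SCALE")
drift = describe_drift(first_report, second_report)
```

Because both pins are stored with each report, a changed verdict can be attributed to a new model, a new pool, or both, rather than left unexplained.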

What this is not

"This is not a benchmark with proprietary outcome data we refuse to show. The whole point is the opposite: the credibility of the model lives in its reproducibility, not in a vendor's promise."

Questions we keep getting

Research, asked plainly.

What does SaliencyLab actually predict?

Engagement and click intent on the platforms where we have public outcome data: TikTok likes/shares/comments, TikTok CTR percentile bands, YouTube view counts. Held-out OOS Spearman ρ +0.30–0.32 (2026-05-05). We do not predict sales lift, ROAS, attributed conversion or brand recall. Those are downstream business outcomes we cannot read pre-spend, and any vendor who claims to is selling you a regression on someone else's spend.

How is the benchmark constructed?

1,200+ ads with public outcome data, ingested from Meta Ad Library, TikTok Creative Center, TikTok Ad Library and Google Ads Transparency Center, no scraping. Each ad is scored through the same pipeline, tagged with 22 metadata fields, and stored against a pinned benchmark_pool_version. Quintile lift across the pool is 6.5× (top 20% vs bottom 20% by predicted score).

What does Spearman ρ +0.31 actually mean?

It means the rank order our model assigns to ads agrees with the rank order of their public engagement about as well as a senior creative strategist's intuition, at 90 seconds per ad and €0.005 in compute. For a buyer the more useful number is the quintile lift: the top 20% of ads by predicted score engage 6.5× more than the bottom 20% on public outcomes.

Why is Meta-Feed only "directional"?

Because the brand-ad outcome data on Meta is sparse: the Ad Library exposes the asset but not the engagement signal at the granularity we need. We honour that in every Meta-Feed report: the scores are flagged as directional defaults until our brand-ad cohort is dense enough to validate. We will not publish a ρ until the cohort supports it.

How do you defend the benchmark against a vendor with a bigger pool?

Two lines. One: "500 ads × 43 attributes = 21,500 data points"; depth matters more than width when the model has to discriminate between Sharpen and Rebuild on a single cut. Two: ask the vendor to publish their scaling curve. If theirs is flat, the model is memorising the cohort. Ours is not flat; the chart is in chapter 06.

Where does the research go next?

Five papers in pipeline: a JMR flagship on concurrent validity, a Journal of Advertising piece on counter-intuitive findings, a Marketing Science methodological critique, an ISR paper on LLM-as-judge reliability, and a Scientific Data dataset paper. The dataset will be released CC-BY-NC; the validation code will be on GitHub.

Do you use customer ads in the benchmark?

Only with explicit opt-in. Customer uploads are stored in a private workspace, scored, and removed from staging after processing. The Score-My-Ad collaboration is the explicit opt-in route; that is how the benchmark grows with consent.

What is the model version and how do I see it on a report?

Every report carries model_version and benchmark_pool_version in its footer. You can re-score the same asset months later, compare the verdicts, and read the drift. We do not silently re-version live reports; a new model gets a new version pin, full stop.
