Methodology · published 18 May 2026 · 7 min read

We do not scrape.
Here is the proof.

Every ad in the SaliencyLab benchmark pool comes from an official transparency surface or from manual editorial curation with documented provenance. Per-ad audit log, licence-class breakdown, customer opt-in pathway, GDPR and CCPA posture: all on one page, all auditable on request.

5 public sources · 2,047 ads in pool · 0 scraped · 22 metadata fields per ad · v.2026-05 pinned

Oussama Nakhil · Founder & CEO

Multiple years buyer-side: NielsenIQ insights, then L'Oréal Groupe in global marketing insights, where vendor procurement, data-licensing terms, and brand-safety audits were part of every quarter. This page is written from the seat of the buyer who used to ask these questions.

01 · Why this matters

The question after the demo.

Almost every procurement conversation after a strong product demo arrives at the same question, usually phrased politely: "how exactly did you build the benchmark, and is anything in it going to embarrass us six months from now?"

The reason the question matters is not academic. A creative-testing benchmark sits inside a marketing-data supply chain that already includes a media agency, a creative agency, a measurement vendor, and a brand-safety vendor. A new tool in that stack inherits the legal posture of its weakest input. If a vendor scraped a platform to populate its benchmark, the customer who scores their ads against that benchmark is exposed to whatever action the platform takes against the vendor. Procurement teams know this. So do the legal teams that sign off on the SOW.

The honest defense is not "trust us." The honest defense is a per-row audit log. Every ad in pool v.2026-05 carries a source_url, a capture_date, a licence_class, an opt_in_flag, and a benchmark_pool_version. That is the row a procurement team should be able to ask to see, for one ad, for ten ads, for any ad they care about. We can produce it.

The procurement test

"If a vendor cannot point at one ad in their pool and show you the source URL, the capture date, and the licence class, the benchmark is not a benchmark. It is a marketing slide with a number on it."

02 · The sources

Five sources. All licensed.

Every ad in the 2,047-row pool enters through one of five pathways. Four are official platform transparency surfaces. The fifth is editorial curation with documented provenance from public, opted-in, or commissioned material. We argue every pathway from its licence terms, not from convenience.

Source	Licence class	What we use it for	What we never use it for	Ads
Meta Ad Library API	Official API	Brand-ad cohort, advertiser metadata, ad format, creative URL via documented endpoints	HTML-scraping the Ad Library web UI, bypassing rate limits, joining to non-public user data	~440
TikTok Creative Center	Public transparency	Top-ads cohort, CTR percentile bands, view counts, engagement signals for outcome calibration	Reposting creator videos outside the transparency frame, monetising the assets, joining to creator-private data	~700
TikTok Ad Library	Public transparency	Broader all-ads cohort, regional and language coverage, transparency-obligation surface	Mass enumeration outside the published surface, scraping logged-out content not exposed by the library	~310
Google Ads Transparency Center	Public disclosure	YouTube In-Stream + Shorts cohort, advertiser disclosure metadata, ad format and region	Joining to Google Ads account data, re-uploading the asset for commercial republishing	~403
Manual editorial curation	Documented provenance	Real cuts scored through the live pipeline, with per-ad provenance log captured at the moment of ingestion	Anonymous uploads with no source, customer assets without opt-in flag, anything failing the provenance gate	~194

Notice the structure of the table. Each row has a "what we never use it for" column, and that is the column that matters in a procurement conversation. A vendor who only publishes what they ingest has no observable line; a vendor who publishes both sides of the line has an auditable practice.

The four platform surfaces exist because regulators told the platforms to publish them. Meta launched the Ad Library in response to political-ad transparency obligations; TikTok and Google followed for the same reason. Using the transparency surfaces is the legally intended pathway: they were built so a third party could programmatically inspect what ads are running. That is what we do. Nothing more.

03 · One ad, full audit trail

A single row, fully traced.

An abstract claim is a marketing line. A worked example is an audit. Below is the actual metadata structure that ships with one ad in pool v.2026-05: a Magalu (Brazil) e-commerce cut from Q1 2026, sourced via TikTok Creative Center. This is the row a procurement team can ask to see.

// benchmark_pool.row · v.2026-05 · row_id 01HQ9V…
{
  "ad_id": "slbm_2026_05_01492",
  "brand": "Magazine Luiza (Magalu)",
  "advertiser_id": "tt_creative_center_magalu_br",
  "platform": "tiktok",
  "market": "BR",
  "language": "pt-BR",
  "category": "retail · e-commerce",
  "format": "vertical 9:16 · 22s",
  "source": "tiktok_creative_center",
  "source_url": "https://ads.tiktok.com/business/creativecenter/topads/br/pc/en?period=30&region=BR&industry=retail",
  "licence_class": "public_transparency",
  "opt_in_flag": false,
  "capture_date": "2026-04-22T09:14:00Z",
  "capture_method": "public_endpoint_documented",
  "public_engagement_signal": "ctr_pct_band_top10",
  "benchmark_pool_version": "v.2026-05",
  "model_version": "gemini-2.5-flash · scoring-v3.1",
  "scoring_run_id": "run_01HRD3A1Q2C…",
  "confidence_label": "high",
  "sampling_band": "BR · retail · q1-2026",
  "residency": "eu-frankfurt"
}

Every field in that row is queryable. A buyer can ask us to filter the pool to "all BR retail ads captured between March and May 2026 with confidence_label = high" and we can return the list with all provenance fields attached. The row above is one of 2,047. The procurement defense is not "the pool is clean"; it is "every row is traceable to a public URL captured at a recorded moment under a stated licence class." Those are different sentences. The second one is auditable.
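The kind of filter described above can be sketched in a few lines. This is a minimal illustration over in-memory rows, not the production query: the function names, the prefix match on category, the half-open date window, and the second sample row are all assumptions made for the example.

```python
from datetime import datetime, timezone

def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def filter_pool(rows, market, category_prefix, start, end, confidence="high"):
    """Return rows matching market, category, confidence and a half-open capture window."""
    return [
        row for row in rows
        if row["market"] == market
        and row["category"].startswith(category_prefix)
        and row["confidence_label"] == confidence
        and start <= parse_ts(row["capture_date"]) < end
    ]

# Two illustrative rows: the first reuses values from the worked example,
# the second is hypothetical and exists only to show a row being excluded.
rows = [
    {"ad_id": "slbm_2026_05_01492", "market": "BR", "category": "retail · e-commerce",
     "confidence_label": "high", "capture_date": "2026-04-22T09:14:00Z"},
    {"ad_id": "hypothetical_row", "market": "BR", "category": "retail · e-commerce",
     "confidence_label": "low", "capture_date": "2026-04-23T10:00:00Z"},
]

# "Between March and May 2026" read as 1 March (inclusive) to 1 June (exclusive), UTC.
start = datetime(2026, 3, 1, tzinfo=timezone.utc)
end = datetime(2026, 6, 1, tzinfo=timezone.utc)
matches = filter_pool(rows, "BR", "retail", start, end)
print([row["ad_id"] for row in matches])
```

Because each returned row is the full record, the provenance fields travel with the result rather than being joined in afterwards.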

Why the source_url is non-negotiable

A common shortcut in legacy benchmark construction is to strip the source URL once the asset is ingested, to keep the pool "clean" of platform-specific metadata. We do the opposite. The source URL is kept, indexed, and queryable. When a customer wants to verify provenance on a specific ad, the URL is the chain of custody. When a platform changes its transparency-surface structure (Meta and TikTok have both done so in the last 18 months), the URL is the migration anchor. Stripping it is an unforced error.

04 · The opt-in pathway

Customer ads. Explicit opt-in only.

The only pathway into the benchmark that involves a customer's own creative is the Score-My-Ad opt-in. It is the route by which the benchmark grows from real, live ads beyond what the public transparency surfaces cover. It is also the route most likely to be misunderstood by a buyer who has worked with a vendor that quietly ingested everything.

The default for any customer upload is private. The ad is scored, the report is returned to the customer, the asset is staged in a Supabase Storage bucket in the EU region during processing, and the staging row is deleted at the end of the scoring run. No part of that flow touches the benchmark pool. The customer's report is theirs.

The benchmark only grows from a customer ad when the customer affirmatively flips the opt-in flag on their project: checking the "contribute to the benchmark" box, agreeing to the licence terms, and accepting that the ad will be stored with full provenance metadata under their advertiser entity. The flag is stored as opt_in_flag = true on the row, so any future audit immediately knows the row entered via consent rather than via a public surface. We never flip the flag silently. We never ingest by default. We never re-prompt the customer after a project has closed.

Default state. Customer upload → scored → returned → staging deleted. opt_in_flag is never set by default.
Affirmative opt-in. Customer toggles the contribution flag on the project, accepts the licence terms in plain English (not a dark pattern), and the row enters the next pool version, never the current one.
Per-asset revocability. Opt-in is revocable at any time. Revocation keeps the row out of the next pool version, and the removal is applied to the canonical row and any downstream report cards within 30 days.
No retroactive ingestion. Ads that scored before the opt-in pathway existed (pre-2026-04 cohorts) never enter the benchmark. The pathway is forward-only.
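The four rules above can be read as a single eligibility check. The sketch below is one hedged interpretation, not the production logic: the CustomerAd type, the eligible_for_pool function, the idea of a pool "freeze date", and reading "pre-2026-04" as "scored before 1 April 2026" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: "pre-2026-04 cohorts" is read as "scored before 1 April 2026".
PATHWAY_START = date(2026, 4, 1)

@dataclass
class CustomerAd:
    scored_on: date
    opted_in_on: Optional[date] = None   # None is the default private state
    revoked_on: Optional[date] = None

def eligible_for_pool(ad: CustomerAd, pool_cutoff: date) -> bool:
    """True if the ad may be included in a pool version frozen on pool_cutoff."""
    if ad.opted_in_on is None:
        return False   # default state: private, never ingested
    if ad.scored_on < PATHWAY_START:
        return False   # forward-only: no retroactive ingestion
    if ad.opted_in_on > pool_cutoff:
        return False   # opted in after this snapshot froze: waits for the next version
    if ad.revoked_on is not None and ad.revoked_on <= pool_cutoff:
        return False   # consent withdrawn before the snapshot
    return True

cutoff = date(2026, 6, 1)  # hypothetical freeze date for a future pool version
private_ad = CustomerAd(scored_on=date(2026, 5, 2))
opted_in_ad = CustomerAd(scored_on=date(2026, 5, 2), opted_in_on=date(2026, 5, 3))
revoked_ad = CustomerAd(scored_on=date(2026, 5, 2), opted_in_on=date(2026, 5, 3),
                        revoked_on=date(2026, 5, 20))
legacy_ad = CustomerAd(scored_on=date(2026, 3, 15), opted_in_on=date(2026, 5, 3))

for ad in (private_ad, opted_in_ad, revoked_ad, legacy_ad):
    print(eligible_for_pool(ad, cutoff))
```

Ordering the checks from "no consent" downwards keeps the default path the shortest one: an ad with no opt-in date never reaches the later conditions.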

House rule

"The default for a customer ad is invisible. The benchmark grows only from rows where the customer said yes in plain English. No exceptions. No silent ingestion. No dark-pattern opt-outs."

05 · Five hard questions

Ask any vendor. Including us.

If you are evaluating a creative-testing vendor, including SaliencyLab, these are the five questions whose answers separate a real provenance practice from a marketing line. We can answer all five on this page. The point of publishing them is to make it expensive for any vendor to dodge them.

Q.01

Show me your provenance log for one ad.

Pick one ad in the benchmark, any ad, and ask the vendor to return the source_url, capture_date, licence_class and capture_method. If the answer is "we can't share that" or "it's proprietary," the row does not exist. Provenance is either auditable or it is fiction. We publish a worked example in chapter 3 of this page.
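A buyer running this test on an exported row could automate the check. The sketch below is a minimal illustration: the function name and the rule that an empty value counts as missing are assumptions, and the sample values are taken from the worked example in chapter 3.

```python
# The four fields named in Q.01.
REQUIRED_PROVENANCE_FIELDS = ("source_url", "capture_date", "licence_class", "capture_method")

def missing_provenance(row: dict) -> list:
    """List the provenance fields that are absent or empty on a row."""
    return [field for field in REQUIRED_PROVENANCE_FIELDS if not row.get(field)]

# Values copied from the worked example in chapter 3.
traced_row = {
    "ad_id": "slbm_2026_05_01492",
    "source_url": "https://ads.tiktok.com/business/creativecenter/topads/br/pc/en?period=30&region=BR&industry=retail",
    "capture_date": "2026-04-22T09:14:00Z",
    "licence_class": "public_transparency",
    "capture_method": "public_endpoint_documented",
}
# Hypothetical row with its source URL stripped, to show a failing case.
stripped_row = {**traced_row, "source_url": ""}

print(missing_provenance(traced_row))
print(missing_provenance(stripped_row))
```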

Q.02

What is your licence-class breakdown across the pool?

"Public API," "public transparency surface," "commissioned material," "scraped": every ad belongs in one of these classes. A vendor who cannot break their pool down by licence class is a vendor who does not know what is in it. Ours, from the per-source counts in chapter 2: ~91% public-transparency / official-API, ~9% documented-provenance editorial, 0% scraped.
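A breakdown like this can be recomputed from the per-source counts in the chapter 2 table. The sketch below treats those approximate counts as exact and groups them by the licence-class labels used in that table; the function name is an assumption for illustration.

```python
from collections import Counter

# Per-source counts from the chapter 2 table, treated as exact for this sketch.
POOL_SOURCES = [
    ("Meta Ad Library API", "Official API", 440),
    ("TikTok Creative Center", "Public transparency", 700),
    ("TikTok Ad Library", "Public transparency", 310),
    ("Google Ads Transparency Center", "Public disclosure", 403),
    ("Manual editorial curation", "Documented provenance", 194),
]

def licence_breakdown(sources):
    """Return (total_ads, {licence_class: share_in_percent}) for the pool."""
    counts = Counter()
    for _name, licence_class, ads in sources:
        counts[licence_class] += ads
    total = sum(counts.values())
    shares = {cls: round(100 * n / total, 1) for cls, n in counts.items()}
    return total, shares

total, shares = licence_breakdown(POOL_SOURCES)
print(total)
for licence_class, pct in shares.items():
    print(f"{licence_class}: {pct}%")
```

Running the same function over any vendor's published source table is a quick way to check that the headline percentages and the row counts agree.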

Q.03

How does a customer's ad enter your benchmark, by default or by opt-in?

If the answer is "we anonymise it" or "we use it to improve the model" without a hard opt-in flag, the vendor is treating customer creatives as training data by default. That is the procurement red flag. The only acceptable answer is: explicit opt-in, revocable, never retroactive. We are explicit on that in chapter 4.

Q.04

Where do the ads and audit logs physically live?

Data-residency posture changes everything for an EU procurement team. Ours: Supabase Postgres in Frankfurt, Vertex AI processing in EU regions, Cloud Run workers in EU regions, customer assets in EU staging buckets deleted post-run. If a vendor cannot name regions, the answer is "wherever AWS put it."

Q.05

If I send a takedown for one ad, how long until it is gone, from the pool and from every downstream report?

A real audit log enables a real takedown. We honour removals within 30 days, both on the canonical row and on any downstream report cards that reference the row. A vendor whose pool is monolithic (no per-row provenance, no immutable versioning) cannot execute a clean takedown even if they want to.
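What a per-row takedown needs can be sketched as follows: a deadline derived from the request date and the list of report cards that reference the row. The takedown_plan function, the report-to-ad mapping, and the report IDs are hypothetical; only the 30-day window comes from this page.

```python
from datetime import date, timedelta

REMOVAL_WINDOW = timedelta(days=30)  # the 30-day commitment stated on this page

def takedown_plan(ad_id, requested_on, reports):
    """Return the removal deadline and the report IDs that reference the ad.

    `reports` maps a report ID to the set of ad_ids it references
    (a hypothetical structure used only for this sketch).
    """
    affected = sorted(rid for rid, ad_ids in reports.items() if ad_id in ad_ids)
    return {
        "ad_id": ad_id,
        "deadline": requested_on + REMOVAL_WINDOW,
        "reports": affected,
    }

# Hypothetical report cards referencing pool rows.
reports = {
    "report_b": {"slbm_2026_05_01492", "other_row"},
    "report_a": {"slbm_2026_05_01492"},
    "report_c": {"other_row"},
}
plan = takedown_plan("slbm_2026_05_01492", date(2026, 6, 1), reports)
print(plan["deadline"].isoformat(), plan["reports"])
```

The lookup only works because each report records which rows it drew on, which is the practical reason per-row provenance and pinned pool versions matter for removals.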

The five questions are not a competitive moat; they are a baseline. We expect them to become standard procurement language for creative-testing vendors over the next 24 months as data-protection regulators catch up to AI-scored marketing data. Publishing them now is the cheapest way to raise the floor for everyone.

06 · Compliance disclosures

GDPR. CCPA. Residency.

The compliance posture is shaped by one simple fact: the benchmark is constructed from advertising creative, not personal data. The corporate brand assets in an ad are not PII. That distinction governs almost every regulatory question that follows.

GDPR, EU-sourced ads

Ads ingested from Meta Ad Library API for EU markets, TikTok Ad Library for EU markets, and Google Ads Transparency Center for EU markets are creative materials published by advertisers to public audiences via the platforms' regulator-mandated transparency surfaces. They are not personal data in the GDPR sense. Where a person appears in the creative (a presenter, a model, a creator), that appearance is governed by the advertiser's release with that individual at the moment the ad was produced. SaliencyLab does not re-publish the asset; we score it and retain a source_url back to the canonical transparency entry. Removal requests against our row are honoured within 30 days.

For customer-opt-in rows where the brand is the SaliencyLab customer, the contractual basis is the data processing agreement signed at sign-up. The customer is the data controller for their own creative. SaliencyLab is the processor. The agreement names sub-processors (Supabase, Google Cloud) and the EU regions used. Standard contractual clauses are in place where any sub-processor touches data outside the EU. No part of our processing pipeline routinely does.

CCPA, US-sourced ads

For ads sourced from US Meta Ad Library API or Google Ads Transparency US surfaces, the same logic applies: they are advertising creative published by advertisers to a US public audience, not personal data covered by the CCPA. For US-based customers using Score-My-Ad, the DPA includes CCPA-aligned terms for the processing of any incidental personal data that might appear in customer creative (e.g. a US presenter's likeness in an opted-in ad). Customer rights (to know, to delete, to opt out of sale) are addressed by the standing privacy notice and operationalised through the in-product project deletion flow.

Data residency

EU-first by default. Supabase Postgres in eu-central-1 (Frankfurt). Vertex AI inference in EU regions. Cloud Run video-frame workers in EU regions. Audit log rows and benchmark metadata never leave EU residency. Customer assets are staged in EU Supabase Storage during scoring and deleted from the staging bucket at the end of the run; the row in the audit log persists, the staged asset does not. The choice is deliberate: the majority of brand-side procurement teams ask "where does the data live" on the first call, and the answer is the answer.

The simple line

"The benchmark is built from advertising creative published to public audiences via platform transparency surfaces. That is not personal data. Customer creatives enter the benchmark only by explicit opt-in. Audit logs are EU-resident. Removal requests are honoured within 30 days."

07 · The audit log

The schema we store per ad.

The 22-field audit row is the single artifact that turns "we do not scrape" from a marketing claim into an auditable practice. Below is the schema in full: every field, its type, and why it exists.

Field	Type	Why it exists
ad_id	uuid	Stable primary key, survives platform URL changes and pool version migrations.
brand	text	Human-readable brand name as published on the source surface.
advertiser_id	text	Platform-assigned advertiser identifier, enables aggregation across cuts from the same advertiser.
platform	enum	One of: meta, tiktok, youtube, google_display. Drives benchmark-cohort selection at scoring time.
market	iso-3166	Market the ad ran in, basis for region-controlled benchmark percentiles.
language	bcp-47	Language of the creative, basis for language-controlled benchmark percentiles.
category	enum	20-class taxonomy (Beauty, Food & Beverage, Retail, Auto, etc.), basis for category cohort.
subcategory	enum	Fine-grained slice within the category, basis for tight cohort matching.
format	text	Aspect ratio + duration band, basis for format-controlled cohorts.
source	enum	One of the five pathways defined in chapter 2, the licence class root.
source_url	url	Canonical link back to the transparency-surface entry, the chain of custody.
licence_class	enum	public_api / public_transparency / documented_provenance / opt_in. Drives the licensed-use surface.
capture_date	timestamptz	Moment the row was ingested, basis for temporal audits and freshness checks.
capture_method	enum	Documented endpoint, manual editorial entry, or customer opt-in submission.
opt_in_flag	bool	True only when the row is a customer-contributed ad with affirmative opt-in. False for every public-surface row.
public_engagement_signal	text	CTR percentile band / view count band / engagement quintile, the outcome ground for OOS validation.
benchmark_pool_version	text	Immutable pool snapshot the row belongs to. Pinned per report.
model_version	text	Scoring-model version applied to the row, required for reproducibility of any historic report.
scoring_run_id	uuid	Backlink to the scoring run that produced the row's stored scores, full re-traceability.
confidence_label	enum	high / medium / low, pipeline's own honesty signal on attribute detection.
sampling_band	text	Where in the sampling distribution the row sat at capture, guards against silent cohort drift.
residency	enum	Physical region the row is stored in. Default eu-frankfurt.

The 22 fields are not aspirational. They are the actual columns on the benchmark_pool table in the SaliencyLab production database. The dataset paper (Paper 05, target Scientific Data) publishes this exact schema as part of the open release under CC-BY-NC. After that paper, the audit log is no longer a vendor claim; it is a citation.
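For readers who want to work with exported rows, the schema table can be mirrored as a typed record. This is a sketch, not the production table definition: the BenchmarkPoolRow class name is an assumption, and enum, uuid, url and timestamp columns are all carried as plain strings, with the documented type kept in a comment.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class BenchmarkPoolRow:
    """One audit row, mirroring the field names in the schema table."""
    ad_id: str                      # uuid
    brand: str                      # text
    advertiser_id: str              # text
    platform: str                   # enum: meta, tiktok, youtube, google_display
    market: str                     # iso-3166
    language: str                   # bcp-47
    category: str                   # enum
    subcategory: str                # enum
    format: str                     # text
    source: str                     # enum
    source_url: str                 # url
    licence_class: str              # enum
    capture_date: str               # timestamptz, kept as an ISO-8601 string here
    capture_method: str             # enum
    opt_in_flag: bool               # bool
    public_engagement_signal: str   # text
    benchmark_pool_version: str     # text
    model_version: str              # text
    scoring_run_id: str             # uuid
    confidence_label: str           # enum: high / medium / low
    sampling_band: str              # text
    residency: str                  # enum

FIELD_NAMES = [f.name for f in fields(BenchmarkPoolRow)]
print(len(FIELD_NAMES))
```

Freezing the dataclass is a deliberate choice for an audit record: a row read from an immutable pool snapshot should not be mutated in place by downstream code.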

The closing line

"The score is infrastructure. The benchmark is the product. The audit log is the licence. Without all three, you do not have a creative-testing vendor, you have a number, and a story about the number."

Questions we keep getting

Asked, plainly.

Does SaliencyLab scrape ads from social platforms?

No. Every one of the 2,047 ads in pool v.2026-05 comes either from an official transparency surface (Meta Ad Library API, TikTok Creative Center, TikTok Ad Library, Google Ads Transparency Center) or from manual editorial curation with documented provenance. Scraping breaks platform terms of service and creates downstream risk for the customer the benchmark serves. We do not do it. Every ad in the pool stores its source_url, capture_date, and licence_class so the provenance is auditable per row.

How does a customer's uploaded ad enter the benchmark?

Only with explicit opt-in via the Score-My-Ad pathway. The default for any customer upload is private: the ad is scored, the report is returned, the asset is staged in a Supabase Storage bucket in the EU region during processing, and the staging row is deleted at the end of the scoring run. The benchmark grows only from rows where opt_in_flag = true. There is no silent ingestion. There is no dark-pattern opt-out. The flag is part of the audit log so anyone reading the row knows which pathway it entered through.

What licence class covers ads pulled from Meta Ad Library API?

Meta Ad Library API is an official public API operated by Meta under their Terms for Ads About Social Issues, Elections or Politics and the broader Ad Library transparency programme. Programmatic access via the documented API is the licensed pathway. We ingest advertiser metadata, ad format, region, and creative URL via the API; we do not bypass rate limits and we do not scrape the web interface. Captured rows record source_url pointing back to the canonical Ad Library entry.

Where are the ad assets and audit logs stored?

Audit log rows and benchmark metadata live in Supabase Postgres in the EU region (Frankfurt). Vertex AI processing happens in Google Cloud EU regions. Cloud Run workers for video frame extraction run in EU regions. Customer assets are staged in Supabase Storage in the EU during the scoring run and deleted from the staging bucket after the pipeline completes. The data-residency choice is deliberate: EU-first, because the majority of brand-side procurement teams ask for it on the first call.

How does SaliencyLab handle GDPR for EU-sourced public ads?

The ads in the benchmark are advertising creative, not personal data: they were published by advertisers for public audiences via platform transparency surfaces, and contain corporate brand assets, not consumer PII. Where a human appears in a creative (a presenter, a model, a UGC creator), the appearance is governed by the original advertiser's release with that individual, not by SaliencyLab. We do not re-publish the asset; we score it and store a source_url back to the canonical transparency entry. Removal requests are honoured within 30 days, on the canonical row and on any downstream report cards.

What is the difference between TikTok Creative Center and TikTok Ad Library?

Both are public transparency surfaces run by TikTok and they cover different cohorts. TikTok Creative Center surfaces top-performing ads with public CTR percentile bands, engagement signals and view counts, useful for outcome calibration. TikTok Ad Library is the broader transparency obligation surface across all running ads in a given market. We ingest from both and tag each row with its source so the licence class and the outcome-signal availability are explicit per ad.
