Methodology · published 18 May 2026 · 7 min read

We do not scrape.
Here is the proof.

Every ad in the SaliencyLab benchmark pool comes from an official transparency surface or from manual editorial curation with documented provenance. Per-ad audit log, licence-class breakdown, customer opt-in pathway, GDPR and CCPA posture: all on one page, all auditable on request.

5 public sources · 2,047 ads in pool · 0 scraped · 22 metadata fields per ad · v.2026-05 pinned
Oussama Nakhil · Founder & CEO
Multiple years buyer-side: NielsenIQ insights, then L'Oréal Groupe in global marketing insights, where vendor procurement, data-licensing terms, and brand-safety audits were part of every quarter. This page is written from the seat of the buyer who used to ask these questions.
0 · Scraped ads in the pool
22 · Metadata fields stored per ad
EU · Default data residency, Frankfurt + EU regions

The question after the demo.

Almost every procurement conversation after a strong product demo arrives at the same question, usually phrased politely: "how exactly did you build the benchmark, and is anything in it going to embarrass us six months from now?"

The reason the question matters is not academic. A creative-testing benchmark sits inside a marketing-data supply chain that already includes a media agency, a creative agency, a measurement vendor, and a brand-safety vendor. A new tool in that stack inherits the legal posture of its weakest input. If a vendor scraped a platform to populate its benchmark, the customer who scores their ads against that benchmark inherits exposure to any action the platform takes against the vendor. Procurement teams know this. So do the legal teams that sign off on the SOW.

The honest defense is not "trust us." The honest defense is a per-row audit log. Every ad in pool v.2026-05 carries a source_url, a capture_date, a licence_class, an opt_in_flag, and a benchmark_pool_version. That is the row a procurement team should be able to ask to see, for one ad, for ten ads, for any ad they care about. We can produce it.

The procurement test

"If a vendor cannot point at one ad in their pool and show you the source URL, the capture date, and the licence class, the benchmark is not a benchmark. It is a marketing slide with a number on it."

Five sources. All licensed.

Every ad in the 2,047-row pool enters through one of five pathways. Four are official platform transparency surfaces. The fifth is editorial curation with documented provenance from public, opted-in, or commissioned material. We argue every pathway from its licence terms, not from convenience.

Source | Licence class | What we use it for | What we never use it for | Ads
Meta Ad Library API | Official API | Brand-ad cohort, advertiser metadata, ad format, creative URL via documented endpoints | HTML-scraping the Ad Library web UI, bypassing rate limits, joining to non-public user data | ~440
TikTok Creative Center | Public transparency | Top-ads cohort, CTR percentile bands, view counts, engagement signals for outcome calibration | Reposting creator videos outside the transparency frame, monetising the assets, joining to creator-private data | ~700
TikTok Ad Library | Public transparency | Broader all-ads cohort, regional and language coverage, transparency-obligation surface | Mass enumeration outside the published surface, scraping logged-out content not exposed by the library | ~310
Google Ads Transparency Center | Public disclosure | YouTube In-Stream + Shorts cohort, advertiser disclosure metadata, ad format and region | Joining to Google Ads account data, re-uploading the asset for commercial republishing | ~403
Manual editorial curation | Documented provenance | Real cuts scored through the live pipeline, with per-ad provenance log captured at the moment of ingestion | Anonymous uploads with no source, customer assets without opt-in flag, anything failing the provenance gate | ~194

Notice the structure of the table. Each row has a "what we never use it for" column, and that is the column that matters in a procurement conversation. A vendor who only publishes what they ingest has no observable line; a vendor who publishes both sides of the line has an auditable practice.

The four platform surfaces exist because regulators told the platforms to publish them. Meta launched the Ad Library in response to political-ad transparency obligations; TikTok and Google followed for the same reason. Using the transparency surfaces is the legally intended pathway: they were built so a third party could programmatically inspect what ads are running. That is what we do. Nothing more.
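As a quick sanity check, the approximate per-source counts in the table can be tallied by licence class. This is a minimal sketch, not the production tooling: the dictionary below simply copies the table's figures, and the function name `licence_breakdown` is a hypothetical helper introduced for illustration.

```python
# Approximate per-source ad counts, copied from the table above.
# The structure and names here are illustrative, not the production schema.
SOURCE_COUNTS = {
    "Meta Ad Library API": ("Official API", 440),
    "TikTok Creative Center": ("Public transparency", 700),
    "TikTok Ad Library": ("Public transparency", 310),
    "Google Ads Transparency Center": ("Public disclosure", 403),
    "Manual editorial curation": ("Documented provenance", 194),
}


def licence_breakdown(counts):
    """Return (total, {licence_class: (ads, share_of_pool)})."""
    total = sum(n for _, n in counts.values())
    per_class = {}
    for licence_class, n in counts.values():
        per_class[licence_class] = per_class.get(licence_class, 0) + n
    return total, {k: (n, n / total) for k, n in per_class.items()}


total, breakdown = licence_breakdown(SOURCE_COUNTS)
print(f"pool total: {total}")
for licence_class, (n, share) in breakdown.items():
    print(f"{licence_class:<22} {n:>5}  {share:6.1%}")
```

With the table's figures, the five rows add up to the 2,047 ads quoted for the pool.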

A single row, fully traced.

An abstract claim is a marketing line. A worked example is an audit. Below is the actual metadata structure that ships with one ad in pool v.2026-05, a Magalu (Brazil) e-commerce cut from Q1 2026, sourced via TikTok Creative Center. This is the row a procurement team can ask to see.

// benchmark_pool.row · v.2026-05 · row_id 01HQ9V…
{
  "ad_id": "slbm_2026_05_01492",
  "brand": "Magazine Luiza (Magalu)",
  "advertiser_id": "tt_creative_center_magalu_br",
  "platform": "tiktok",
  "market": "BR",
  "language": "pt-BR",
  "category": "retail · e-commerce",
  "format": "vertical 9:16 · 22s",
  "source": "tiktok_creative_center",
  "source_url": "https://ads.tiktok.com/business/creativecenter/topads/br/pc/en?period=30&region=BR&industry=retail",
  "licence_class": "public_transparency",
  "opt_in_flag": false,
  "capture_date": "2026-04-22T09:14:00Z",
  "capture_method": "public_endpoint_documented",
  "public_engagement_signal": "ctr_pct_band_top10",
  "benchmark_pool_version": "v.2026-05",
  "model_version": "gemini-2.5-flash · scoring-v3.1",
  "scoring_run_id": "run_01HRD3A1Q2C…",
  "confidence_label": "high",
  "sampling_band": "BR · retail · q1-2026",
  "residency": "eu-frankfurt"
}

Every field in that row is queryable. A buyer can ask us to filter the pool to "all BR retail ads captured between March and May 2026 with confidence_label = high" and we can return the list with all provenance fields attached. The row above is one of 2,047. The procurement defense is not "the pool is clean"; it is "every row is traceable to a public URL captured at a recorded moment under a stated licence class." Those are different sentences. The second one is auditable.
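The filter described above can be sketched in a few lines. This is an illustration over an in-memory list, not the production query path: the first row reuses values from the worked example, the second row is hypothetical and exists only so the filter has something to reject, and `filter_pool` and `parse_ts` are names invented for this sketch.

```python
from datetime import datetime, timezone

# Row 1 reuses values from the worked example; row 2 is hypothetical.
POOL = [
    {"ad_id": "slbm_2026_05_01492", "market": "BR",
     "category": "retail · e-commerce", "licence_class": "public_transparency",
     "capture_date": "2026-04-22T09:14:00Z", "confidence_label": "high"},
    {"ad_id": "hypothetical_row_2", "market": "BR",
     "category": "retail · e-commerce", "licence_class": "public_transparency",
     "capture_date": "2026-02-10T12:00:00Z", "confidence_label": "medium"},
]


def parse_ts(value):
    """Parse an ISO-8601 UTC timestamp with a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def filter_pool(rows, market, category_prefix, start, end, confidence):
    """Return full rows (all fields attached) matching the buyer's filter."""
    return [
        r for r in rows
        if r["market"] == market
        and r["category"].startswith(category_prefix)
        and start <= parse_ts(r["capture_date"]) < end
        and r["confidence_label"] == confidence
    ]


start = datetime(2026, 3, 1, tzinfo=timezone.utc)
end = datetime(2026, 6, 1, tzinfo=timezone.utc)  # exclusive, so March through May
matches = filter_pool(POOL, "BR", "retail", start, end, "high")
print([r["ad_id"] for r in matches])
```

Because the filter returns whole rows rather than IDs, every provenance field travels with each match.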

Why the source_url is non-negotiable

A common shortcut in legacy benchmark construction is to strip the source URL once the asset is ingested, to keep the pool "clean" of platform-specific metadata. We do the opposite. The source URL is kept, indexed, and queryable. When a customer wants to verify provenance on a specific ad, the URL is the chain of custody. When a platform changes its transparency-surface structure (Meta and TikTok have both done so in the last 18 months), the URL is the migration anchor. Stripping it is an unforced error.

Customer ads. Explicit opt-in only.

The fifth pathway into the benchmark, and the only one that involves a customer's own creative, is the Score-My-Ad opt-in. It is the route by which the benchmark grows from real, live ads beyond what the public transparency surfaces cover. It is also the route most likely to be misunderstood by a buyer who has worked with a vendor that quietly ingested everything.

The default for any customer upload is private. The ad is scored, the report is returned to the customer, the asset is staged in a Supabase Storage bucket in the EU region during processing, and the staged asset is deleted at the end of the scoring run. No part of that flow touches the benchmark pool. The customer's report is theirs.

The benchmark only grows from a customer ad when the customer affirmatively flips the opt-in flag on their project, checking the "contribute to the benchmark" box, agreeing to the licence terms, and accepting that the ad will be stored with full provenance metadata under their advertiser entity. The flag is stored as opt_in_flag = true on the row, so any future audit immediately knows the row entered via consent rather than via a public surface. We never flip the flag silently. We never ingest by default. We never re-prompt the customer after a project has closed.

  • Default state. Customer upload → scored → returned → staging deleted. opt_in_flag is never set by default.
  • Affirmative opt-in. Customer toggles the contribution flag on the project, accepts the licence terms in plain English (not a dark pattern), and the row enters the next pool version, never the current one.
  • Per-asset revocability. Opt-in is revocable at any time. Revocation removes the row from the next pool version; removal of the canonical row and of any downstream report cards that reference it is completed within 30 days.
  • No retroactive ingestion. Ads that scored before the opt-in pathway existed (pre-2026-04 cohorts) never enter the benchmark. The pathway is forward-only.

House rule

"The default for a customer ad is invisible. The benchmark grows only from rows where the customer said yes in plain English. No exceptions. No silent ingestion. No dark-pattern opt-outs."
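The four rules above amount to a small state model. The sketch below is one way to express it, under stated assumptions: the class and function names are hypothetical, and `PATHWAY_START` reads "pre-2026-04 cohorts" as "scored before 1 April 2026", which is an interpretation rather than a documented cut-off.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: "pre-2026-04 cohorts" means scored before 1 April 2026.
PATHWAY_START = date(2026, 4, 1)


@dataclass
class CustomerAd:
    ad_id: str
    scored_on: date
    opt_in_flag: bool = False  # default state: private, never in the pool


def opt_in(ad, licence_accepted):
    """Set the flag only on an affirmative, forward-only request."""
    if not licence_accepted:
        raise ValueError("licence terms must be accepted")
    if ad.scored_on < PATHWAY_START:
        raise ValueError("no retroactive ingestion")
    ad.opt_in_flag = True


def revoke(ad):
    """Revocation is always allowed and clears the flag."""
    ad.opt_in_flag = False


def next_pool_candidates(ads):
    """Customer rows eligible for the NEXT pool version, never the current one."""
    return [a.ad_id for a in ads if a.opt_in_flag]


ads = [
    CustomerAd("a1", date(2026, 5, 2)),
    CustomerAd("a2", date(2026, 5, 3)),
    CustomerAd("old", date(2026, 2, 1)),
]
opt_in(ads[0], licence_accepted=True)
opt_in(ads[1], licence_accepted=True)
revoke(ads[1])
try:
    opt_in(ads[2], licence_accepted=True)
except ValueError as exc:
    print("rejected:", exc)
print(next_pool_candidates(ads))
```

The point of the design is that nothing reaches the pool unless `opt_in` succeeded, and the only route into `opt_in` is an explicit request with the licence accepted.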

Ask any vendor. Including us.

If you are evaluating a creative-testing vendor, including SaliencyLab, these are the five questions whose answers separate a real provenance practice from a marketing line. We can answer all five on this page. The point of publishing them is to make it expensive for any vendor to dodge them.

Q.01
Show me your provenance log for one ad.
Pick one ad in the benchmark, any ad, and ask the vendor to return the source_url, capture_date, licence_class and capture_method. If the answer is "we can't share that" or "it's proprietary," the row does not exist. Provenance is either auditable or it is fiction. We publish a worked example in chapter 3 of this page.
Q.02
What is your licence-class breakdown across the pool?
"Public API," "public transparency surface," "commissioned material," "scraped": every ad belongs in one of these classes. A vendor who cannot break their pool down by licence class is a vendor who does not know what is in it. Ours: ~90% public-transparency / official-API, ~10% documented-provenance editorial, 0% scraped.
Q.03
How does a customer's ad enter your benchmark, by default or by opt-in?
If the answer is "we anonymise it" or "we use it to improve the model" without a hard opt-in flag, the vendor is treating customer creatives as training data by default. That is the procurement red flag. The only acceptable answer is: explicit opt-in, revocable, never retroactive. We are explicit on that in chapter 4.
Q.04
Where do the ads and audit logs physically live?
Data-residency posture changes everything for an EU procurement team. Ours: Supabase Postgres in Frankfurt, Vertex AI processing in EU regions, Cloud Run workers in EU regions, customer assets in EU staging buckets deleted post-run. If a vendor cannot name regions, the answer is "wherever AWS put it."
Q.05
If I send a takedown for one ad, how long until it is gone, from the pool and from every downstream report?
A real audit log enables a real takedown. We honour removals within 30 days, covering the canonical row and any downstream report cards that reference the row. A vendor whose pool is monolithic (no per-row provenance, no immutable versioning) cannot execute a clean takedown even if they want to.

The five questions are not a competitive moat; they are a baseline. We expect them to become standard procurement language for creative-testing vendors over the next 24 months as data-protection regulators catch up to AI-scored marketing data. Publishing them now is the cheapest way to raise the floor for everyone.
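To make Q.05 concrete, here is a minimal sketch of what per-row provenance lets a takedown do: find the canonical row, list every report card that references it, and compute the 30-day deadline. The report-card structure (`report_id`, `referenced_ad_ids`), the placeholder URLs, and the function name `plan_takedown` are all assumptions made for illustration.

```python
from datetime import date, timedelta

REMOVAL_WINDOW = timedelta(days=30)  # "within 30 days", per the text


def plan_takedown(ad_id, received_on, pool_rows, report_cards):
    """List what a takedown for one ad must touch, and by when."""
    row = next((r for r in pool_rows if r["ad_id"] == ad_id), None)
    if row is None:
        raise KeyError(f"{ad_id} is not in the pool")
    affected = [
        card["report_id"]
        for card in report_cards
        if ad_id in card["referenced_ad_ids"]
    ]
    return {
        "ad_id": ad_id,
        "source_url": row["source_url"],
        "affected_reports": affected,
        "deadline": received_on + REMOVAL_WINDOW,
    }


# Hypothetical data: placeholder rows and report cards.
pool_rows = [
    {"ad_id": "ad_1", "source_url": "https://example.com/ad/1"},
    {"ad_id": "ad_2", "source_url": "https://example.com/ad/2"},
]
report_cards = [
    {"report_id": "rep_1", "referenced_ad_ids": ["ad_1", "ad_2"]},
    {"report_id": "rep_2", "referenced_ad_ids": ["ad_2"]},
    {"report_id": "rep_3", "referenced_ad_ids": ["ad_1"]},
]

plan = plan_takedown("ad_1", date(2026, 6, 1), pool_rows, report_cards)
print(plan)
```

Without a stable `ad_id` on both the pool row and the report cards, the `affected_reports` list cannot be built, which is the sense in which a monolithic pool blocks a clean takedown.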

GDPR. CCPA. Residency.

The compliance posture is shaped by one simple fact: the benchmark is constructed from advertising creative, not personal data. The corporate brand assets in an ad are not PII. That distinction governs almost every regulatory question that follows.

GDPR, EU-sourced ads

Ads ingested from Meta Ad Library API for EU markets, TikTok Ad Library for EU markets, and Google Ads Transparency Center for EU markets are creative materials published by advertisers to public audiences via the platforms' regulator-mandated transparency surfaces. They are not personal data in the GDPR sense. Where a person appears in the creative (a presenter, a model, a creator), that appearance is governed by the advertiser's release with that individual at the moment the ad was produced. SaliencyLab does not re-publish the asset; we score it and retain a source_url back to the canonical transparency entry. Removal requests against our row are honoured within 30 days.

For customer-opt-in rows where the brand is the SaliencyLab customer, the contractual basis is the data processing agreement signed at sign-up. The customer is the data controller for their own creative. SaliencyLab is the processor. The agreement names sub-processors (Supabase, Google Cloud) and the EU regions used. Standard contractual clauses are in place where any sub-processor touches data outside the EU. No part of our processing pipeline routinely does.

CCPA, US-sourced ads

For ads sourced from US Meta Ad Library API or Google Ads Transparency US surfaces, the same logic applies: they are advertising creative published by advertisers to a US public audience, not personal data covered by the CCPA. For US-based customers using Score-My-Ad, the DPA includes CCPA-aligned terms for the processing of any incidental personal data that might appear in customer creative (e.g. a US presenter's likeness in an opted-in ad). Customer rights (to know, to delete, to opt out of sale) are addressed by the standing privacy notice and operationalised through the in-product project deletion flow.

Data residency

EU-first by default. Supabase Postgres in eu-central-1 (Frankfurt). Vertex AI inference in EU regions. Cloud Run video-frame workers in EU regions. Audit log rows and benchmark metadata never leave EU residency. Customer assets are staged in EU Supabase Storage during scoring and deleted from the staging bucket at the end of the run; the row in the audit log persists, the staged asset does not. The choice is deliberate: the majority of brand-side procurement teams ask "where does the data live" on the first call, and the answer is the answer.

The simple line

"The benchmark is built from advertising creative published to public audiences via platform transparency surfaces. That is not personal data. Customer creatives enter the benchmark only by explicit opt-in. Audit logs are EU-resident. Removal requests are honoured within 30 days."

The schema we store per ad.

The 22-field audit row is the single artifact that turns "we do not scrape" from a marketing claim into an auditable practice. Below is the schema in full: every field, its type, and why it exists.

Field | Type | Why it exists
ad_id | uuid | Stable primary key, survives platform URL changes and pool version migrations.
brand | text | Human-readable brand name as published on the source surface.
advertiser_id | text | Platform-assigned advertiser identifier, enables aggregation across cuts from the same advertiser.
platform | enum | One of: meta, tiktok, youtube, google_display. Drives benchmark-cohort selection at scoring time.
market | iso-3166 | Market the ad ran in, basis for region-controlled benchmark percentiles.
language | bcp-47 | Language of the creative, basis for language-controlled benchmark percentiles.
category | enum | 20-class taxonomy (Beauty, Food & Beverage, Retail, Auto, etc.), basis for category cohort.
subcategory | enum | Fine-grained slice within the category, basis for tight cohort matching.
format | text | Aspect ratio + duration band, basis for format-controlled cohorts.
source | enum | One of the five pathways defined in chapter 2, the licence class root.
source_url | url | Canonical link back to the transparency-surface entry, the chain of custody.
licence_class | enum | public_api / public_transparency / documented_provenance / opt_in. Drives the licensed-use surface.
capture_date | timestamptz | Moment the row was ingested, basis for temporal audits and freshness checks.
capture_method | enum | Documented endpoint, manual editorial entry, or customer opt-in submission.
opt_in_flag | bool | True only when the row is a customer-contributed ad with affirmative opt-in. False for every public-surface row.
public_engagement_signal | text | CTR percentile band / view count band / engagement quintile, the outcome ground for OOS validation.
benchmark_pool_version | text | Immutable pool snapshot the row belongs to. Pinned per report.
model_version | text | Scoring-model version applied to the row, required for reproducibility of any historic report.
scoring_run_id | uuid | Backlink to the scoring run that produced the row's stored scores, full re-traceability.
confidence_label | enum | high / medium / low, pipeline's own honesty signal on attribute detection.
sampling_band | text | Where in the sampling distribution the row sat at capture, guards against silent cohort drift.
residency | enum | Physical region the row is stored in. Default eu-frankfurt.

The 22 fields are not aspirational. They are the actual columns on the benchmark_pool table in the SaliencyLab production database. The dataset paper (Paper 05, target Scientific Data) publishes this exact schema as part of the open release under CC-BY-NC. After that paper, the audit log is no longer a vendor claim; it is a citation.
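A schema like this is easiest to trust when a row can be checked against it mechanically. The sketch below is a hypothetical validator, not the production gate: the field names and enum values are taken from the table, while the function `validate_row` and the choice to treat public_api and public_transparency as the "public-surface" classes are assumptions.

```python
# Field names copied from the schema table.
REQUIRED_FIELDS = (
    "ad_id", "brand", "advertiser_id", "platform", "market", "language",
    "category", "subcategory", "format", "source", "source_url",
    "licence_class", "capture_date", "capture_method", "opt_in_flag",
    "public_engagement_signal", "benchmark_pool_version", "model_version",
    "scoring_run_id", "confidence_label", "sampling_band", "residency",
)

# Enum values as listed in the table.
ENUMS = {
    "platform": {"meta", "tiktok", "youtube", "google_display"},
    "licence_class": {"public_api", "public_transparency",
                      "documented_provenance", "opt_in"},
    "confidence_label": {"high", "medium", "low"},
}

# Assumption: these two classes are the "public-surface" rows.
PUBLIC_SURFACE_CLASSES = {"public_api", "public_transparency"}


def validate_row(row):
    """Return a list of problems; an empty list means the row passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in row]
    for field, allowed in ENUMS.items():
        if field in row and row[field] not in allowed:
            problems.append(f"invalid {field}: {row[field]!r}")
    # "False for every public-surface row", per the opt_in_flag description.
    if row.get("licence_class") in PUBLIC_SURFACE_CLASSES and row.get("opt_in_flag"):
        problems.append("opt_in_flag must be False on public-surface rows")
    return problems


# Hypothetical row: placeholder values, then the checked fields set explicitly.
good = dict.fromkeys(REQUIRED_FIELDS, "x")
good.update(platform="tiktok", licence_class="public_transparency",
            confidence_label="high", opt_in_flag=False)
bad = dict(good, opt_in_flag=True)

print(validate_row(good))
print(validate_row(bad))
```

Returning a list of problems instead of raising on the first one means an auditor sees everything wrong with a row in a single pass.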

The closing line

"The score is infrastructure. The benchmark is the product. The audit log is the licence. Without all three, you do not have a creative-testing vendor, you have a number, and a story about the number."

Asked, plainly.

Does SaliencyLab scrape ads from social platforms?
No. Every one of the 2,047 ads in pool v.2026-05 comes from an official transparency surface (Meta Ad Library API, TikTok Creative Center, TikTok Ad Library, Google Ads Transparency Center) or from manual editorial curation with documented provenance. Scraping breaks platform terms of service and creates downstream risk for the customer the benchmark serves. We do not do it. Every ad in the pool stores its source_url, capture_date, and licence_class so the provenance is auditable per row.
How does a customer's uploaded ad enter the benchmark?
Only with explicit opt-in via the Score-My-Ad pathway. The default for any customer upload is private: the ad is scored, the report is returned, the asset is staged in a Supabase Storage bucket in the EU region during processing, and the staged asset is deleted at the end of the scoring run. The benchmark grows only from rows where opt_in_flag = true. There is no silent ingestion. There is no dark-pattern opt-out. The flag is part of the audit log so anyone reading the row knows which pathway it entered through.
What licence class covers ads pulled from Meta Ad Library API?
Meta Ad Library API is an official public API operated by Meta under their Terms for Ads About Social Issues, Elections or Politics and the broader Ad Library transparency programme. Programmatic access via the documented API is the licensed pathway. We ingest advertiser metadata, ad format, region, and creative URL via the API; we do not bypass rate limits and we do not scrape the web interface. Captured rows record source_url pointing back to the canonical Ad Library entry.
Where are the ad assets and audit logs stored?
Audit log rows and benchmark metadata live in Supabase Postgres in the EU region (Frankfurt). Vertex AI processing happens in Google Cloud EU regions. Cloud Run workers for video frame extraction run in EU regions. Customer assets are staged in Supabase Storage in the EU during the scoring run and deleted from the staging bucket after the pipeline completes. The data-residency choice is deliberate: EU-first, because the majority of brand-side procurement teams ask for it on the first call.
How does SaliencyLab handle GDPR for EU-sourced public ads?
The ads in the benchmark are advertising creative, not personal data: they were published by advertisers for public audiences via platform transparency surfaces and contain corporate brand assets, not consumer PII. Where a human appears in a creative (a presenter, a model, a UGC creator), the appearance is governed by the original advertiser's release with that individual, not by SaliencyLab. We do not re-publish the asset; we score it and store a source_url back to the canonical transparency entry. Removal requests are honoured within 30 days, covering the canonical row and any downstream report cards.
What is the difference between TikTok Creative Center and TikTok Ad Library?
Both are public transparency surfaces run by TikTok and they cover different cohorts. TikTok Creative Center surfaces top-performing ads with public CTR percentile bands, engagement signals and view counts, useful for outcome calibration. TikTok Ad Library is the broader transparency obligation surface across all running ads in a given market. We ingest from both and tag each row with its source so the licence class and the outcome-signal availability are explicit per ad.