Every score is a tagging.
Before any KPI is computed, before any benchmark percentile is reported, before any verdict appears on a report, Gemini 2.5 Flash has to look at the ad and tag its attributes. Hook density. Brand cue placement. On-screen text class. Sound-off comprehension. Forty-three attributes in total. If that tagging is unreliable, every number downstream is decorated noise.
Most LLM-as-judge pipelines in production do not have a public inter-rater reliability (IRR) study. They have a vibe: "we spot-checked some outputs, they looked right." That is not science, and at the scale of a creative-testing platform it is not even due diligence. The whole point of using an LLM as the tagging substrate is to claim machine-grade consistency; the only way to substantiate the claim is a formal IRR study against trained human coders.
This paper is that study. It is pre-printed here, ahead of journal submission, for two reasons: because buyers should be able to read it before they trust the score, and because the IRR floor is a methodological artifact that the field should be holding all of us to, not just SaliencyLab.
"The reliability of the tagging is the reliability of the platform. If the substrate is unreliable, the score is a number that looks like a number."
Five hundred ads. Three coders. Six thousand four hundred and fifty decisions.
The study is a multi-rater inter-rater reliability design with one LLM rater (Gemini 2.5 Flash) and three trained human coders. Each rater independently tags every ad in the 500-ad cohort against the same 43-attribute codebook. All four raters are blinded to each other's tagging during the coding window.
| Parameter | Value | Notes |
|---|---|---|
| Ad cohort | n = 500 | Stratified across platform (TikTok 200, YouTube 150, Meta 150) |
| Attributes per ad | 43 | Hook, brand, emotion, pacing, CTA, distinctive-asset families |
| Paired decisions | 6,450 | 500 × 43, minus 1,000 not-applicable cells (audio attrs on silent ads) |
| LLM rater | Gemini 2.5 Flash | Vertex AI, Zod-structured output, temperature 0.2 |
| Human coders | 3 | Agency creative, marketing-science researcher, platform-side performance lead |
| Training cohort | 50 ads adjudicated | Codebook walkthrough + adjudicated ground truth before study window |
| Reliability metric | Cohen's κ / Fleiss' κ | Cohen pairwise; Fleiss across the three humans for multi-class attrs |
| Study window | 2026-03 to 2026-04 | Coders blinded; LLM output captured at scoring time, not re-run |
What "blinded" means in practice. Each human coder receives a randomized batch of 250 ads from the 500-cohort with no visibility into the LLM tagging or the other coders' tagging. The LLM output is captured at the moment of original scoring, never re-prompted to match human consensus. The codebook is the only shared artifact.
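For readers who want to see what the Fleiss' κ row in the design table computes, here is a minimal sketch in plain Python. It is not the study's notebook code: the function name `fleiss_kappa`, the input layout (one list of labels per ad, one label per coder) and the toy emotion labels are all assumptions made for illustration.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for items each rated by the same number of raters.

    ratings: list of items; each item is a list of category labels,
    one label per rater (hypothetical layout, not the study's format).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of raters")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / n_items               # mean observed agreement
    total = n_items * n_raters
    p_e = sum((c / total) ** 2 for c in category_totals.values())  # chance
    if p_e == 1.0:
        return 1.0  # degenerate case: only one category was ever used
    return (p_bar - p_e) / (1 - p_e)

# Toy data: four ads, three coders, a multi-class emotion attribute.
toy = [
    ["joy", "joy", "joy"],
    ["joy", "joy", "calm"],
    ["calm", "calm", "calm"],
    ["fear", "calm", "fear"],
]
print(round(fleiss_kappa(toy), 3))
```

The same function applies to any multi-class attribute as long as every ad in the input was coded by the same number of coders.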
Headline: 0.847. Substructure: published.
Aggregate observed agreement between the LLM and the modal human decision across all 6,450 paired decisions is 0.847, which is the basis of the "~85% accuracy" line in the public claims. The headline number is useful as a one-liner. The substructure is what matters for trust.
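The difference between the headline (observed agreement) and the κ values in the table is the chance correction. The sketch below, under stated assumptions, shows both quantities on a toy attribute. The helper names (`modal_label`, `observed_agreement`, `cohen_kappa`) and the toy labels are hypothetical, and the tie rule (drop the decision when the coders split evenly) is an assumption, since the text does not say how ties in the modal human decision are resolved.

```python
from collections import Counter

def modal_label(labels):
    """Most frequent label among the human coders; None if the top spot is tied."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: tied decisions are excluded
    return ranked[0][0]

def observed_agreement(a, b):
    """Fraction of paired decisions on which the two raters match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used a single label
    return (p_o - p_e) / (1 - p_e)

# Toy example: eight paired decisions on a three-class attribute.
llm = ["weak", "moderate", "strong", "weak", "moderate", "strong", "weak", "strong"]
human_modal = ["moderate", "moderate", "strong", "weak", "moderate", "strong", "weak", "moderate"]

print(observed_agreement(llm, human_modal))        # raw match rate
print(round(cohen_kappa(llm, human_modal), 3))     # chance-corrected
```

On this toy data the raw match rate is higher than κ, because some matches would be expected by chance alone given how often each rater uses each label.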
| Attribute family | Attrs in family | Cohen's κ (LLM vs human modal) | Reliability class |
|---|---|---|---|
| Logo / brand-mark presence | 3 | 0.91 | Strong |
| On-screen CTA structure | 4 | 0.88 | Strong |
| Music tempo / silence class | 3 | 0.86 | Strong |
| Pacing band (shot count, cut rate) | 4 | 0.82 | Strong |
| Distinctive-asset density | 5 | 0.74 | Moderate |
| Hook structure (first 2-second class) | 6 | 0.71 | Moderate |
| Emotion family (multi-class) | 7 | 0.66 | Moderate |
| Sound-off comprehension | 3 | 0.63 | Moderate |
| Multi-shot brand-cue continuity | 3 | 0.52 | Weak |
| MENA-dialect cultural ritual | 3 | 0.49 | Weak |
| Irony / visual subversion of verbal frame | 2 | 0.43 | Weak |
Table: LLM-vs-human agreement by attribute family.
The three red rows (the Weak class) are the paper.
A reader who only reads the headline learns "~85% accurate." A reader who reads the kappa table learns where that 85% sits, and, more importantly, where the 15% lives. Three attribute families sit below κ = 0.55 and they are not random.
Multi-shot brand-cue continuity (κ=0.52). The LLM treats each shot independently and underweights the cumulative brand-cue density across a sequence. A 15-second ad with a logo glimpse in shot 2 and a colour wash in shot 5 reads as "weak brand cue" to the model but as "moderate brand cue" to the humans who watched it as a continuous artifact. This is an architecture-level disagreement, not a tagging error.
MENA-dialect cultural ritual (κ=0.49). Darija and Khaleeji ritual cues (a specific hospitality gesture, the framing of a wedding-procession motif, the use of a particular call-and-response idiom) are routinely missed against the model's Egyptian Arabic baseline. The model has the language but not the regional ritual vocabulary. This is a training-data gap and we are documenting it in the limitations section, not hiding it.
Irony / visual subversion of verbal frame (κ=0.43). Sarcasm carried by a visual that undermines the spoken claim (the classic "we are not like other ads" cut) is read literally about 38% of the time. The model picks up the verbal claim and misses the visual counter. Humans catch this almost universally because irony is a social construct, not a textual one.
"A score that is 85% accurate is interesting. A score that is 85% accurate and tells you which 15% you cannot trust is usable. The three red rows are the difference between a marketing claim and a methodology."
Three cases. The seams.
The full paper carries an appendix of fifteen disagreement case studies. The three reproduced below are representative of the three failure families above.
Case 1: multi-shot brand-cue continuity
LLM tagging
"Weak brand cue. Logo appears once at 0:02, no further reinforcement. Brand recall risk: high."
Modal human tagging
"Moderate brand cue. Logo at 0:02; brand colour wash at 0:07; product silhouette at 0:11; pack shot at 0:14. The cue density is distributed but cumulatively present."
Case 2: MENA-dialect cultural ritual
LLM tagging
"Cultural-resonance markers: low. Generic family gathering imagery. No ritual signifier detected."
Modal human tagging
"Cultural-resonance markers: high. The specific tea-pouring gesture and the placement of the host at the centre of the frame are MENA hospitality codes; the ad is using them deliberately."
Case 3: irony / visual subversion of verbal frame
LLM tagging
"Tone: sincere. Voiceover claims premium positioning ('the only one that...'); confidence: high."
Modal human tagging
"Tone: ironic. Voiceover sincerity is undercut by a deadpan visual cut to a stock-image collage; the ad is mocking premium-positioning conventions. The whole hook is the subversion."
What the platform does with this today.
The IRR results are not a paper that sits on a shelf; they are wired into the live pipeline. Every RoastIQ report carries a confidence_label per KPI, and the labels are derived from the IRR-class table above.
- Strong κ (≥0.80) attributes (logo, CTA, music tempo, pacing) feed KPIs with full confidence weighting. The "Get Noticed" KPI is dominated by these and inherits their reliability.
- Moderate κ (0.60–0.79) attributes (distinctive-asset density, hook structure, emotion, sound-off comprehension) feed KPIs with a confidence_label of "moderate" and an explicit note in the decision_trace.
- Weak κ (<0.55) attributes (multi-shot brand continuity, MENA cultural ritual, irony detection) are flagged. Where an ad triggers these flags, the report surfaces the limitation in plain English: "This ad relies on irony; our model reads ironic ads with κ=0.43 reliability. Treat the Sell Proposition score as directional."
- Synthetic Users compensation. Where the LLM tagging hits a weak band, the Synthetic Users panel is one of the ways a buyer can triangulate: persona-grounded interpretation closes part of the gap the tagging leaves open.
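The band logic in the list above can be sketched as a small lookup. This is an illustration, not the production pipeline: the function names, the treatment of "0.60–0.79" as the half-open interval [0.60, 0.80), the `None` return for the 0.55 to 0.60 range that the bands do not define, and the "weakest contributing family wins" aggregation rule are all assumptions. The κ values are copied from the agreement table.

```python
from typing import Iterable, Optional

# Cohen's kappa per attribute family, copied from the agreement table.
FAMILY_KAPPA = {
    "Logo / brand-mark presence": 0.91,
    "On-screen CTA structure": 0.88,
    "Music tempo / silence class": 0.86,
    "Pacing band (shot count, cut rate)": 0.82,
    "Distinctive-asset density": 0.74,
    "Hook structure (first 2-second class)": 0.71,
    "Emotion family (multi-class)": 0.66,
    "Sound-off comprehension": 0.63,
    "Multi-shot brand-cue continuity": 0.52,
    "MENA-dialect cultural ritual": 0.49,
    "Irony / visual subversion of verbal frame": 0.43,
}

_RANK = {"weak": 0, "moderate": 1, "strong": 2}

def reliability_class(kappa: float) -> Optional[str]:
    """Map a kappa value to the band named in the text."""
    if kappa >= 0.80:
        return "strong"
    if 0.60 <= kappa < 0.80:   # assumption: "0.60-0.79" means [0.60, 0.80)
        return "moderate"
    if kappa < 0.55:
        return "weak"
    return None                # 0.55 <= kappa < 0.60 is not defined in the text

def kpi_confidence_label(families: Iterable[str]) -> str:
    """Hypothetical rule: a KPI is only as reliable as its weakest input family."""
    classes = [reliability_class(FAMILY_KAPPA[f]) for f in families]
    if any(c is None for c in classes):
        return "unclassified"
    return min(classes, key=_RANK.__getitem__)

print(kpi_confidence_label(["Logo / brand-mark presence", "On-screen CTA structure"]))
print(kpi_confidence_label(["Hook structure (first 2-second class)",
                            "Irony / visual subversion of verbal frame"]))
```

None of the eleven published κ values falls in the undefined 0.55 to 0.60 range, so every family in the table maps to exactly one band under this sketch.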
This is what it looks like for an IRR study to do its job: not "we got 85%" stuck on a marketing page, but a confidence layer that travels with every report and tells the buyer exactly where to trust the score and where to caveat it.
What this study does not show.
An honest paper publishes its limits in the body, not in a footnote. Four things this IRR study does not establish, and what we are doing about each.
- It does not validate the KPI weights. IRR measures attribute-tagging reliability, not the correctness of the 25/20/20/20/15 composite. That validation lives in the held-out, out-of-sample (OOS) Spearman ρ work (ρ = +0.30 to +0.32) summarised in the main notebook.
- It is biased toward Anglo-Western creative grammar. The 500-ad cohort is stratified across platforms but skewed Anglo-Western in content. The MENA cultural ritual failure is in part a sampling artifact; the next IRR round will rebalance toward MENA-origin ads and re-test specifically against that gap.
- It tests one LLM version at one temperature. Gemini 2.5 Flash at temperature 0.2. The kappa table is pinned to that combination; Paper 04.1 (in pipeline) will replicate with Gemini Pro and at temperatures 0.0 and 0.4 to quantify sensitivity.
- It does not test panel-of-LLMs vs single-LLM tagging. Whether ensembling three LLM raters lifts kappa above single-rater Gemini is an open question; the experimental design is drafted but the run is not yet funded.
Next steps. Full paper submission to Information Systems Research in Q3 2026. The replication package will include the codebook, the 500-ad ID list with public source URLs, the per-ad LLM and human tags, and the kappa computation notebooks. A skeptical reviewer should be able to reproduce every number on this page without our cooperation.