Research · pre-print · 11 min read · 18 May 2026

LLM-as-judge,
kappa by kappa.

Pre-print of Paper 04 in the SaliencyLab research pipeline (target: Information Systems Research). Gemini 2.5 Flash vs three trained human coders on 500 ads × 43 attributes, 6,450 paired decisions. The headline broken down. The systematic misses published. The empirical floor under every score the platform ships.

n=500 ads · 43 attributes · 3 human coders · 6,450 paired decisions · κ̄=0.78 overall
Oussama Nakhil · Founder & CEO
Multiple years buyer-side: NielsenIQ insights, then L'Oréal Groupe in global marketing insights. Pre-print released under CC-BY 4.0; full submission Q3 2026 to Information Systems Research.
0.847 · Aggregate observed agreement (~85% rounded)
0.78 · Mean Cohen's κ across 43 attributes
3/43 · Attribute classes with κ < 0.55 (published)

Every score is a tagging.

Before any KPI is computed, before any benchmark percentile is reported, before any verdict appears on a report, Gemini 2.5 Flash has to look at the ad and tag its attributes. Hook density. Brand cue placement. On-screen text class. Sound-off comprehension. Forty-three attributes in total. If that tagging is unreliable, every number downstream is decorated noise.

Most LLM-as-judge pipelines in production do not have a public inter-rater reliability (IRR) study. They have a vibe: "we spot-checked some outputs, they looked right." That is not science, and at the scale of a creative-testing platform it is not even due diligence. The whole point of using an LLM as the tagging substrate is to claim machine-grade consistency; the only way to substantiate the claim is a formal IRR study against trained human coders.

This paper is that study. It is pre-printed here, ahead of journal submission, for two reasons: because buyers should be able to read it before they trust the score, and because the IRR floor is a methodological artifact that the field should be holding all of us to, not just SaliencyLab.

Operating principle

"The reliability of the tagging is the reliability of the platform. If the substrate is unreliable, the score is a number that looks like a number."

Five hundred ads. Three coders. Six thousand four hundred and fifty decisions.

The study is a multi-rater inter-rater reliability design with one LLM rater (Gemini 2.5 Flash) and three trained human coders. Each rater independently tags every ad in the 500-ad cohort against the same 43-attribute codebook. All four raters are blinded to each other's tagging during the coding window.

| Parameter | Value | Notes |
| --- | --- | --- |
| Ad cohort | n = 500 | Stratified across platform (TikTok 200, YouTube 150, Meta 150) |
| Attributes per ad | 43 | Hook, brand, emotion, pacing, CTA, distinctive-asset families |
| Paired decisions | 6,450 | 500 × 43, minus 1,000 not-applicable cells (audio attrs on silent ads) |
| LLM rater | Gemini 2.5 Flash | Vertex AI, Zod-structured output, temperature 0.2 |
| Human coders | 3 | Agency creative, marketing-science researcher, platform-side performance lead |
| Training cohort | 50 ads adjudicated | Codebook walkthrough + adjudicated ground truth before study window |
| Reliability metric | Cohen's κ / Fleiss' κ | Cohen pairwise; Fleiss across the three humans for multi-class attrs |
| Study window | 2026-03 to 2026-04 | Coders blinded; LLM output captured at scoring time, not re-run |
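For readers who want to see what the Fleiss' κ row involves, here is a minimal sketch in plain Python. It is not the study's computation notebook: the function name `fleiss_kappa`, the input layout (one list of labels per ad, one label per coder) and the toy labels are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for N items, each labelled by the same number of raters."""
    if not ratings:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    n_items = len(ratings)
    category_totals: Counter = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of ordered rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    p_bar = sum(per_item_agreement) / n_items
    # Chance agreement from the pooled category proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: four ads, three coders, a two-class attribute.
toy_ratings = [
    ["joy", "joy", "calm"],
    ["joy", "calm", "calm"],
    ["joy", "joy", "joy"],
    ["calm", "calm", "calm"],
]
print(round(fleiss_kappa(toy_ratings), 3))
```

The sketch assumes every ad carries a label from all three coders for the attribute in question; not-applicable cells would have to be dropped before calling it.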

What "blinded" means in practice. Each human coder receives a randomized batch of 250 ads from the 500-cohort with no visibility into the LLM tagging or the other coders' tagging. The LLM output is captured at the moment of original scoring, never re-prompted to match human consensus. The codebook is the only shared artifact.

Headline: 0.847. Substructure: published.

Aggregate observed agreement between the LLM and the modal human decision across all 6,450 paired decisions is 0.847, which is the basis of the "~85% accuracy" line in the public claims. The headline number is useful as a one-liner. The substructure is what matters for trust.
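As a rough illustration of how an "LLM vs modal human decision" agreement rate can be computed, the sketch below takes the majority label of the human coders for each cell and counts how often the LLM matches it. The function names, the data layout and the tie rule (cells with no majority are skipped) are assumptions; the pre-print does not state how ties between coders were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def modal_label(labels: Sequence[str]) -> Optional[str]:
    """Return the most frequent label, or None when the top count is tied."""
    if not labels:
        raise ValueError("need at least one human label")
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: ties have no modal decision and are skipped
    return ranked[0][0]


def observed_agreement(
    llm_tags: Sequence[str], human_tags: Sequence[Sequence[str]]
) -> float:
    """Fraction of cells where the LLM tag equals the modal human tag."""
    matches = 0
    total = 0
    for llm, humans in zip(llm_tags, human_tags):
        modal = modal_label(humans)
        if modal is None:
            continue
        total += 1
        matches += int(llm == modal)
    if total == 0:
        raise ValueError("no cells with a modal human decision")
    return matches / total


# Hypothetical toy cells: one LLM tag and three human tags per cell.
llm = ["present", "absent", "present", "present"]
humans = [
    ["present", "present", "absent"],  # modal: present (match)
    ["absent", "absent", "absent"],    # modal: absent  (match)
    ["absent", "absent", "present"],   # modal: absent  (miss)
    ["weak", "moderate", "strong"],    # three-way tie, skipped
]
print(observed_agreement(llm, humans))
```

Observed agreement is the raw match rate only; it is not chance-corrected, which is why the κ table carries the substructure.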

| Attribute family | Attrs in family | Cohen's κ (LLM vs human modal) | Reliability class |
| --- | --- | --- | --- |
| Logo / brand-mark presence | 3 | 0.91 | Strong |
| On-screen CTA structure | 4 | 0.88 | Strong |
| Music tempo / silence class | 3 | 0.86 | Strong |
| Pacing band (shot count, cut rate) | 4 | 0.82 | Strong |
| Distinctive-asset density | 5 | 0.74 | Moderate |
| Hook structure (first 2-second class) | 6 | 0.71 | Moderate |
| Emotion family (multi-class) | 7 | 0.66 | Moderate |
| Sound-off comprehension | 3 | 0.63 | Moderate |
| Multi-shot brand-cue continuity | 3 | 0.52 | Weak |
| MENA-dialect cultural ritual | 3 | 0.49 | Weak |
| Irony / visual subversion of verbal frame | 2 | 0.43 | Weak |

[Figure: LLM-vs-human agreement by attribute family, Cohen's κ, same values as the table above. Strong ≥ 0.80 (green). Moderate 0.60–0.79 (amber). Weak < 0.60 (red). Published, not aggregated away.]

The three red rows are the paper.

A reader who only reads the headline learns "~85% accurate." A reader who reads the kappa table learns where that 85% sits, and, more importantly, where the 15% lives. Three attribute families have κ below 0.55, and they are not random.

Multi-shot brand-cue continuity (κ=0.52). The LLM treats each shot independently and underweights the cumulative brand-cue density across a sequence. A 15-second ad with a logo glimpse in shot 2 and a colour wash in shot 5 reads as "weak brand cue" to the model but as "moderate brand cue" to the humans who watched it as a continuous artifact. This is an architecture-level disagreement, not a tagging error.

MENA-dialect cultural ritual (κ=0.49). Darija and Khaleeji ritual cues (a specific hospitality gesture, the framing of a wedding-procession motif, the use of a particular call-and-response idiom) are routinely missed against the model's Egyptian Arabic baseline. The model has the language but not the regional ritual vocabulary. This is a training-data gap, and we are documenting it in the limitations section, not hiding it.

Irony / visual subversion of verbal frame (κ=0.43). Sarcasm carried by a visual that undermines the spoken claim (the classic "we are not like other ads" cut) is read literally about 38% of the time. The model picks up the verbal claim and misses the visual counter. Humans catch this almost universally because irony is a social construct, not a textual one.

Why we publish them

"A score that is 85% accurate is interesting. A score that is 85% accurate and tells you which 15% you cannot trust is usable. The three red rows are the difference between a marketing claim and a methodology."

Three cases. The seams.

The full paper carries an appendix of fifteen disagreement case studies. The three reproduced below are representative of the three failure families above.

Case 01 · Multi-shot brand continuity
15-second YouTube Shorts ad, beverage category, France
LLM tagging

"Weak brand cue. Logo appears once at 0:02, no further reinforcement. Brand recall risk: high."

Modal human tagging

"Moderate brand cue. Logo at 0:02; brand colour wash at 0:07; product silhouette at 0:11; pack shot at 0:14. The cue density is distributed but cumulatively present."

Case 02 · MENA cultural ritual
9-second TikTok ad, telecoms, Morocco, Darija voice-over
LLM tagging

"Cultural-resonance markers: low. Generic family gathering imagery. No ritual signifier detected."

Modal human tagging

"Cultural-resonance markers: high. The specific tea-pouring gesture and the placement of the host at the centre of the frame are MENA hospitality codes, the ad is using them deliberately."

Case 03 · Irony / visual subversion
22-second Meta ad, DTC challenger brand, US
LLM tagging

"Tone: sincere. Voiceover claims premium positioning ('the only one that...'); confidence: high."

Modal human tagging

"Tone: ironic. Voiceover sincerity is undercut by a deadpan visual cut to a stock-image collage; the ad is mocking premium-positioning conventions. The whole hook is the subversion."

What the platform does with this today.

The IRR results are not a paper that sits on a shelf; they are wired into the live pipeline. Every RoastIQ report carries a confidence_label per KPI, and the labels are derived from the IRR-class table above.

  • Strong κ (≥0.80) attributes (logo, CTA, music tempo, pacing) feed KPIs with full confidence weighting. The "Get Noticed" KPI is dominated by these and inherits their reliability.
  • Moderate κ (0.60–0.79) attributes (distinctive-asset density, hook structure, emotion, sound-off comprehension) feed KPIs with a confidence_label of "moderate" and an explicit note in the decision_trace.
  • Weak κ (<0.60) attributes (multi-shot brand continuity, MENA cultural ritual, irony detection; all three measured below 0.55) are flagged. Where an ad triggers these flags, the report surfaces the limitation in plain English: "This ad relies on irony; our model reads ironic ads with κ=0.43 reliability. Treat the Sell Proposition score as directional."
  • Synthetic Users compensation. Where the LLM tagging hits a weak band, the Synthetic Users panel is one of the ways a buyer can triangulate: persona-grounded interpretation closes part of the gap the tagging leaves open.
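The banding described in the list above can be sketched as a small lookup. This is an illustration, not the production pipeline: the thresholds follow the published chart legend (Strong ≥ 0.80, Moderate 0.60–0.79, Weak < 0.60), the κ values come from the attribute-family table, and the function names and the label strings other than "moderate" are assumptions.

```python
# Cohen's kappa per attribute family, copied from the table on this page.
FAMILY_KAPPA = {
    "Logo / brand-mark presence": 0.91,
    "On-screen CTA structure": 0.88,
    "Music tempo / silence class": 0.86,
    "Pacing band": 0.82,
    "Distinctive-asset density": 0.74,
    "Hook structure": 0.71,
    "Emotion family": 0.66,
    "Sound-off comprehension": 0.63,
    "Multi-shot brand-cue continuity": 0.52,
    "MENA-dialect cultural ritual": 0.49,
    "Irony / visual subversion of verbal frame": 0.43,
}


def reliability_class(kappa: float) -> str:
    """Map a kappa value to the band used in the chart legend."""
    if kappa >= 0.80:
        return "strong"
    if kappa >= 0.60:
        return "moderate"
    return "weak"


def confidence_labels(family_kappa: dict) -> dict:
    """Hypothetical helper: one label per attribute family."""
    return {name: reliability_class(k) for name, k in family_kappa.items()}


labels = confidence_labels(FAMILY_KAPPA)
flagged = sorted(name for name, label in labels.items() if label == "weak")
for name in flagged:
    # Weak families are the ones a report would surface as a caveat.
    print(f"{name}: kappa={FAMILY_KAPPA[name]:.2f}, treat as directional")
```

Keeping the thresholds in one function means a later IRR round only has to update the κ values, and the labels follow.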

This is what it looks like for an IRR study to do its job: not "we got 85%" stuck on a marketing page, but a confidence layer that travels with every report and tells the buyer exactly where to trust the score and where to caveat it.

What this study does not show.

An honest paper publishes its limits in the body, not in a footnote. Four things this IRR study does not establish, and what we are doing about each.

  • It does not validate the KPI weights. IRR measures attribute-tagging reliability, not the correctness of the 25/20/20/20/15 composite. That validation lives in the held-out OOS Spearman ρ work (ρ +0.30–0.32) summarised in the main notebook.
  • It is biased toward Anglo-Western creative grammar. The 500-ad cohort is stratified across platforms but skewed Anglo-Western in content. The MENA cultural ritual failure is in part a sampling artifact; the next IRR round will rebalance toward MENA-origin ads and re-test specifically against that gap.
  • It tests one LLM version at one temperature. Gemini 2.5 Flash at temperature 0.2. The kappa table is pinned to that combination; Paper 04.1 (in pipeline) will replicate with Gemini Pro and at temperatures 0.0 and 0.4 to quantify sensitivity.
  • It does not test panel-of-LLMs vs single-LLM tagging. Whether ensembling three LLM raters lifts kappa above single-rater Gemini is an open question; the experimental design is drafted but the run is not yet funded.

Next steps. Full paper submission to Information Systems Research in Q3 2026. The replication package will include the codebook, the 500-ad ID list with public source URLs, the per-ad LLM and human tags, and the kappa computation notebooks. A skeptical reviewer should be able to reproduce every number on this page without our cooperation.

Asked, plainly.

What is the headline accuracy number, and why don't you just say "85%"?
Because "85%" is a marketing number and Cohen's κ is the honest number. Aggregate observed agreement across all 6,450 paired decisions is 0.847, which rounds to ~85%. But that aggregate hides the variance: some attribute classes show κ above 0.85 (logo presence, on-screen CTA, music tempo), some sit in the 0.60–0.75 band (emotion family, hook density, distinctive-asset count), and three sit below 0.55 (irony, MENA-dialect cultural ritual, multi-shot brand continuity). Publishing the headline without the κ table would be hiding the parts that need work.
Why is Cohen's kappa the right metric here?
Because raw percentage agreement overstates reliability whenever one class dominates. If 92% of ads have a visible logo, a coder who tags "logo present" on every ad gets 92% raw agreement while contributing zero diagnostic information. Cohen's κ corrects for chance agreement and is the standard metric for IRR studies in content analysis since Cohen (1960). For multi-class attributes we report Fleiss' κ across the three human coders.
Where does the LLM systematically fail?
Three documented failure families. One: irony detection. Sarcasm carried by visual subversion of a verbal frame is read literally about 38% of the time. Two: MENA-dialect cultural ritual detection. Darija and Khaleeji ritual cues are missed against Egyptian Arabic baselines. Three: multi-shot brand-cue continuity. The LLM treats each shot independently and underweights cumulative brand-cue density. These are the three places we will not silently ship; they are flagged in the report's decision_trace.
Why does this study matter for buyers, not just for journal reviewers?
Because every claim SaliencyLab makes about ad structure (every KPI score, every benchmark percentile, every Sharpen vs Rebuild verdict) depends on the LLM tagging the ad's attributes correctly. If the tagging is unreliable, the score is decorated noise. The IRR study is the empirical floor under everything else on the platform. Buyers who skip this paper are trusting a number whose substructure has not been validated.
Who are the three human coders, and how were they trained?
Three trained creative analysts: one ex-agency creative strategist (15 years, FMCG), one academic researcher in marketing science (PhD candidate, Mohammed V Rabat), one platform-side performance creative lead (8 years, TikTok + Meta). All three received the 43-attribute codebook plus a 50-ad training cohort with adjudicated ground truth before the 500-ad study began. Coders worked independently, blinded to each other's tagging, and blinded to the LLM output.
When will the full paper be released?
Submission window: Q3 2026 to Information Systems Research. The pre-print on this page is the methodology + headline κ table. The full paper will include the per-attribute breakdown (all 43), the fifteen worked disagreement case studies, the codebook in appendix A, and the replication package on GitHub.
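The logo example from the answer on why Cohen's κ is used can be worked through numerically. The sketch below is a plain-Python Cohen's κ for two raters, applied to a hypothetical set of 100 ads of which 92 show a logo; the function name and the data are illustrative assumptions, not study data.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, per category.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    if p_e == 1.0:
        return float("nan")  # both raters used a single identical category
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical cohort: 92 of 100 ads show a logo.
reference = ["present"] * 92 + ["absent"] * 8
# A coder who tags "logo present" on every ad.
always_present = ["present"] * 100

raw_agreement = sum(a == b for a, b in zip(reference, always_present)) / 100
print(raw_agreement)                           # 0.92 raw agreement
print(cohen_kappa(reference, always_present))  # 0.0 once chance is removed
```

Here observed agreement and chance agreement are both 0.92, so κ is zero: the constant tagger agrees often but carries no diagnostic information.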