Science · April 4, 2026 · 6 min read

Visual attention prediction vs. eye tracking: what we can and cannot claim

RoastIQ uses TranSalNet to predict where viewers are likely to look. This is not eye tracking. Here is exactly what the model does, how accurate it is, and where it breaks down.


What TranSalNet does

RoastIQ uses TranSalNet, a transformer-based saliency prediction model, to analyze each frame of your video. The model outputs a heatmap showing which areas of the frame are predicted to attract visual attention.

This is a prediction, not a measurement. No eyes were tracked. No participants watched your ad. The model was trained on large datasets of human fixation data and learned to predict where people are likely to look based on visual features like contrast, faces, text, and motion.
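
For the technically curious, the per-frame pipeline looks roughly like this. This is a minimal sketch, not our production code: it assumes `model` is an already-loaded TranSalNet-style PyTorch module that maps an RGB tensor to a single-channel saliency map, and it skips the model-specific preprocessing (fixed input size, normalization) documented in the TranSalNet repository.

```python
import cv2
import numpy as np
import torch

def frame_saliency(model: torch.nn.Module, frame_bgr: np.ndarray) -> np.ndarray:
    """Predict a per-pixel attention heatmap in [0, 1] for one frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    x = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        sal = model(x)[0, 0].cpu().numpy()
    # Min-max normalize so heatmaps are comparable across frames.
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)

def video_saliency(model: torch.nn.Module, path: str) -> list[np.ndarray]:
    """Run saliency prediction over every frame of a video file."""
    cap = cv2.VideoCapture(path)
    heatmaps = []
    ok, frame = cap.read()
    while ok:
        heatmaps.append(frame_saliency(model, frame))
        ok, frame = cap.read()
    cap.release()
    return heatmaps
```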

What it can tell you

  • Which areas of the frame are predicted to attract attention — useful for checking whether your brand logo, product, or CTA falls in a high-attention zone (the sketch after this list shows one way to run that check)
  • How attention distributes across frames over time — the evidence timeline shows when attention peaks and drops
  • Whether key creative moments coincide with attention moments — if your product reveal happens in a low-attention frame, that is a structural problem
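
The first bullet implies a concrete, checkable question. Here is one way that check could look, using the normalized heatmaps from the sketch above; the bounding-box format and the 80th-percentile cutoff are illustrative choices, not RoastIQ's internal thresholds.

```python
import numpy as np

def region_attention(heatmap: np.ndarray, box: tuple[int, int, int, int]) -> float:
    """Mean predicted attention inside an (x, y, w, h) bounding box."""
    x, y, w, h = box
    return float(heatmap[y:y + h, x:x + w].mean())

def in_high_attention_zone(heatmap: np.ndarray,
                           box: tuple[int, int, int, int],
                           percentile: float = 80.0) -> bool:
    """True if the box's mean saliency beats the frame's 80th percentile."""
    return region_attention(heatmap, box) >= float(np.percentile(heatmap, percentile))
```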

What it cannot tell you

  • What any specific person actually looked at — saliency prediction is probabilistic, not individual
  • Why someone looked at a specific area — the model predicts where, not why
  • Attention in context — the model sees the frame in isolation, not as part of a social media feed with surrounding content
  • Audio-driven attention shifts — the visual model does not account for sound cues that redirect gaze

How accurate is it?

TranSalNet reaches state-of-the-art performance on standard saliency benchmarks (MIT300, SALICON). In practical terms, the model correctly identifies the primary attention zone in approximately 85% of frames for standard advertising content.

This is not 85% accuracy in any absolute sense. It means the model agrees with aggregated human fixation data 85% of the time on benchmark datasets. Your specific creative may deviate, especially if it uses unusual visual techniques, rapid cuts, or text-heavy compositions.
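
To make the agreement claim concrete, a check of this kind can be phrased as: does the model's predicted peak land inside the region humans actually fixated most? The sketch below illustrates that framing. The 90th-percentile cutoff for the human top-attention zone is an illustrative choice, not an official benchmark metric (MIT300 and SALICON report measures like AUC, NSS, and CC).

```python
import numpy as np

def primary_zone_hit(pred: np.ndarray, fixation_map: np.ndarray,
                     percentile: float = 90.0) -> bool:
    """Does the predicted attention peak land in the human top-attention zone?"""
    peak = np.unravel_index(np.argmax(pred), pred.shape)
    return fixation_map[peak] >= np.percentile(fixation_map, percentile)

def agreement_rate(preds: list[np.ndarray], fixations: list[np.ndarray]) -> float:
    """Fraction of frames where model and aggregated human data agree."""
    hits = [primary_zone_hit(p, f) for p, f in zip(preds, fixations)]
    return sum(hits) / len(hits)
```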

How RoastIQ uses attention data

Attention prediction feeds into two KPIs:

  • Beat the Skip (25% weight): Attention score contributes 55% of the Beat the Skip calculation. High visual attention in the first 2-3 seconds supports a stronger Beat the Skip score.
  • Get Noticed (20% weight): Average ad viewed (attention) is one of three sub-KPI components; a rough arithmetic sketch of both contributions follows this list.
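
Taken together, the weights above pin down attention's direct share of the overall score. In this back-of-envelope sketch the 55%, 25%, and 20% weights come from the text, but the other signal names and the equal sub-KPI weighting inside Get Noticed are assumptions for illustration.

```python
def beat_the_skip(attention: float, other_signals: float) -> float:
    """Attention carries 55% of Beat the Skip; other evidence carries the rest."""
    return 0.55 * attention + 0.45 * other_signals

def get_noticed(attention: float, sub_kpi_b: float, sub_kpi_c: float) -> float:
    # "Average ad viewed" (attention) is one of three sub-KPI components;
    # equal weighting here is an assumption, not RoastIQ's actual formula.
    return (attention + sub_kpi_b + sub_kpi_c) / 3

# Attention's direct share of the total score, under these assumptions:
# Beat the Skip is 25% of the total and attention is 55% of it;
# Get Noticed is 20% of the total and attention is one third of it.
ATTENTION_SHARE = 0.25 * 0.55 + 0.20 * (1 / 3)  # ~0.20 of the overall score
```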

The attention heatmap is also surfaced directly in the evidence timeline, so you can scrub through the video and see which frames are predicted to attract the most attention.

The honest framing

We call it visual attention prediction, not eye tracking, because that is what it is. The distinction matters for trust: if you present saliency predictions as measured eye tracking data, you overstate your evidence. If you present them as predictions trained on fixation data, you give the viewer the right calibration for how much to trust the signal.

RoastIQ treats attention prediction as one input among five. It is not the verdict. It is one piece of evidence that feeds into the diagnostic.


Related reading

  • Science · The Pre-Post Gap: Why creative testing after launch is too late
  • Science · Three layers of scoring: from raw signals to KPI families to creative verdict
  • Science · From RoastIQ score to buyer objection: how context makes synthetic interviews sharper