RVAB · Robot Video Annotation Bench

Frontier-quality video annotation
at a fraction of the cost

We benchmarked 8 vision-language models across 4 robot & human-interaction tasks, then built an inference method that routes work intelligently — matching frontier accuracy while spending a third of the budget. The same pipeline powers a second-level, audited annotation deliverable.

$3.04
per video-hour, measured (Gemini 3.1 Pro, dense annotation)
72.7%
accuracy @ 35% of frontier cost
530
labeled segments across 14 clips, second-level
16
structured fields per segment, EN + 中文
The headline result

Cheaper and better — not a trade-off

On the Cosmos RoboFail benchmark, our agreement-gated cascade (M13) lets free open-source models handle the unambiguous clips and only escalates the hard ones to a frontier model. The result sits in the top-left "win" zone: frontier-level accuracy at ~⅓ the API spend.

Figure: accuracy (64% to 76%) vs. cost (frontier API calls per clip, 0.5× to 2.0×). Up and to the left is cheaper and better.
Qwen-2.5-VL-7B (open-source, free): 65.7% @ 0.0×
Gemini 3.1 Pro (frontier solo): 68.7% @ 1.0×
★ Ours, M13 cascade: 72.7% @ 0.35×
★ Ours, M13 + ensemble: 73.7% @ 0.71×
Full ensemble (max accuracy): 75.8% @ 2.0×
Why it works: when cheap models agree, the clip is usually visually unambiguous — and their consensus (70.3%) actually beats a frontier model solo (64.1%) on exactly those clips. We only pay for frontier inference on the 35% of clips where models disagree. Statistically validated with paired-bootstrap CIs.
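A minimal sketch of the agreement gate, not the M13 implementation itself. It assumes each model is a callable from a clip identifier to an answer string, and that "agree" means unanimous agreement among the cheap models; the names `Model`, `cascade_predict` and `frontier_call_rate` are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Sequence

# Hypothetical interface: a model maps a clip identifier to an answer string.
Model = Callable[[str], str]


def cascade_predict(
    clip: str, cheap_models: Sequence[Model], frontier: Model
) -> tuple[str, bool]:
    """Return (answer, escalated) for one clip.

    Assumption: the gate is unanimous agreement among the cheap models,
    and at least one cheap model is supplied.
    """
    votes = [model(clip) for model in cheap_models]
    answer, count = Counter(votes).most_common(1)[0]
    if count == len(votes):
        # All cheap models agree: accept their consensus, no frontier call.
        return answer, False
    # Cheap models disagree: pay for one frontier call on this clip.
    return frontier(clip), True


def frontier_call_rate(
    clips: Sequence[str], cheap_models: Sequence[Model], frontier: Model
) -> float:
    """Fraction of clips escalated, i.e. cost in frontier calls per clip."""
    escalated = [cascade_predict(c, cheap_models, frontier)[1] for c in clips]
    return sum(escalated) / len(escalated)
```

Under these assumptions, a call rate of 0.35 from `frontier_call_rate` corresponds to the 0.35× cost point reported above.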
The benchmark

8 models · 4 tasks · 36 findings

No single model wins everywhere — evaluation is task-conditioned. Frontier models lead on image MCQ and dense temporal grounding; open-source models surprisingly lead on free-form RoboVQA. Our method picks the right tool per task.

Model                EPIC dense (IoU / label)   RoboVQA   Cosmos RoboFail   Robo2VLM
Gemini 3.1 Pro       52.4 / 48.9                41.1      69.0              65.8
Gemini 3-Flash       50.8 / 38.7                39.8      67.0              67.1
Gemini 2.5 Pro       55.3 / 35.2                38.9      61.0
Cosmos-Reason1-7B    11.1 / 10.9                55.6      61.0              41.7
Qwen-2.5-VL-7B       16.0 / 13.4                50.7      65.0              46.8
InternVL3-8B         21.8 / 17.9                47.3      63.6              45.5

The top score varies by column. Full numbers, significance markers, and methodology are in the technical report. EPIC n=15 (segment-level n=140), RoboVQA n=63, Cosmos n=100, Robo2VLM n=345.

The deliverable

Second-level dense annotation, audited

Every long clip is segmented at second-level granularity with a 16-field bilingual schema — motion phase, contact state, object state-change, sub-goal, spatial relations and a 2–4 sentence description per segment. A second model pass scores confidence and flags only disagreements for human review.

$3.04/hr
measured cost per video-hour (total $2.20 for 43.4 min)
4.9s
mean segment length (second-level)
25.4
words of description per segment (avg)
16 fields
structured labels per segment, EN + 中文
00:09.5 – 00:11.5
insert · in-contact
insert strap into buckle
The right hand aligns the buckle's slot with the strap end and pushes it through; the left hand keeps the strap taut.
state: strap separate → partially threaded
00:11.5 – 00:14.0
manipulate · in-contact
pull strap taut through buckle
Both hands draw the strap through until the buckle seats against the loop.
sub-goal: fasten the buckle
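For illustration, the two sample segments above could be held in a record like the one below. This is a sketch, not the delivered schema: the field names are hypothetical, only the fields visible in the samples are included (not all 16), and reading the first tag as the motion phase and the second as the contact state is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One annotated segment. Field names are illustrative, not the real schema."""

    start_s: float
    end_s: float
    motion_phase: str       # assumed meaning of the first tag, e.g. "insert"
    contact_state: str      # assumed meaning of the second tag, e.g. "in-contact"
    action: str
    description: str
    state_change: Optional[str] = None
    sub_goal: Optional[str] = None

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


segments = [
    Segment(
        9.5, 11.5, "insert", "in-contact", "insert strap into buckle",
        "The right hand aligns the buckle's slot with the strap end and pushes "
        "it through; the left hand keeps the strap taut.",
        state_change="strap separate -> partially threaded",
    ),
    Segment(
        11.5, 14.0, "manipulate", "in-contact", "pull strap taut through buckle",
        "Both hands draw the strap through until the buckle seats against the loop.",
        sub_goal="fasten the buckle",
    ),
]
```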
What makes it different

Four ideas behind the numbers

Route, don't prompt

Cheap 7B models have a hard perception ceiling that prompting can't lift (proven across 3 model families). Our M13 cascade routes the easy clips to free models and only buys frontier inference on the hard ones.

Cost is in the thinking, not the pixels

Measured: Gemini-3 video input is cheap; the spend is output + hidden reasoning tokens. Tuning that knob cut annotation cost 57% with zero quality loss.

Granularity is a constraint, not a budget

A segment-length cap + per-cycle rule fixes lazy under-segmentation of long clips far better — and cheaper — than paying for more model thinking.
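One way a segment-length cap could be checked after annotation is sketched below. The function name `over_cap` and the 8-second cap in the usage note are hypothetical, since the actual cap value is not stated here, and the per-cycle rule is not modelled.

```python
from __future__ import annotations


def over_cap(segments: list[tuple[float, float]], max_len_s: float) -> list[int]:
    """Return the indices of (start_s, end_s) segments longer than max_len_s.

    Flagged segments are candidates for re-segmentation; the cap value is
    chosen by the caller.
    """
    return [
        i for i, (start, end) in enumerate(segments)
        if end - start > max_len_s
    ]
```

With a hypothetical cap of 8 seconds, `over_cap([(0.0, 4.0), (4.0, 20.0), (20.0, 23.5)], 8.0)` flags only the 16-second middle segment.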

Two-pass QA you can trust

A second independent model pass flags only the disagreements for human review, turning a black-box API call into an auditable, confidence-scored deliverable.
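The disagreement check can be sketched as a field-by-field comparison of two passes. This is an assumption-laden illustration: it assumes both passes share the same segment boundaries, the name `flag_disagreements` is hypothetical, and the fraction of agreeing fields stands in for whatever confidence score the second pass actually produces.

```python
from __future__ import annotations


def flag_disagreements(
    pass_a: list[dict], pass_b: list[dict], fields: list[str]
) -> list[dict]:
    """Compare two annotation passes segment by segment.

    Assumes the passes are aligned (same segments, same order) and that
    `fields` is non-empty. Confidence here is simply the share of fields
    on which the two passes agree.
    """
    report = []
    for idx, (a, b) in enumerate(zip(pass_a, pass_b)):
        differing = [f for f in fields if a.get(f) != b.get(f)]
        report.append({
            "segment": idx,
            "confidence": 1 - len(differing) / len(fields),
            "needs_review": bool(differing),
            "fields": differing,
        })
    return report
```

Only entries with `needs_review` set would go to a human reviewer; the rest are accepted as-is.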