RVAB · Robot Video Annotation Bench

Frontier-quality video annotation
at a fraction of the cost

We benchmarked 8 vision-language models across 4 robot & human-interaction tasks, then built an inference method that routes work intelligently — matching frontier accuracy while spending a third of the budget. The same pipeline powers a second-level, audited annotation deliverable.


$3.04

per video-hour, measured (Gemini 3.1 Pro, dense annotation)

72.7%

accuracy @ 35% of frontier cost

1,552

labeled segments across 20 clips, second-level

16

structured fields per segment, EN + ZH

The headline result

Cheaper and better — not a trade-off

On the Cosmos RoboFail benchmark, our agreement-gated cascade (M13) lets free open-source models handle the unambiguous clips and only escalates the hard ones to a frontier model. The result sits in the top-left "win" zone: frontier-level accuracy at ~⅓ the API spend.

[Chart: accuracy vs. cost on Cosmos RoboFail. Series: open-source (free) · frontier solo · our method · max-accuracy ensemble]

Why it works: when cheap models agree, the clip is usually visually unambiguous, and their consensus (70.3%) beats a frontier model solo (64.1%) on exactly those clips. We pay for frontier inference only on the 35% of clips where the cheap models disagree. The result is statistically validated with paired-bootstrap confidence intervals.
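A minimal sketch of the agreement-gated routing described above. The page does not spell out M13's exact agreement rule, so unanimity among the cheap models is assumed by default; `agreement_gated_cascade`, `min_agree`, and the callable model interface are hypothetical names for illustration, not the benchmark's code.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

Model = Callable[[str], str]  # hypothetical interface: clip id -> predicted label


def agreement_gated_cascade(
    clip: str,
    cheap_models: Sequence[Model],
    frontier_model: Model,
    min_agree: Optional[int] = None,
) -> Tuple[str, bool]:
    """Return (answer, escalated). The frontier model is called only on disagreement."""
    votes = [model(clip) for model in cheap_models]
    answer, count = Counter(votes).most_common(1)[0]
    # Assumption: "agreement" means unanimity unless min_agree is given.
    needed = len(votes) if min_agree is None else min_agree
    if count >= needed:
        return answer, False  # consensus reached: no paid call
    return frontier_model(clip), True  # disagreement: buy frontier inference
```

With three cheap models that return the same label, the frontier model is never invoked; a single dissenting vote triggers escalation under the default unanimity rule.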

The benchmark

8 models · 4 tasks · 36 findings

No single model wins everywhere — evaluation is task-conditioned. Frontier models lead on image MCQ and dense temporal grounding; open-source models surprisingly lead on free-form RoboVQA. Our method picks the right tool per task.

Model	EPIC dense (IoU / label)	RoboVQA	Cosmos RoboFail	Robo2VLM
Gemini 3.1 Pro	52.4 / 48.9	41.1	69.0	65.8
Gemini 3-Flash	50.8 / 38.7	39.8	67.0	67.1
Gemini 2.5 Pro	55.3 / 35.2	38.9	61.0	—
Cosmos-Reason1-7B	11.1 / 10.9	55.6	61.0	41.7
Qwen-2.5-VL-7B	16.0 / 13.4	50.7	65.0	46.8
InternVL3-8B	21.8 / 17.9	47.3	63.6	45.5

Bold-cell winners vary by column. Full numbers, significance markers and methodology in the technical report. EPIC n=15 (segment-level n=140), RoboVQA n=63, Cosmos n=100, Robo2VLM n=345.
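For readers unfamiliar with the IoU column, here is a sketch of the standard temporal intersection-over-union between a predicted and a reference segment. How the benchmark matches segments and aggregates scores is not stated on this page, so only the per-pair quantity is shown; `temporal_iou` is an illustrative name.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_s, end_s) in seconds


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection-over-union of two time intervals; 0.0 if they do not overlap."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


# A prediction shifted by half a second against a 2-second reference segment.
score = temporal_iou((9.5, 11.5), (10.0, 12.0))
```

Here the overlap is 1.5 s and the union is 2.5 s, so the pair scores 0.6.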

The deliverable

Second-level dense annotation, audited

Every long clip is segmented at second-level granularity with a 16-field bilingual schema — motion phase, contact state, object state-change, sub-goal, spatial relations and a 2–4 sentence description per segment. A second model pass scores confidence and flags only disagreements for human review.

$3.04/hr

measured cost per video-hour (total $2.20 for 43.4 min)

4.9s

mean segment length (second-level)

25.4

words of description per segment (avg)

16 fields

structured labels per segment, EN + ZH
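The headline rate follows directly from the measured totals quoted above ($2.20 for 43.4 min of video). A small sketch of the arithmetic, with `cost_per_video_hour` as an illustrative helper name:

```python
def cost_per_video_hour(total_usd: float, video_minutes: float) -> float:
    """Normalise a measured API spend to dollars per hour of video."""
    if video_minutes <= 0:
        raise ValueError("video_minutes must be positive")
    return total_usd / (video_minutes / 60.0)


rate = round(cost_per_video_hour(2.20, 43.4), 2)  # 3.04
```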

00:09.5 – 00:11.5

insert · in-contact

insert strap into buckle

The right hand aligns the buckle's slot with the strap end and pushes it through; the left hand keeps the strap taut. state: strap separate → partially threaded

00:11.5 – 00:14.0

manipulate · in-contact

pull strap taut through buckle

Both hands draw the strap through until the buckle seats against the loop. sub-goal: fasten the buckle
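One way to picture a segment record like the two samples above is as a small typed structure. This is only a sketch: the field names are hypothetical, it covers just the fields visible in the samples, and the full 16-field bilingual schema is not listed on this page and is not reproduced here.

```python
from dataclasses import asdict, dataclass


@dataclass
class Segment:
    # Field names are illustrative assumptions, not the deliverable's schema keys.
    start_s: float
    end_s: float
    motion_phase: str
    contact_state: str
    action: str
    description: str
    state_change: str = ""
    sub_goal: str = ""

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


seg = Segment(
    start_s=11.5,
    end_s=14.0,
    motion_phase="manipulate",
    contact_state="in-contact",
    action="pull strap taut through buckle",
    description="Both hands draw the strap through until the buckle seats against the loop.",
    sub_goal="fasten the buckle",
)
record = asdict(seg)  # plain dict, ready to serialise as JSON
```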


Pilot · external 15-clip set (trial capture, 15 clips)

Same pipeline, fresh headset capture

15 egocentric clips (~40 min total) from an Apple-Vision-Pro-style headset rig: stereo HEVC RGB + 26-joint dual-hand tracking + head pose + IMU. Annotated end-to-end with Gemini 3.1 Pro at 100% timeline coverage, using the same dense-temporal prompt + sanitiser as the main deliverable. Empirical hand-active intervals, taken directly from the headset's own CSV feed, serve as a sensor-side sanity check.

15 / 15

clips at 100% timeline coverage

815 segments

dense, bilingual, 16-field schema

$5.69

total API spend ($9.35 / video-hour)

3 subjects

kitchen / office / tea-ceremony activities
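The timeline-coverage figure quoted above can be checked with a few lines of code. The sketch below assumes coverage means the fraction of the clip's duration that falls inside at least one annotated segment; the pipeline's own definition is not given on this page, and `timeline_coverage` is an illustrative name.

```python
from typing import Iterable, Tuple


def timeline_coverage(
    segments: Iterable[Tuple[float, float]], clip_duration_s: float
) -> float:
    """Fraction of [0, clip_duration_s] covered by the union of segments."""
    if clip_duration_s <= 0:
        raise ValueError("clip_duration_s must be positive")
    covered = 0.0
    cursor = 0.0  # end of the region already counted
    for start, end in sorted(segments):
        start = max(start, cursor)       # do not double-count overlaps
        end = min(end, clip_duration_s)  # ignore anything past the clip end
        if end > start:
            covered += end - start
            cursor = end
    return covered / clip_duration_s
```

Overlapping segments are counted once, and a gap between segments shows up as coverage below 1.0.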


Pilot · sample01 (3 GoPro clips, ~30 min)

Same pipeline, raw GoPro footage

3 GoPro 1080p/60fps videos (~30 minutes total) covering hand-washing, cardboard-box assembly, and ventilation-fan assembly. Annotated end-to-end on the video-mode Gemini 3.1 Pro path with the same dense-temporal prompt: 100% timeline coverage, bilingual EN+ZH 16-field schema, ~$3.32 / video-hour measured.

3 / 3

clips at 100% timeline coverage

373 segments

dense, bilingual, 16-field schema

$1.63

total spend (~$3.32 / video-hour)

3 tasks

washing · box assembly · fan assembly


What makes it different

Four ideas behind the numbers

Route, don't prompt

Cheap 7B models have a hard perception ceiling that prompting can't lift (proven across 3 model families). Our M13 cascade routes the easy clips to free models and only buys frontier inference on the hard ones.

Cost is in the thinking, not the pixels

Measured: Gemini-3 video input is cheap; the spend is output + hidden reasoning tokens. Tuning that knob cut annotation cost 57% with zero quality loss.
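To see why output and reasoning tokens can dominate the bill, a cost breakdown can be sketched as below. Everything numeric in the usage example is a made-up placeholder, not a measurement or a real price, and the sketch assumes reasoning tokens are billed at the output-token rate; `annotation_cost_usd` is a hypothetical helper.

```python
from typing import Dict


def annotation_cost_usd(
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> Dict[str, float]:
    """Split one request's cost into input, visible output and reasoning parts."""
    input_cost = input_tokens / 1e6 * usd_per_m_input
    output_cost = output_tokens / 1e6 * usd_per_m_output
    # Assumption: hidden reasoning tokens are charged like output tokens.
    reasoning_cost = reasoning_tokens / 1e6 * usd_per_m_output
    return {
        "input": input_cost,
        "output": output_cost,
        "reasoning": reasoning_cost,
        "total": input_cost + output_cost + reasoning_cost,
    }


# Placeholder numbers only: 1M input, 100k output, 400k reasoning tokens.
breakdown = annotation_cost_usd(1_000_000, 100_000, 400_000, 1.0, 10.0)
```

In this placeholder scenario the reasoning share is the largest line item, which is the pattern the paragraph above describes.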

Granularity is a constraint, not a budget

A segment-length cap + per-cycle rule fixes lazy under-segmentation of long clips far better — and cheaper — than paying for more model thinking.
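A segment-length cap can be enforced as a cheap post-check. The sketch below only detects violations so they can be re-annotated; the cap value is a placeholder, the per-cycle rule is not modelled because its definition is not given here, and `segments_over_cap` is an illustrative name.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start_s, end_s)


def segments_over_cap(segments: Sequence[Interval], max_len_s: float) -> List[Interval]:
    """Return the segments whose duration exceeds the cap."""
    return [(start, end) for start, end in segments if end - start > max_len_s]


# Placeholder cap of 10 s: the second segment is flagged as under-segmented.
flagged = segments_over_cap([(0.0, 4.9), (4.9, 30.0)], max_len_s=10.0)
```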

Two-pass QA you can trust

A second independent model pass flags only the disagreements for human review, turning a black-box API call into an auditable, confidence-scored deliverable.
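The two-pass idea can be sketched as a field-by-field comparison of two annotation passes. This assumes both passes produce aligned segment lists, and it uses the fraction of agreeing fields as a stand-in confidence score; the deliverable's actual scoring is not described on this page, and `audit_passes` and the field names are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def audit_passes(
    first: Sequence[Dict[str, Any]],
    second: Sequence[Dict[str, Any]],
    fields: Sequence[str],
) -> List[Dict[str, Any]]:
    """Compare two passes per segment and flag only the disagreements."""
    if not fields:
        raise ValueError("fields must not be empty")
    report = []
    for index, (a, b) in enumerate(zip(first, second)):
        diffs = [name for name in fields if a.get(name) != b.get(name)]
        report.append({
            "segment": index,
            # Assumption: confidence = share of compared fields that agree.
            "confidence": 1.0 - len(diffs) / len(fields),
            "disagreements": diffs,
            "needs_review": bool(diffs),
        })
    return report
```

Only entries with `needs_review` set would go to a human reviewer; segments where both passes agree pass through untouched.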
