We benchmarked 8 vision-language models across 4 robot and human-interaction tasks, then built an inference method that sends each clip to the cheapest model able to handle it, matching frontier accuracy at about a third of the API spend. The same pipeline powers an audited annotation deliverable with second-level temporal granularity.
On the Cosmos RoboFail benchmark, our agreement-gated cascade (M13) lets free open-source models handle the unambiguous clips and escalates only the hard ones to a frontier model. The result sits in the top-left "win" zone of the accuracy-versus-cost plot: frontier-level accuracy at ~⅓ the API spend.
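The routing idea can be sketched in a few lines. This is a minimal illustration, not the M13 implementation: the exact gating rule is not given here, so the sketch assumes "agreement" means the cheap models return the same answer (unanimous by default), and the function name, the `min_agree` parameter and the model callables are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: a model maps a clip identifier to an answer label.
Predictor = Callable[[str], str]


def agreement_gated_cascade(
    clip: str,
    cheap_models: Sequence[Predictor],
    frontier_model: Predictor,
    min_agree: Optional[int] = None,
) -> Tuple[str, bool]:
    """Return (answer, escalated).

    Assumption: a clip counts as "unambiguous" when at least `min_agree`
    cheap models give the same answer (all of them by default).
    """
    if not cheap_models:
        raise ValueError("need at least one cheap model")
    votes = [model(clip) for model in cheap_models]
    answer, count = Counter(votes).most_common(1)[0]
    needed = len(votes) if min_agree is None else min_agree
    if count >= needed:
        return answer, False          # cheap models agree: no frontier call
    return frontier_model(clip), True  # disagreement: pay for the frontier model


if __name__ == "__main__":
    # Toy stand-ins for real models.
    cheap = [lambda clip: "fail", lambda clip: "fail"]
    frontier = lambda clip: "success"
    print(agreement_gated_cascade("clip_001", cheap, frontier))
```

Under this assumption, the frontier model is billed only for clips where the cheap models split, which is where the cost saving would come from.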
No single model wins everywhere — evaluation is task-conditioned. Frontier models lead on image MCQ and dense temporal grounding; open-source models surprisingly lead on free-form RoboVQA. Our method picks the right tool per task.
| Model | EPIC dense (IoU / label) | RoboVQA | Cosmos RoboFail | Robo2VLM |
|---|---|---|---|---|
| Gemini 3.1 Pro | 52.4 / **48.9** | 41.1 | **69.0** | 65.8 |
| Gemini 3-Flash | 50.8 / 38.7 | 39.8 | 67.0 | **67.1** |
| Gemini 2.5 Pro | **55.3** / 35.2 | 38.9 | 61.0 | — |
| Cosmos-Reason1-7B | 11.1 / 10.9 | **55.6** | 61.0 | 41.7 |
| Qwen-2.5-VL-7B | 16.0 / 13.4 | 50.7 | 65.0 | 46.8 |
| InternVL3-8B | 21.8 / 17.9 | 47.3 | 63.6 | 45.5 |
Bold cells mark the best score in each column, and the winner varies by column. Full numbers, significance markers and methodology are in the technical report. Sample sizes: EPIC n=15 (segment-level n=140), RoboVQA n=63, Cosmos n=100, Robo2VLM n=345.
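For readers unfamiliar with the IoU column, here is a minimal sketch of temporal intersection-over-union between a predicted and a reference segment. It assumes the EPIC IoU score is this standard temporal IoU; the exact metric definition and any matching or averaging procedure are in the technical report, not here. The function name is hypothetical.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, ref: Segment) -> float:
    """Overlap duration divided by the duration covered by either segment."""
    pred_start, pred_end = pred
    ref_start, ref_end = ref
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = (pred_end - pred_start) + (ref_end - ref_start) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Segments 2-6 s and 4-8 s overlap for 2 s out of a 6 s union.
    print(round(temporal_iou((2.0, 6.0), (4.0, 8.0)), 3))  # 0.333
```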
Every long clip is segmented at second-level granularity and annotated with a 16-field bilingual schema covering motion phase, contact state, object state-change, sub-goal, spatial relations and a 2–4 sentence description per segment. A second model pass scores confidence and flags only disagreements for human review.
Cheap 7B models have a hard perception ceiling that prompting can't lift (proven across 3 model families). Our M13 cascade routes the easy clips to free models and only buys frontier inference on the hard ones.
Measured: Gemini-3 video input is cheap; the spend is output + hidden reasoning tokens. Tuning output and reasoning token usage cut annotation cost by 57% with no measured quality loss.
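The cost split can be made concrete with a small calculator. This is a sketch under stated assumptions: reasoning tokens are assumed to be billed at the output rate, and every price and token count below is a made-up placeholder, not a measured value or a real price list. The function name is hypothetical.

```python
from typing import Dict


def request_cost(
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int,
    price_in_per_m: float,
    price_out_per_m: float,
) -> Dict[str, float]:
    """Split one request's cost into input and output parts.

    Assumption: hidden reasoning tokens are charged like output tokens.
    Prices are per million tokens.
    """
    input_cost = input_tokens * price_in_per_m / 1_000_000
    output_cost = (output_tokens + reasoning_tokens) * price_out_per_m / 1_000_000
    return {"input": input_cost, "output": output_cost, "total": input_cost + output_cost}


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    before = request_cost(10_000, 1_000, 4_000, price_in_per_m=1.0, price_out_per_m=10.0)
    after = request_cost(10_000, 1_000, 1_000, price_in_per_m=1.0, price_out_per_m=10.0)
    print(before)
    print(after)
```

With these placeholder values the output side dominates the bill, so trimming reasoning tokens moves the total far more than trimming input would.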
A segment-length cap plus a per-cycle rule fixes lazy under-segmentation of long clips far better, and more cheaply, than paying for more model thinking.
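One way to enforce a length cap is a cheap post-check that finds over-long segments so only those are re-annotated. This is a sketch of the cap alone: the per-cycle rule is not specified here and is not modeled, the cap value is a placeholder, and the source pipeline may apply the cap inside the prompt instead. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def over_long_segments(segments: Sequence[Segment], max_len_s: float) -> List[int]:
    """Return indices of segments whose duration exceeds the cap."""
    if max_len_s <= 0:
        raise ValueError("max_len_s must be positive")
    return [
        index
        for index, (start, end) in enumerate(segments)
        if end - start > max_len_s
    ]


if __name__ == "__main__":
    # Placeholder cap of 5 s; the second segment is flagged for re-segmentation.
    clip_segments = [(0.0, 3.0), (3.0, 15.0), (15.0, 19.0)]
    print(over_long_segments(clip_segments, max_len_s=5.0))  # [1]
```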
A second independent model pass flags only the disagreements for human review, turning a black-box API call into an auditable, confidence-scored deliverable.
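A minimal sketch of the disagreement check, assuming each pass yields one dictionary of schema fields per segment and both passes use the same segment boundaries. The field keys, the function name and the confidence formula (share of compared fields that match) are illustrative assumptions, not the pipeline's actual scoring.

```python
from typing import Dict, List, Sequence

Annotation = Dict[str, str]


def flag_disagreements(
    first_pass: Sequence[Annotation],
    second_pass: Sequence[Annotation],
    fields: Sequence[str],
) -> List[dict]:
    """Return review items only for segments where the two passes differ."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same segments")
    if not fields:
        raise ValueError("need at least one field to compare")
    flagged = []
    for index, (a, b) in enumerate(zip(first_pass, second_pass)):
        differing = [f for f in fields if a.get(f) != b.get(f)]
        if differing:
            flagged.append({
                "segment": index,
                "fields": differing,
                # Assumed score: fraction of compared fields that agree.
                "confidence": 1 - len(differing) / len(fields),
            })
    return flagged


if __name__ == "__main__":
    # Hypothetical keys loosely named after two of the schema fields.
    keys = ["motion_phase", "contact_state"]
    pass_a = [{"motion_phase": "reach", "contact_state": "none"},
              {"motion_phase": "grasp", "contact_state": "contact"}]
    pass_b = [{"motion_phase": "reach", "contact_state": "none"},
              {"motion_phase": "lift", "contact_state": "contact"}]
    print(flag_disagreements(pass_a, pass_b, keys))
```

Segments where the passes agree never reach a reviewer, so human effort scales with the number of disagreements rather than with clip length.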