🤖 Robot Policy Evaluation Harness

Bayesian statistics · SPARC smoothness · STL safety constraints

Based on Kress-Gazit et al. (TRI/Cornell) · arXiv:2409.09491

Dataset: phail-anon/phail-v1.0 — 20 stratified episodes from 4 real VLA policies running autonomously on a Franka Research 3 robot. No GPU needed — we're scoring pre-recorded rollouts, not running the policies.

Policy      Type                               Developer
ACT         Action Chunking with Transformers  Academic (Zhao et al.)
GR00T N1.6  Foundation model                   NVIDIA
π0.5        Flow-matching VLA                  Physical Intelligence
SmolVLA     Compact VLA                        Hugging Face

Task: bin-to-bin pick-and-place (batteries, scissors, towels, wooden spoons). Success labels are human-verified from gripper telemetry.
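
Because each policy contributes only a handful of episodes, a raw success percentage is a noisy summary. As a minimal sketch of the Bayesian treatment named in the tagline, the binary success labels can be turned into a Beta posterior over each policy's success rate. The function name `beta_posterior_summary`, the uniform Beta(1, 1) prior, and the example counts are assumptions made for illustration. They are not taken from the harness or the dataset.

```python
import random


def beta_posterior_summary(successes, trials, prior_a=1.0, prior_b=1.0,
                           n_samples=20000, level=0.95, seed=0):
    """Summarize a Beta posterior over a success rate.

    Hypothetical helper: assumes i.i.d. Bernoulli episodes and a
    Beta(prior_a, prior_b) prior (uniform by default).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    # Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior.
    a = prior_a + successes
    b = prior_b + (trials - successes)

    # Approximate the equal-tailed credible interval by Monte Carlo,
    # using only the standard library.
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n_samples))
    lo_idx = int((1 - level) / 2 * (n_samples - 1))
    hi_idx = int((1 + level) / 2 * (n_samples - 1))

    return {"mean": a / (a + b), "lo": draws[lo_idx], "hi": draws[hi_idx]}


# Made-up counts for illustration only, NOT results from the dataset.
summary = beta_posterior_summary(successes=3, trials=5)
print(summary)
```

With so few episodes the credible interval is wide, which is the point: it shows how little a small batch of rollouts can say about the gap between two policies.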