🤖 Robot Policy Evaluation Harness
Bayesian statistics · SPARC smoothness · STL safety constraints
Based on Kress-Gazit et al. (TRI/Cornell) · arXiv:2409.09491
Dataset: phail-anon/phail-v1.0, with 20 stratified episodes from 4 real VLA policies running autonomously on a Franka Research 3 robot.
No GPU needed — we're scoring pre-recorded rollouts, not running the policies.
| Policy | Type | Developer |
|---|---|---|
| ACT | Action Chunking with Transformers | Academic (Zhao et al.) |
| GR00T N1.6 | Foundation model | NVIDIA |
| π0.5 | Diffusion policy VLA | Physical Intelligence |
| SmolVLA | Compact VLA | Hugging Face |
Task: bin-to-bin pick-and-place (batteries, scissors, towels, wooden spoons). Success labels are human-verified from gripper telemetry.
Dataset: lerobot/aloha_static_cups_open, with 50 real episodes of a bimanual ALOHA robot opening a cup lid, collected via human teleoperation.
Adjust the sliders to configure policies, then run the analysis.
Try a real dataset — or upload your own rollouts
Step 1: pick a real robot dataset to download as a ready-to-use CSV:
| Dataset | Robot | DOF | Task |
|---|---|---|---|
| ALOHA bimanual | Stanford ALOHA (2× ViperX) | 14 | Cup opening |
| Push-T real | Columbia delta robot | 8 | Push block to goal |
| Franka Panda | NYU Franka Emika Panda | 13 | Free-play manipulation |
| Unitree H1 | Full-size humanoid | 19 state / 40 action | Warehouse pick-place |
Step 2: upload the downloaded CSV below and click Analyse.
Expected CSV format:
episode_id, policy_name, success, state_0, state_1, ..., state_N, action_0, action_1, ..., action_N
0, A, 0, -0.001, -0.963, 1.173, ...
0, A, 0, -0.013, -0.952, 1.168, ...
...
0, A, 1, ... ← last frame: success=1
1, B, 0, ...
- episode_id: integer, groups frames into one rollout
- policy_name: string, used to group rollouts into comparison groups (omit for single-policy)
- success: 0 or 1 (the value on the last frame of the episode is used)
- state_N: joint position at each timestep (any number of joints)
- action_N: commanded joint position (optional; if absent, effort will be zero)
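As a rough illustration of the format above, the following sketch reads such a CSV with the Python standard library, groups frames into episodes, and takes the success label from the last frame. The function names `load_rollouts` and `success_counts`, the `"default"` policy label, and the sample values are assumptions made for this example, not the harness's own loader.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the documented layout (2 joints, 2 episodes).
SAMPLE = """episode_id, policy_name, success, state_0, state_1, action_0, action_1
0, A, 0, -0.001, -0.963, 0.000, -0.960
0, A, 1, -0.013, -0.952, -0.010, -0.950
1, B, 0, 0.020, -0.900, 0.025, -0.905
1, B, 0, 0.031, -0.880, 0.030, -0.885
"""


def load_rollouts(text):
    """Group CSV frames by episode_id; the last frame's success value wins."""
    # skipinitialspace handles the "a, b, c" style used in the format spec.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)

    def index(col):
        return int(col.split("_")[1])

    state_cols = sorted((c for c in reader.fieldnames if c.startswith("state_")), key=index)
    action_cols = sorted((c for c in reader.fieldnames if c.startswith("action_")), key=index)

    episodes = {}
    for row in reader:
        ep = episodes.setdefault(
            int(row["episode_id"]),
            # policy_name may be omitted for single-policy files.
            {"policy": row.get("policy_name") or "default",
             "states": [], "actions": [], "success": 0},
        )
        ep["states"].append([float(row[c]) for c in state_cols])
        ep["actions"].append([float(row[c]) for c in action_cols])  # empty if no action columns
        ep["success"] = int(row["success"])  # overwritten per frame, so the last frame wins
    return episodes


def success_counts(episodes):
    """Return {policy: (successes, trials)}."""
    counts = defaultdict(lambda: [0, 0])
    for ep in episodes.values():
        counts[ep["policy"]][0] += ep["success"]
        counts[ep["policy"]][1] += 1
    return {policy: tuple(c) for policy, c in counts.items()}


episodes = load_rollouts(SAMPLE)
counts = success_counts(episodes)
```

The per-policy `(successes, trials)` pairs are the only inputs a Bernoulli success-rate analysis needs; the state and action lists feed the motion-quality and safety checks.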
What this is
A lightweight evaluation harness for robot manipulation policies, based on best practices from Kress-Gazit et al. (TRI / Cornell), arXiv:2409.09491.
Much of the field reports a bare success rate from a handful of trials, with no statistical analysis. This tool replaces that with three complementary methods:
① Bayesian Bernoulli Analysis
Models each policy's success probability with a Beta posterior over a Bernoulli success rate rather than a point estimate. Reports the full posterior, the lower bound of the 95% credible interval, and the probability that one policy is genuinely better than another, not just luckier.
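A minimal sketch of this analysis, using only the Python standard library, might look like the following. It assumes a uniform Beta(1, 1) prior, estimates P(A > B) by Monte Carlo sampling, and reads the "lower bound" as the 2.5% quantile of an equal-tailed 95% interval; the prior, the function names, and the trial counts (8/10 vs 6/10) are illustrative assumptions, not necessarily what the harness does.

```python
import random


def posterior_samples(successes, trials, n=100_000, prior=(1.0, 1.0), seed=0):
    """Draw samples from the Beta posterior of a Bernoulli success rate."""
    rng = random.Random(seed)
    a = prior[0] + successes              # prior pseudo-successes + observed successes
    b = prior[1] + trials - successes     # prior pseudo-failures + observed failures
    return [rng.betavariate(a, b) for _ in range(n)]


def lower_bound(samples, level=0.95):
    """Lower end of the equal-tailed credible interval (assumed definition)."""
    tail = (1.0 - level) / 2.0
    ordered = sorted(samples)
    return ordered[int(tail * len(ordered))]


def prob_better(samples_a, samples_b):
    """Monte Carlo estimate of P(rate_A > rate_B) from independent draws."""
    wins = sum(a > b for a, b in zip(samples_a, samples_b))
    return wins / len(samples_a)


# Hypothetical results: policy A succeeded 8/10 times, policy B 6/10 times.
post_a = posterior_samples(8, 10, seed=1)
post_b = posterior_samples(6, 10, seed=2)
p_a_better = prob_better(post_a, post_b)
lb_a = lower_bound(post_a)
```

With only ten trials per policy the posteriors overlap heavily, so `p_a_better` stays well short of certainty even though the raw success rates differ by 20 points.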
"P(A > B) = 0.83" is very different from "A scored 80%, B scored 60%".
② SPARC Smoothness
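Computes the SPectral ARC length (SPARC) of the robot's joint-space speed profile, i.e. the arc length of the profile's normalized Fourier magnitude spectrum; values closer to zero indicate smoother motion. Two policies can have identical success rates but completely different motion quality. A policy that succeeds jerkily is unsafe near people and hard on hardware.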
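The computation can be sketched in pure Python as below. This follows the commonly published SPARC recipe (zero-padded spectrum, normalization by the peak magnitude, a 10 Hz cutoff and a 0.05 amplitude threshold) but uses a naive DFT instead of an FFT library, and treats the joint-space speed as the Euclidean norm of finite-difference joint velocities. Those parameter defaults, the speed definition, and the function names are assumptions for illustration; the harness may differ.

```python
import cmath
import math


def joint_speed(states, fs):
    """Speed profile: norm of the finite-difference joint velocity (assumed definition)."""
    return [fs * math.dist(q1, q0) for q0, q1 in zip(states, states[1:])]


def sparc(speed, fs, pad_level=4, fc=10.0, amp_th=0.05):
    """Spectral arc length of a speed profile; more negative means less smooth."""
    n = len(speed)
    nfft = 2 ** (math.ceil(math.log2(n)) + pad_level)   # zero-padded length
    df = fs / nfft                                       # frequency resolution
    n_bins = min(int(fc / df) + 1, nfft // 2 + 1)        # bins with f <= fc

    # Naive DFT of the zero-padded signal, low-frequency bins only.
    mag = []
    for k in range(n_bins):
        w = -2j * math.pi * k / nfft
        mag.append(abs(sum(v * cmath.exp(w * t) for t, v in enumerate(speed))))

    peak = max(mag)  # for a non-negative speed profile this is the DC bin
    if peak == 0.0:
        return 0.0
    mag = [m / peak for m in mag]

    # Adaptive cutoff: keep the band between the first and last bin above amp_th.
    keep = [k for k, m in enumerate(mag) if m >= amp_th]
    lo, hi = keep[0], keep[-1]
    if hi == lo:
        return 0.0
    span = hi - lo  # frequency axis is normalized to [0, 1] over the kept band
    arc = sum(math.hypot(1.0 / span, mag[k + 1] - mag[k]) for k in range(lo, hi))
    return -arc


# Synthetic 1-second profiles sampled at 100 Hz.
fs = 100.0
tau = [i / 100 for i in range(101)]
smooth = [30 * t**2 * (1 - t) ** 2 for t in tau]  # minimum-jerk bell shape
jerky = [v * (1 + 0.8 * math.sin(2 * math.pi * 8 * t)) for v, t in zip(smooth, tau)]
sparc_smooth = sparc(smooth, fs)
sparc_jerky = sparc(jerky, fs)
```

The 8 Hz ripple on the second profile adds a second lobe to the spectrum, which lengthens the arc and pushes its score further below zero than the bell-shaped profile's.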
③ STL Safety Constraints
Encodes behavioral requirements as Signal Temporal Logic formulas and automatically scores every rollout — no human video review required.
Example: "Whenever the robot is straining (high tracking error), the arm must stay above table height."
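One way to score the example requirement is with the standard quantitative (robustness) semantics of STL: the formula "always (error > err_max implies height > z_table)" has robustness equal to the minimum over time of max(err_max - error, height - z_table), and a rollout satisfies it when that value is positive. The sketch below assumes per-timestep `tracking_error` and `height` signals are already available; the signal names, thresholds, and traces are hypothetical.

```python
def always_implies_robustness(tracking_error, height, err_max, z_table):
    """Robustness of G(tracking_error > err_max -> height > z_table).

    Implication is rewritten as (not straining) OR (above table), so the
    per-step robustness is the max of the two margins; "always" takes the min.
    """
    per_step = [
        max(err_max - e, z - z_table)  # margin of "not straining" vs "above table"
        for e, z in zip(tracking_error, height)
    ]
    return min(per_step)


# Hypothetical thresholds: 0.10 tracking error, table surface at 0.10 m.
ERR_MAX, Z_TABLE = 0.10, 0.10

# Straining at step 2, but the arm is well above the table at that moment.
safe = always_implies_robustness([0.01, 0.20, 0.02], [0.30, 0.25, 0.12], ERR_MAX, Z_TABLE)

# Straining at step 2 while the arm is below table height.
unsafe = always_implies_robustness([0.01, 0.20], [0.30, 0.05], ERR_MAX, Z_TABLE)

safe_ok = safe > 0
unsafe_ok = unsafe > 0
```

Because the two margins have different units, a practical implementation would probably rescale each signal before taking the max; the sign of the result, which decides pass or fail, is unaffected by any positive rescaling.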
Uploading your own data
Any robot with joint-position logging works. The CSV format is:
episode_id, policy_name, success, state_0 … state_N, action_0 … action_N
Citation
@article{kressgazit2024robot,
title = {Robot Learning as an Empirical Science},
author = {Kress-Gazit, Hadas and others},
journal = {arXiv preprint arXiv:2409.09491},
year = {2024}
}