🤖 Robot Policy Evaluation Harness

Bayesian statistics · SPARC smoothness · STL safety constraints

Based on Kress-Gazit et al. (TRI/Cornell) · arXiv:2409.09491

Dataset: phail-anon/phail-v1.0 — 20 stratified episodes from 4 real VLA policies running autonomously on a Franka Research 3 robot. No GPU needed — we're scoring pre-recorded rollouts, not running the policies.

Policy	Type	Developer
ACT	Action Chunking with Transformers	Academic (Zhao et al.)
GR00T N1.6	Foundation model	NVIDIA
π0.5	Diffusion policy VLA	Physical Intelligence
SmolVLA	Compact VLA	HuggingFace

Task: bin-to-bin pick-and-place (batteries, scissors, towels, wooden spoons). Success labels are human-verified from gripper telemetry.

Result tabs: Bayesian Posteriors · P(row beats col) · SPARC Smoothness · Speed Profiles · STL Robustness · Violations · Composite Radar · Final Ranking

Dataset: lerobot/aloha_static_cups_open — 50 real episodes of a bimanual ALOHA robot opening a cup lid, collected via human teleoperation. Adjust the sliders to configure policies, then run the analysis.

Policy	Episodes (slider range)	Success rate % (slider range)
Policy A	5 to 20	10 to 100
Policy B	5 to 20	10 to 100
Policy C	3 to 10	10 to 100

Try a real dataset — or upload your own rollouts

Step 1: pick a real robot dataset to download as a ready-to-use CSV:

Dataset	Robot	DOF	Task
ALOHA bimanual	Stanford ALOHA (2× ViperX)	14	Cup opening
Push-T real	Columbia delta robot	8	Push block to goal
Franka Panda	NYU Franka Emika Panda	13	Free-play manipulation
Unitree H1	Full-size humanoid	19 state / 40 action	Warehouse pick-place

Step 2: upload the downloaded CSV below and hit Analyse.

Expected CSV format:

episode_id, policy_name, success, state_0, state_1, ..., state_N, action_0, action_1, ..., action_N
0, A, 0, -0.001, -0.963, 1.173, ...
0, A, 0, -0.013, -0.952, 1.168, ...
...
0, A, 1, ...   ← last frame: success=1
1, B, 0, ...

episode_id: integer, groups frames into one rollout
policy_name: string, used to group into comparison groups (omit for single-policy)
success: 0 or 1 (the value on the last frame of the episode is the one used)
state_N: joint position at each timestep (any number of joints)
action_N: commanded joint position (optional — if absent, effort will be zero)
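
As a rough illustration of the format above, here is a minimal sketch of grouping such a file into episodes with only the Python standard library. The function name `load_rollouts`, the dictionary layout, and the sample values are assumptions made for this example; this is not the harness's own loader.

```python
import csv
import io

def load_rollouts(text):
    """Group CSV rows into episodes keyed by (policy_name, episode_id)."""
    # skipinitialspace tolerates the ", " separators shown in the template.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    fields = reader.fieldnames or []
    by_index = lambda c: int(c.rsplit("_", 1)[1])
    state_cols = sorted((c for c in fields if c.startswith("state_")), key=by_index)
    action_cols = sorted((c for c in fields if c.startswith("action_")), key=by_index)

    episodes = {}
    for row in reader:
        # A missing policy_name column puts everything in one group (assumed name).
        key = (row.get("policy_name") or "default", int(row["episode_id"]))
        ep = episodes.setdefault(key, {"states": [], "actions": [], "success": 0})
        ep["states"].append([float(row[c]) for c in state_cols])
        ep["actions"].append([float(row[c]) for c in action_cols])
        ep["success"] = int(row["success"])  # overwritten per frame, so the last frame wins
    return episodes

# Illustrative data only, shaped like the template above.
SAMPLE = """episode_id, policy_name, success, state_0, state_1, action_0, action_1
0, A, 0, -0.001, -0.963, 0.000, -0.960
0, A, 1, -0.013, -0.952, -0.010, -0.950
1, B, 0, 0.100, 0.200, 0.100, 0.200
"""

for (policy, episode_id), ep in load_rollouts(SAMPLE).items():
    print(policy, episode_id, len(ep["states"]), "frames, success =", ep["success"])
```

If the action columns are absent, `actions` simply holds empty lists for each frame.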

⬆ Upload rollout CSV (your own or downloaded above)

What this is

A lightweight evaluation harness for robot manipulation policies, based on best practices from Kress-Gazit et al. (TRI / Cornell), arXiv:2409.09491.

The field almost universally reports bare success rate from a handful of trials with no statistical analysis. This tool replaces that with three complementary methods:

① Bayesian Bernoulli Analysis

Models each policy's success probability as a Beta distribution rather than a point estimate. Shows the full posterior, the lower bound of the 95% credible interval, and the probability that one policy is genuinely better than another rather than just luckier.

"P(A > B) = 0.83" is very different from "A scored 80%, B scored 60%".
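
The idea can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the harness's implementation: it assumes a uniform Beta(1, 1) prior, estimates everything by Monte Carlo sampling, and takes the "lower bound" to be the 2.5th percentile of the posterior. The function names and the 8/10 versus 6/10 trial counts are hypothetical.

```python
import random

def beta_posterior_samples(successes, failures, n=20000, seed=0, prior=(1.0, 1.0)):
    """Draw samples from the Beta posterior over a policy's success probability."""
    rng = random.Random(seed)
    a = prior[0] + successes  # prior pseudo-successes plus observed successes
    b = prior[1] + failures   # prior pseudo-failures plus observed failures
    return [rng.betavariate(a, b) for _ in range(n)]

def lower_bound(samples, tail=0.025):
    """Lower end of the central 95% credible interval (the `tail` quantile)."""
    ordered = sorted(samples)
    return ordered[int(tail * len(ordered))]

def prob_greater(samples_a, samples_b):
    """Monte Carlo estimate of P(A > B) from paired posterior draws."""
    wins = sum(a > b for a, b in zip(samples_a, samples_b))
    return wins / len(samples_a)

# Hypothetical results: A succeeds 8/10 times, B succeeds 6/10 times.
post_a = beta_posterior_samples(8, 2, seed=1)
post_b = beta_posterior_samples(6, 4, seed=2)
print("lower bound A:", round(lower_bound(post_a), 3))
print("lower bound B:", round(lower_bound(post_b), 3))
print("P(A > B):", round(prob_greater(post_a, post_b), 3))
```

With only ten trials per policy the posteriors overlap heavily, which is why the comparison probability is more informative than the two raw percentages.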

② SPARC Smoothness

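Computes the SPectral ARC length (SPARC) of the robot's joint-space speed profile: the arc length of its normalized Fourier magnitude spectrum, negated so that values closer to zero indicate smoother motion. Two policies can have identical success rates but completely different motion quality. A policy that succeeds jerkily is unsafe near people and hard on hardware.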
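
For readers who want to see the computation, the following is a minimal standard-library sketch of SPARC. It makes several assumptions: the parameter defaults (zero-padding level 4, 10 Hz cutoff, 0.05 amplitude threshold) follow commonly used values rather than anything stated on this page, the speed profile is taken as the Euclidean norm of finite-difference joint velocities, and a naive DFT stands in for an FFT. The harness may differ on any of these.

```python
import cmath
import math

def speed_profile(states, fs):
    """Joint-space speed: Euclidean norm of finite-difference joint velocities."""
    return [
        math.sqrt(sum(((q1 - q0) * fs) ** 2 for q0, q1 in zip(prev, cur)))
        for prev, cur in zip(states, states[1:])
    ]

def sparc(speed, fs, pad_level=4, fc=10.0, amp_th=0.05):
    """Spectral arc length of a speed profile; closer to zero means smoother."""
    nfft = 2 ** (math.ceil(math.log2(len(speed))) + pad_level)  # zero-padded length
    df = fs / nfft                                              # frequency resolution
    k_max = min(int(fc / df), nfft // 2)                        # bins up to the cutoff
    # Naive DFT, evaluated only for the bins we need (fine for short rollouts).
    mags = [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / nfft) for t, v in enumerate(speed)))
        for k in range(k_max + 1)
    ]
    peak = max(mags)  # for a non-negative speed signal this is the DC bin
    if peak == 0.0:
        raise ValueError("speed profile is identically zero")
    norm = [m / peak for m in mags]
    # Adaptive cutoff: keep the band between the first and last bin above amp_th.
    above = [k for k, m in enumerate(norm) if m >= amp_th]
    first, last = above[0], above[-1]
    if first == last:
        return 0.0
    step = 1.0 / (last - first)  # frequency axis rescaled to [0, 1]
    return -sum(math.hypot(step, norm[k + 1] - norm[k]) for k in range(first, last))

# Synthetic one-second profiles sampled at 100 Hz (illustrative only).
fs = 100.0
t = [i / fs for i in range(101)]
smooth = [30 * x**2 * (1 - x) ** 2 for x in t]  # bell-shaped speed
jerky = [v * (1 + math.sin(2 * math.pi * 8 * x)) for v, x in zip(smooth, t)]
print("smooth:", sparc(smooth, fs), "jerky:", sparc(jerky, fs))
```

The jerky profile adds an 8 Hz ripple to the same bell shape, so its spectrum has an extra lobe and its SPARC value is more negative.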

③ STL Safety Constraints

Encodes behavioral requirements as Signal Temporal Logic formulas and automatically scores every rollout — no human video review required.

Example: "Whenever the robot is straining (high tracking error), the arm must stay above table height."
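
One way to read the example is as the formula G(error > error_limit → height > table_height). Under the standard quantitative semantics of STL, its robustness is the minimum over time of max(error_limit − error, height − table_height): positive when the rule holds with margin, negative when it is violated. The sketch below assumes that reading. The function names, thresholds, and signals are hypothetical, and the end-effector height is assumed to be available (for instance from forward kinematics), since the CSV itself only carries joint positions.

```python
import math

def tracking_error(states, actions):
    """Per-step Euclidean distance between measured and commanded joints."""
    return [math.dist(s, a) for s, a in zip(states, actions)]

def robustness_always_implies(error, height, error_limit, table_height):
    """Robustness of G(error > error_limit -> height > table_height)."""
    per_step = [
        # Implication is (not straining) OR (above table); OR takes the max.
        max(error_limit - e, z - table_height)
        for e, z in zip(error, height)
    ]
    return min(per_step)  # "always" is decided by the worst time step

# Hypothetical rollout: straining at step 1 while still above the table.
err = [0.01, 0.30, 0.02]
z_ok = [0.20, 0.15, 0.01]
print(robustness_always_implies(err, z_ok, error_limit=0.1, table_height=0.05))

# Same errors, but the arm dips below table height while straining.
z_bad = [0.20, 0.01, 0.01]
print(robustness_always_implies(err, z_bad, error_limit=0.1, table_height=0.05))
```

Because the two terms inside `max` have different units (joint error versus metres), a real scorer would probably normalise them before comparing; that step is omitted here.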

Uploading your own data

Any robot with joint-position logging works. The CSV format is:

episode_id, policy_name, success, state_0 … state_N, action_0 … action_N

Citation

@article{kressgazit2024robot,
  title   = {Robot Learning as an Empirical Science},
  author  = {Kress-Gazit, Hadas and others},
  journal = {arXiv preprint arXiv:2409.09491},
  year    = {2024}
}
