OpenMHC Benchmark

OpenMHC Leaderboard

OpenMHC is a wearable and mobile health benchmark built on MyHeartCounts. Track 2a (imputation) reconstructs masked minute-level signals over daily windows; Track 2b (forecasting) predicts future hourly signals. Each method is ranked by its skill score against a track baseline, computed live from the per-user evaluation substrate.

Track 2a · Imputation

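Single-day imputation

| # | Method | Type | R ↓ | S ↑ | S_fair ↑ | Activity ↑ | Physio. ↑ | Sleep ↑ | Workout ↑ | Semantic ↑ | Fallback ↓ | Submitter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LSM-2 (daily) | Neural | 3.8 | +61.4 | +57.6 | +40.0 | +31.4 | +90.0 | +94.9 | +30.2 | 0.0 | OpenMHC team |
| 2 | Linear | Statistical | 7.0 | +21.5 | +34.7 | +4.5 | +9.8 | +62.6 | +56.5 | -0.8 | 0.0 | OpenMHC team |
| 3 | BRITS | Neural | 7.8 | +6.8 | -30.3 | +18.8 | -28.5 | +39.0 | +28.0 | -5.7 | 0.0 | OpenMHC team |
| 4 | DLinear | Neural | 8.2 | -5.7 | +30.1 | +29.3 | -5.1 | -11.1 | +58.2 | -45.9 | 0.0 | OpenMHC team |
| 5 | LOCF (baseline) | Statistical | 8.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | OpenMHC team |
| 6 | Temporal mode | Statistical | 10.0 | -6.2 | +55.9 | +46.5 | -0.7 | -13.9 | -69.4 | -11.8 | 0.0 | OpenMHC team |
| 7 | Temporal mean | Statistical | 10.4 | -31.2 | -28.9 | -30.7 | -18.4 | +59.9 | -2.1 | -93.0 | 0.0 | OpenMHC team |
| 8 | Mode | Statistical | 10.6 | -27.3 | +91.2 | +46.5 | -0.8 | -380.7 | -69.4 | -12.0 | 0.0 | OpenMHC team |
| 9 | TimesNet | Neural | 10.9 | -66.0 | +6.2 | +9.6 | -18.6 | -216.2 | +0.4 | -103.2 | 0.0 | OpenMHC team |
| 10 | FEDformer | Neural | 11.3 | -53.7 | +35.4 | +28.9 | -14.6 | -214.6 | -53.7 | -67.7 | 0.0 | OpenMHC team |
| 11 | Mean | Statistical | 13.4 | -119.7 | +92.2 | -36.3 | -25.5 | -380.7 | -69.4 | -149.8 | 0.0 | OpenMHC team |

Long-context imputation (≥ 7×1440 time steps)

| # | Method | Type | R ↓ | S ↑ | S_fair ↑ | Activity ↑ | Physio. ↑ | Sleep ↑ | Workout ↑ | Semantic ↑ | Fallback ↓ | Submitter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LSM-2-Sparse (7-day) | Neural | 3.3 | +64.7 | +68.2 | +41.0 | +34.6 | +92.2 | +95.7 | +34.6 | 0.0 | OpenMHC team |
| 2 | LSM-2 (7-day) | Neural | 5.5 | +46.9 | +46.2 | +16.7 | +19.0 | +85.6 | +90.6 | +8.7 | 0.0 | OpenMHC team |
| 3 | Pers. temp. mean | Statistical | 9.1 | -7.7 | -50.7 | +1.1 | -5.9 | +58.9 | +15.7 | -49.5 | 0.0 | OpenMHC team |
| 4 | DLinear (7-day) | Neural | 9.5 | -28.3 | +10.2 | +19.9 | -2.7 | -40.0 | +22.9 | -69.5 | 0.0 | OpenMHC team |
| 5 | Pers. mode | Statistical | 10.5 | -26.1 | +76.4 | +46.6 | +1.8 | -383.1 | -69.4 | -10.5 | 0.0 | OpenMHC team |
| 6 | Pers. mean | Statistical | 13.3 | -114.1 | -26.7 | -4.1 | -12.6 | -437.7 | -140.0 | -132.5 | 0.0 | OpenMHC team |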

Metric legend — scores are computed live vs the LOCF (last-observation-carried-forward) baseline.
R (Average Rank) — mean cross-method rank across all masking-scenario × channel tasks; 1 = best (lower is better).
S (Skill Score) — overall % reduction in reconstruction error vs LOCF (paired per-user geometric mean across tasks); higher is better.
S_fair (Fairness skill) — % reduction in the cross-subgroup error disparity (age group + sex, MAPD ratio vs LOCF); higher = more equitable.
Activity / Physio. / Sleep / Workout — per-category skill on that sensor group's channels (activity = steps, distance, flights; physiology = heart rate, active energy; sleep = asleep / in-bed; workout = the 10 workout-type channels).
Semantic — skill on the three structured-gap masking scenarios (sleep gap, workout gap, intensity failure).
Fallback — % of imputed values substituted by the LOCF baseline when the method produced no valid output (lower is better).
Source: MyHeartCounts/OpenMHC-leaderboard-data.
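
To make the S and Fallback definitions above concrete, here is a minimal Python sketch for a single task. It is not the OpenMHC implementation: the function names, the dict-of-per-user-errors input, and the choice to treat None or NaN as "no valid output" are all assumptions made for illustration. The leaderboard additionally aggregates across tasks, which this sketch leaves out.

```python
import math

def skill_score(method_err, baseline_err):
    """Percent error reduction vs. a baseline for one task.

    Assumed reading of the legend: take the per-user ratio
    method_error / baseline_error, average the ratios geometrically
    over paired users, and report 100 * (1 - geometric_mean).
    Both arguments are hypothetical dicts mapping user id -> error.
    """
    # Pair users present in both dicts; the log needs positive errors.
    users = [u for u in method_err
             if u in baseline_err
             and method_err[u] > 0 and baseline_err[u] > 0]
    if not users:
        raise ValueError("no paired users with positive errors")
    mean_log_ratio = sum(
        math.log(method_err[u] / baseline_err[u]) for u in users
    ) / len(users)
    return 100.0 * (1.0 - math.exp(mean_log_ratio))

def fallback_rate(predictions):
    """Percent of imputed values that would need the LOCF fallback.

    Assumption: a prediction counts as invalid if it is None or NaN.
    """
    if not predictions:
        raise ValueError("no predictions given")
    invalid = sum(1 for p in predictions if p is None or math.isnan(p))
    return 100.0 * invalid / len(predictions)

# A method whose error is half the baseline's for every user scores about +50;
# a method identical to the baseline scores about 0.
halved = skill_score({"u1": 1.0, "u2": 2.0}, {"u1": 2.0, "u2": 4.0})
same = skill_score({"u1": 3.0}, {"u1": 3.0})
```

Under this reading the score is bounded above by +100 but unbounded below, which would explain entries such as -380.7 in the Sleep column: an error roughly 4.8 times that of LOCF.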

Submit your model

Add a method by opening a pull request on the OpenMHC leaderboard dataset that adds your per-user evaluation substrate (<track>/<method>.parquet) plus a small <method>.meta.json sidecar. Produce the substrate by running the OpenMHC eval with output_dir=…; the maintainers recompute the skill, fairness, and rank scores from it. See the step-by-step submission guide for the exact file schema.
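
Before opening the pull request, a quick local check that both files are present can save a review round trip. The sketch below uses only the Python standard library. The function name check_submission is hypothetical, and placing the sidecar in the same directory as the parquet file is an assumption; the submission guide remains the authority on layout and schema, and the contents of neither file are validated here beyond confirming that the sidecar parses as JSON.

```python
import json
from pathlib import Path

def check_submission(root, track, method):
    """Return a list of problems found for one method's submission files.

    root   -- local checkout of the leaderboard dataset
    track  -- track directory name (hypothetical example: "track2a")
    method -- method name used in <method>.parquet and <method>.meta.json
    """
    track_dir = Path(root) / track
    substrate = track_dir / f"{method}.parquet"
    # Assumption: the sidecar sits next to the substrate file.
    sidecar = track_dir / f"{method}.meta.json"

    problems = []
    if not substrate.is_file():
        problems.append(f"missing substrate: {substrate}")
    if not sidecar.is_file():
        problems.append(f"missing sidecar: {sidecar}")
    else:
        try:
            json.loads(sidecar.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"sidecar is not valid JSON: {exc}")
    return problems
```

An empty list means both files exist and the sidecar is well-formed JSON; anything else is a human-readable list of what to fix before submitting.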