OpenMHC Benchmark
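Wearable and mobile health benchmark on MyHeartCounts. Track 1 (predictive tasks) predicts weekly health outcomes from 168-hour sensor embeddings; Track 2a (imputation) reconstructs masked daily, minute-level signals; Track 2b (forecasting) predicts future hourly signals. Each method is scored by skill relative to a track baseline, and each table is ordered by average rank (R). All scores are computed live from the per-user evaluation substrate.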
Track 2a · Imputation
| # | Method | Type | R ↓ | S ↑ | S_fair ↑ | Activity ↑ | Physio. ↑ | Sleep ↑ | Workout ↑ | Semantic ↑ | Fallback ↓ | Submitter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Single-day imputation | ||||||||||||
| 1 | LSM-2 (daily) | Neural | 3.8 | +61.4 | +57.6 | +40.0 | +31.4 | +90.0 | +94.9 | +30.2 | 0.0 | OpenMHC team |
| 2 | Linear | Statistical | 7.0 | +21.5 | +34.7 | +4.5 | +9.8 | +62.6 | +56.5 | -0.8 | 0.0 | OpenMHC team |
| 3 | BRITS | Neural | 7.8 | +6.8 | -30.3 | +18.8 | -28.5 | +39.0 | +28.0 | -5.7 | 0.0 | OpenMHC team |
| 4 | DLinear | Neural | 8.2 | -5.7 | +30.1 | +29.3 | -5.1 | -11.1 | +58.2 | -45.9 | 0.0 | OpenMHC team |
| 5 | LOCF (baseline) | Statistical | 8.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | OpenMHC team |
| 6 | Temporal mode | Statistical | 10.0 | -6.2 | +55.9 | +46.5 | -0.7 | -13.9 | -69.4 | -11.8 | 0.0 | OpenMHC team |
| 7 | Temporal mean | Statistical | 10.4 | -31.2 | -28.9 | -30.7 | -18.4 | +59.9 | -2.1 | -93.0 | 0.0 | OpenMHC team |
| 8 | Mode | Statistical | 10.6 | -27.3 | +91.2 | +46.5 | -0.8 | -380.7 | -69.4 | -12.0 | 0.0 | OpenMHC team |
| 9 | TimesNet | Neural | 10.9 | -66.0 | +6.2 | +9.6 | -18.6 | -216.2 | +0.4 | -103.2 | 0.0 | OpenMHC team |
| 10 | FEDformer | Neural | 11.3 | -53.7 | +35.4 | +28.9 | -14.6 | -214.6 | -53.7 | -67.7 | 0.0 | OpenMHC team |
| 11 | Mean | Statistical | 13.4 | -119.7 | +92.2 | -36.3 | -25.5 | -380.7 | -69.4 | -149.8 | 0.0 | OpenMHC team |
| Long-context imputation (≥ 7×1440 time steps) | ||||||||||||
| 1 | LSM-2-Sparse (7-day) | Neural | 3.3 | +64.7 | +68.2 | +41.0 | +34.6 | +92.2 | +95.7 | +34.6 | 0.0 | OpenMHC team |
| 2 | LSM-2 (7-day) | Neural | 5.5 | +46.9 | +46.2 | +16.7 | +19.0 | +85.6 | +90.6 | +8.7 | 0.0 | OpenMHC team |
| 3 | Pers. temp. mean | Statistical | 9.1 | -7.7 | -50.7 | +1.1 | -5.9 | +58.9 | +15.7 | -49.5 | 0.0 | OpenMHC team |
| 4 | DLinear (7-day) | Neural | 9.5 | -28.3 | +10.2 | +19.9 | -2.7 | -40.0 | +22.9 | -69.5 | 0.0 | OpenMHC team |
| 5 | Pers. mode | Statistical | 10.5 | -26.1 | +76.4 | +46.6 | +1.8 | -383.1 | -69.4 | -10.5 | 0.0 | OpenMHC team |
| 6 | Pers. mean | Statistical | 13.3 | -114.1 | -26.7 | -4.1 | -12.6 | -437.7 | -140.0 | -132.5 | 0.0 | OpenMHC team |
Metric legend — scores are computed live vs the LOCF (last-observation-carried-forward) baseline.
R (Average Rank) — mean cross-method rank across all masking-scenario × channel tasks; 1 = best (lower is better).
S (Skill Score) — overall % reduction in reconstruction error vs LOCF (paired per-user geometric mean across tasks); higher is better.
S_fair (Fairness skill) — % reduction in the cross-subgroup error disparity (age group + sex, MAPD ratio vs LOCF); higher = more equitable.
Activity / Physio. / Sleep / Workout — per-category skill on that sensor group's channels (activity = steps, distance, flights; physiology = heart rate, active energy; sleep = asleep / in-bed; workout = the 10 workout-type channels).
Semantic — skill on the three structured-gap masking scenarios (sleep gap, workout gap, intensity failure).
Fallback — % of imputed values substituted by the LOCF baseline when the method produced no valid output (lower is better).
Source: MyHeartCounts/OpenMHC-leaderboard-data.
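The S definition in the legend above can be illustrated with a minimal sketch. It assumes the skill score is 100 × (1 − geometric mean of the paired per-user error ratios, method over baseline); the exact aggregation across tasks used by the maintainers is not specified here, and the function name `skill_score` is hypothetical.

```python
import math


def skill_score(method_err, baseline_err):
    """Percent reduction in error vs. a baseline.

    ASSUMPTION: skill = 100 * (1 - geometric mean of per-user ratios
    method_err / baseline_err). Errors must be strictly positive and
    paired by user (same order in both lists).
    """
    if len(method_err) != len(baseline_err) or not method_err:
        raise ValueError("need equal-length, non-empty paired errors")
    # Geometric mean computed in log space for numerical stability.
    log_ratios = [math.log(m / b) for m, b in zip(method_err, baseline_err)]
    gmean_ratio = math.exp(sum(log_ratios) / len(log_ratios))
    return 100.0 * (1.0 - gmean_ratio)


# Halving every user's error relative to the baseline gives roughly +50.
print(skill_score([0.5, 1.0], [1.0, 2.0]))
```

Under this reading the baseline itself scores 0.0, and the score has no lower bound: a value such as -380.7 would correspond to an error about 4.8 times the baseline's.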
Track 2b · Forecasting
| # | Method | Type | R ↓ | S ↑ | S_fair ↑ | Activity ↑ | Physio. ↑ | Sleep ↑ | Workout ↑ | Fallback ↓ | Submitter |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Chronos-2 (fine-tuned) | Foundation Model | 3.6 | +37.6 | -2.3 | +30.7 | +26.9 | +63.9 | +17.0 | 0.0 | OpenMHC team |
| 2 | Chronos-2 (zero-shot) | Foundation Model | 4.2 | +36.4 | -1.4 | +30.5 | +26.5 | +62.3 | +14.8 | 0.0 | OpenMHC team |
| 3 | SegRNN | Deep Learning | 4.4 | +34.6 | +11.3 | +25.4 | +20.8 | +68.2 | +2.5 | 0.0 | OpenMHC team |
| 4 | DLinear | Deep Learning | 4.6 | +35.9 | +17.9 | +25.0 | +16.5 | +71.5 | +5.5 | 0.0 | OpenMHC team |
| 5 | Toto (fine-tuned, ctx4096) | Foundation Model | 4.7 | +30.9 | -1.5 | +29.5 | +26.1 | +46.1 | +18.8 | 0.0 | OpenMHC team |
| 6 | Toto (zero-shot, ctx4096) | Foundation Model | 5.5 | +26.8 | -9.9 | +29.2 | +6.8 | +50.5 | +11.9 | 0.0 | OpenMHC team |
| 7 | MixLinear | Deep Learning | 5.7 | +29.2 | +11.5 | +23.4 | +13.4 | +64.6 | -7.2 | 0.0 | OpenMHC team |
| 8 | AutoETS | Statistical | 7.1 | +14.3 | -304.2 | +0.6 | -26.8 | +37.6 | +31.4 | 0.0 | OpenMHC team |
| 9 | AutoARIMA | Statistical | 7.6 | +5.9 | -21.0 | -1.8 | -9.0 | +7.0 | +24.0 | 0.0 | OpenMHC team |
| 10 | Seasonal Naive (baseline) | Statistical | 7.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | OpenMHC team |
Metric legend — scores are computed live vs the Seasonal Naive baseline (24-hour-ahead forecasting; MAE on continuous channels, AUROC on binary).
R (Average Rank) — mean cross-method rank across channel tasks; 1 = best (lower is better).
S (Skill Score) — overall category-balanced % reduction in forecast error vs Seasonal Naive (paired per-user geometric mean); higher is better.
S_fair (Fairness skill) — % reduction in the cross-subgroup error disparity (age group + sex, MAPD ratio vs Seasonal Naive); higher = more equitable.
Activity / Physio. / Sleep / Workout — per-category skill on that sensor group's channels (activity = steps, distance, flights; physiology = heart rate, active energy; sleep = asleep / in-bed; workout = the 10 workout-type channels).
Fallback — % of forecasts substituted by the Seasonal Naive baseline when the model produced no valid output (lower is better).
Source: MyHeartCounts/OpenMHC-leaderboard-data.
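To make the baseline and the Fallback column concrete, here is a minimal sketch of a seasonal naive forecast on hourly data and of baseline substitution. It assumes a 24-hour season and a 24-step horizon, and it treats `None` and NaN as "no valid output"; the benchmark's actual validity rule is not stated here, and both function names are hypothetical.

```python
import math


def seasonal_naive(history, horizon=24, period=24):
    """Forecast each future hour with the value observed one period earlier."""
    if len(history) < period:
        raise ValueError("history must cover at least one full period")
    start = len(history) - period
    # Repeat the last observed period for as many steps as requested.
    return [history[start + (h % period)] for h in range(horizon)]


def apply_fallback(forecast, baseline):
    """Fill invalid forecast values from the baseline.

    ASSUMPTION: a value is invalid if it is None or NaN.
    Returns (filled_forecast, percent_substituted).
    """
    if len(forecast) != len(baseline) or not forecast:
        raise ValueError("need equal-length, non-empty sequences")
    filled, substituted = [], 0
    for value, fallback in zip(forecast, baseline):
        invalid = value is None or (isinstance(value, float) and math.isnan(value))
        if invalid:
            substituted += 1
            filled.append(fallback)
        else:
            filled.append(value)
    return filled, 100.0 * substituted / len(forecast)


history = list(range(48))           # two days of toy hourly values
baseline = seasonal_naive(history)  # repeats the most recent 24 values
```

A method whose Fallback is 0.0 never needed this substitution, so its scores reflect only its own output.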
Track 1 · Predictive Tasks
| # | Method | Type | R ↓ | S ↑ | S_fair ↑ | Fallback ↓ | Submitter |
|---|---|---|---|---|---|---|---|
| 1 | LSM-2 | Self-Supervised | 2.4 | +15.0 | +2.3 | 0.0 | OpenMHC team |
| 2 | XGBoost | Statistical | 3.4 | +11.8 | -0.2 | 0.0 | OpenMHC team |
| 3 | MultiRocket | Convolutional | 3.9 | +7.1 | +11.2 | 0.0 | OpenMHC team |
| 4 | WBM | Self-Supervised | 4.4 | +4.0 | -5.9 | 62.8 | OpenMHC team |
| 5 | Linear (baseline) | Statistical | 4.6 | 0.0 | 0.0 | 0.0 | OpenMHC team |
| 6 | GRU-D | Deep Learning | 5.3 | +1.6 | +5.5 | 0.0 | OpenMHC team |
| 7 | Chronos-2 | Foundation Model | 5.9 | -3.7 | +7.4 | 0.0 | OpenMHC team |
| 8 | Toto | Foundation Model | 6.1 | -5.2 | +8.2 | 0.0 | OpenMHC team |
Metric legend — Track 1 predicts weekly health outcomes from 168-hour sensor embeddings; scores are computed vs the Linear baseline.
R (Average Rank) — mean cross-method rank across the outcome tasks; 1 = best (lower is better).
S (Skill Score) — category-balanced % improvement over Linear across tasks (per-task AUPRC / Spearman / Pearson, paired-bootstrap mean); higher is better.
S_fair (Fairness skill) — % reduction in the cross-subgroup error disparity (age group + sex, MAPD ratio vs Linear); higher = more equitable.
Fallback — % of test predictions substituted by the Linear baseline when the method produced no valid output (lower is better).
Source: MyHeartCounts/OpenMHC-leaderboard-data.
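The average rank R can be sketched as follows: rank the methods within each task, then average each method's rank over tasks. This is an illustration only. It assumes tied methods share the mean of the positions they occupy and that every task metric is higher-is-better (as with AUPRC or a correlation); the function name and input layout are hypothetical.

```python
def average_rank(scores, higher_is_better=True):
    """scores: {task: {method: value}} -> {method: mean rank across tasks}.

    ASSUMPTION: ties receive the average of the tied positions.
    """
    totals, counts = {}, {}
    for by_method in scores.values():
        ordered = sorted(by_method.items(), key=lambda kv: kv[1],
                         reverse=higher_is_better)
        i = 0
        while i < len(ordered):
            # Extend j to cover the run of methods tied with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            tied_rank = ((i + 1) + (j + 1)) / 2  # mean of 1-based positions
            for method, _ in ordered[i:j + 1]:
                totals[method] = totals.get(method, 0.0) + tied_rank
                counts[method] = counts.get(method, 0) + 1
            i = j + 1
    return {m: totals[m] / counts[m] for m in totals}


toy = {
    "task_1": {"a": 0.9, "b": 0.5, "c": 0.5},
    "task_2": {"a": 0.2, "b": 0.8, "c": 0.4},
}
print(average_rank(toy))
```

Because R averages positions while S averages magnitudes, a method can rank well on R yet trail on S if it wins many tasks narrowly and loses a few badly.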
Submit your model
Add a method by opening a pull request on the OpenMHC leaderboard dataset that adds your per-user evaluation substrate (`<track>/<method>.parquet`) plus a small `<method>.meta.json` sidecar. Produce the substrate by running the OpenMHC eval with `output_dir=…`; the maintainers recompute the skill, fairness, and rank scores from it. See the step-by-step submission guide for the exact file schema.
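Before opening the pull request, a quick local check that both files are in place can save a review round. The sketch below only tests file names. It assumes the sidecar sits next to the substrate inside the track directory, which the submission guide may define differently; the function name and the example track and method names are hypothetical, and the file contents are not validated.

```python
from pathlib import Path


def missing_submission_files(repo_root, track, method):
    """Return the expected submission files that are not present.

    ASSUMPTION: <method>.meta.json lives beside <track>/<method>.parquet.
    Paths are returned relative to repo_root with forward slashes.
    """
    root = Path(repo_root)
    expected = [
        root / track / f"{method}.parquet",    # per-user evaluation substrate
        root / track / f"{method}.meta.json",  # metadata sidecar
    ]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


# Hypothetical usage against a local clone of the leaderboard dataset:
# missing_submission_files("path/to/clone", "some_track", "my_method")
```

An empty list means both files were found under the assumed layout.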