OpenMHC Leaderboard

Track 2a · Imputation

#	Method	Type	R ↓	S ↑	S_fair ↑	Activity ↑	Physio. ↑	Sleep ↑	Workout ↑	Semantic ↑	Fallback ↓	Submitter
Single-day imputation
1	LSM-2 (daily)	Neural	3.8	+61.4	+57.6	+40.0	+31.4	+90.0	+94.9	+30.2	0.0	OpenMHC team
2	Linear	Statistical	7.0	+21.5	+34.7	+4.5	+9.8	+62.6	+56.5	-0.8	0.0	OpenMHC team
3	BRITS	Neural	7.8	+6.8	-30.3	+18.8	-28.5	+39.0	+28.0	-5.7	0.0	OpenMHC team
4	DLinear	Neural	8.2	-5.7	+30.1	+29.3	-5.1	-11.1	+58.2	-45.9	0.0	OpenMHC team
5	LOCF (baseline)	Statistical	8.4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	OpenMHC team
6	Temporal mode	Statistical	10.0	-6.2	+55.9	+46.5	-0.7	-13.9	-69.4	-11.8	0.0	OpenMHC team
7	Temporal mean	Statistical	10.4	-31.2	-28.9	-30.7	-18.4	+59.9	-2.1	-93.0	0.0	OpenMHC team
8	Mode	Statistical	10.6	-27.3	+91.2	+46.5	-0.8	-380.7	-69.4	-12.0	0.0	OpenMHC team
9	TimesNet	Neural	10.9	-66.0	+6.2	+9.6	-18.6	-216.2	+0.4	-103.2	0.0	OpenMHC team
10	FEDformer	Neural	11.3	-53.7	+35.4	+28.9	-14.6	-214.6	-53.7	-67.7	0.0	OpenMHC team
11	Mean	Statistical	13.4	-119.7	+92.2	-36.3	-25.5	-380.7	-69.4	-149.8	0.0	OpenMHC team
Long-context imputation (≥ 7×1440 time steps)
1	LSM-2-Sparse (7-day)	Neural	3.3	+64.7	+68.2	+41.0	+34.6	+92.2	+95.7	+34.6	0.0	OpenMHC team
2	LSM-2 (7-day)	Neural	5.5	+46.9	+46.2	+16.7	+19.0	+85.6	+90.6	+8.7	0.0	OpenMHC team
3	Pers. temp. mean	Statistical	9.1	-7.7	-50.7	+1.1	-5.9	+58.9	+15.7	-49.5	0.0	OpenMHC team
4	DLinear (7-day)	Neural	9.5	-28.3	+10.2	+19.9	-2.7	-40.0	+22.9	-69.5	0.0	OpenMHC team
5	Pers. mode	Statistical	10.5	-26.1	+76.4	+46.6	+1.8	-383.1	-69.4	-10.5	0.0	OpenMHC team
6	Pers. mean	Statistical	13.3	-114.1	-26.7	-4.1	-12.6	-437.7	-140.0	-132.5	0.0	OpenMHC team

Metric legend — scores are computed live vs the LOCF (last-observation-carried-forward) baseline.
R (Average Rank) — mean cross-method rank across all masking-scenario × channel tasks; 1 = best (lower is better).
S (Skill Score) — overall % reduction in reconstruction error vs LOCF (paired per-user geometric mean across tasks); higher is better.
S_fair (Fairness skill) — % reduction in the cross-subgroup error disparity (age group + sex, MAPD ratio vs LOCF); higher = more equitable.
Activity / Physio. / Sleep / Workout — per-category skill on that sensor group's channels (activity = steps, distance, flights; physiology = heart rate, active energy; sleep = asleep / in-bed; workout = the 10 workout-type channels).
Semantic — skill on the three structured-gap masking scenarios (sleep gap, workout gap, intensity failure).
Fallback — % of imputed values substituted by the LOCF baseline when the method produced no valid output (lower is better).
Source: MyHeartCounts/OpenMHC-leaderboard-data.
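The LOCF baseline and the skill score described above can be sketched in a few lines of Python. This is a minimal sketch, not the OpenMHC implementation. It assumes that missing samples are marked with None, that each user contributes one positive reconstruction error for a task, and that "paired per-user geometric mean" means the geometric mean of per-user error ratios (method error divided by LOCF error). The names locf and skill_score are hypothetical.

```python
import math


def locf(values):
    """Fill each missing value (None) with the most recent observed value.

    Leading gaps, which have no earlier observation to carry forward, stay None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def skill_score(method_errors, baseline_errors):
    """Percent error reduction vs a baseline, from paired per-user errors.

    Assumed form (hypothetical): S = 100 * (1 - geometric mean of the
    per-user ratios method_error / baseline_error). Errors must be strictly
    positive so that every log-ratio is defined.
    """
    if len(method_errors) != len(baseline_errors) or not method_errors:
        raise ValueError("need one non-empty, paired error per user")
    log_ratios = []
    for m, b in zip(method_errors, baseline_errors):
        if m <= 0 or b <= 0:
            raise ValueError("errors must be positive")
        log_ratios.append(math.log(m / b))
    # Averaging in log space and exponentiating gives the geometric mean ratio.
    geo_mean_ratio = math.exp(sum(log_ratios) / len(log_ratios))
    return 100.0 * (1.0 - geo_mean_ratio)


print(locf([None, 3.0, None, None, 5.0, None]))  # [None, 3.0, 3.0, 3.0, 5.0, 5.0]
```

Under this reading, LOCF scored against itself gets exactly 0, matching its row in the table. Because the ratios are averaged in log space, halving one user's error while doubling another's, as in skill_score([2.0, 0.5], [1.0, 1.0]), scores approximately 0 rather than a net gain.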

Track 2b · Forecasting

#	Method	Type	R ↓	S ↑	S_fair ↑	Activity ↑	Physio. ↑	Sleep ↑	Workout ↑	Submitter
1	Chronos-2 (fine-tuned)	Foundation Model	3.6	+37.6	-2.3	+30.7	+26.9	+63.9	+17.0	OpenMHC team
2	Chronos-2 (zero-shot)	Foundation Model	4.2	+36.4	-1.4	+30.5	+26.5	+62.3	+14.8	OpenMHC team
3	SegRNN	Deep Learning	4.4	+34.6	+11.3	+25.4	+20.8	+68.2	+2.5	OpenMHC team
4	DLinear	Deep Learning	4.6	+35.9	+17.9	+25.0	+16.5	+71.5	+5.5	OpenMHC team
5	Toto (fine-tuned, ctx4096)	Foundation Model	4.7	+30.9	-1.5	+29.5	+26.1	+46.1	+18.8	OpenMHC team
6	Toto (zero-shot, ctx4096)	Foundation Model	5.5	+26.8	-9.9	+29.2	+6.8	+50.5	+11.9	OpenMHC team
7	MixLinear	Deep Learning	5.7	+29.2	+11.5	+23.4	+13.4	+64.6	-7.2	OpenMHC team
8	AutoETS	Statistical	7.1	+14.3	-304.2	+0.6	-26.8	+37.6	+31.4	OpenMHC team
9	AutoARIMA	Statistical	7.6	+5.9	-21.0	-1.8	-9.0	+7.0	+24.0	OpenMHC team
10	Seasonal Naive (baseline)	Statistical	7.7	0.0	0.0	0.0	0.0	0.0	0.0	OpenMHC team

Metric legend — scores are computed live vs the Seasonal Naive baseline (24-hour-ahead forecasting; MAE on continuous channels, AUROC on binary).
R (Average Rank) — mean cross-method rank across channel tasks; 1 = best (lower is better).
S (Skill Score) — overall category-balanced % reduction in forecast error vs Seasonal Naive (paired per-user geometric mean); higher is better.
S_fair (Fairness skill) — % reduction in the cross-subgroup error disparity (age group + sex, MAPD ratio vs Seasonal Naive); higher = more equitable.
Activity / Physio. / Sleep / Workout — per-category skill on that sensor group's channels (activity = steps, distance, flights; physiology = heart rate, active energy; sleep = asleep / in-bed; workout = the 10 workout-type channels).
Fallback — % of forecasts substituted by the Seasonal Naive baseline when the model produced no valid output (lower is better).
Source: MyHeartCounts/OpenMHC-leaderboard-data.
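The Seasonal Naive baseline and the average rank R can be sketched as below, under stated assumptions. The forecast repeats the last observed season; season_length would be the number of steps in 24 hours at the data's sampling rate, which this page does not specify. For ranking, each task is assumed to supply one lower-is-better error per method (for binary channels, something like 1 − AUROC), and tied methods are assumed to share the mean of the ranks they span. Both function names are hypothetical.

```python
def seasonal_naive(history, horizon, season_length):
    """Forecast each future step as the value observed one season earlier.

    Steps beyond one season wrap around and reuse the same last season.
    """
    if season_length <= 0 or len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


def average_rank(errors_by_task):
    """Mean rank per method across tasks (rank 1 = lowest error).

    errors_by_task maps task -> {method: error}; tied methods share the
    mean of the ranks they occupy (an assumed tie rule).
    """
    totals, counts = {}, {}
    for scores in errors_by_task.values():
        ordered = sorted(scores.items(), key=lambda kv: kv[1])
        i = 0
        while i < len(ordered):
            # Extend j over the run of methods tied with ordered[i].
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            tied_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
            for method, _ in ordered[i : j + 1]:
                totals[method] = totals.get(method, 0.0) + tied_rank
                counts[method] = counts.get(method, 0) + 1
            i = j + 1
    return {method: totals[method] / counts[method] for method in totals}


print(seasonal_naive([1, 2, 3, 4, 5, 6], horizon=4, season_length=3))  # [4, 5, 6, 4]
tasks = {
    "steps": {"A": 1.0, "B": 2.0, "C": 2.0},
    "heart_rate": {"A": 3.0, "B": 1.0, "C": 2.0},
}
print(average_rank(tasks))  # {'A': 2.0, 'B': 1.75, 'C': 2.25}
```

Averaging ranks rather than raw errors keeps channels with large error magnitudes (such as step counts) from dominating R.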

Track 1 · Predictive Tasks

#	Method	Type	R ↓	S ↑	S_fair ↑	Fallback ↓	Submitter
1	LSM-2	Self-Supervised	2.4	+15.0	+2.3	0.0	OpenMHC team
2	XGBoost	Statistical	3.4	+11.8	-0.2	0.0	OpenMHC team
3	MultiRocket	Convolutional	3.9	+7.1	+11.2	0.0	OpenMHC team
4	WBM	Self-Supervised	4.4	+4.0	-5.9	62.8	OpenMHC team
5	Linear (baseline)	Statistical	4.6	0.0	0.0	0.0	OpenMHC team
6	GRU-D	Deep Learning	5.3	+1.6	+5.5	0.0	OpenMHC team
7	Chronos-2	Foundation	5.9	-3.7	+7.4	0.0	OpenMHC team
8	Toto	Foundation	6.1	-5.2	+8.2	0.0	OpenMHC team

Metric legend — Track 1 predicts weekly health outcomes from 168-hour sensor embeddings; scores are computed vs the Linear baseline.
R (Average Rank) — mean cross-method rank across the outcome tasks; 1 = best (lower is better).
S (Skill Score) — category-balanced % improvement over Linear across tasks (per-task AUPRC / Spearman / Pearson, paired-bootstrap mean); higher is better.
S_fair (Fairness skill) — % reduction in the cross-subgroup error disparity (age group + sex, MAPD ratio vs Linear); higher = more equitable.
Fallback — % of test predictions substituted by the Linear baseline when the method produced no valid output (lower is better).
Source: MyHeartCounts/OpenMHC-leaderboard-data.
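The fallback rule can be sketched as follows. This is a sketch under assumptions: an invalid output is taken to be a missing value (None) or NaN, and the method's predictions are aligned one-to-one with the baseline's. apply_fallback is a hypothetical name, not part of the OpenMHC code.

```python
import math


def apply_fallback(predictions, baseline):
    """Replace invalid predictions (None or NaN) with the baseline's prediction.

    Returns (merged predictions, percentage of entries that fell back).
    """
    if len(predictions) != len(baseline):
        raise ValueError("predictions and baseline must be aligned")
    merged, n_fallback = [], 0
    for p, b in zip(predictions, baseline):
        invalid = p is None or (isinstance(p, float) and math.isnan(p))
        merged.append(b if invalid else p)
        n_fallback += invalid  # True counts as 1, False as 0
    pct = 100.0 * n_fallback / len(predictions) if predictions else 0.0
    return merged, pct


merged, pct = apply_fallback([0.2, None, float("nan"), 0.9], [0.5, 0.5, 0.5, 0.5])
print(merged, pct)  # [0.2, 0.5, 0.5, 0.9] 50.0
```

Because substituted entries are the baseline's own predictions, a high fallback rate such as WBM's 62.8 means that most of that method's evaluated test predictions are the Linear baseline's.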

Submit your model

Add a method by opening a pull request on the OpenMHC leaderboard dataset. The pull request should add your per-user evaluation substrate (<track>/<method>.parquet) plus a small <method>.meta.json sidecar. Produce the substrate by running the OpenMHC eval with output_dir=…; the maintainers recompute the skill, fairness, and rank scores from it. See the step-by-step submission guide for the exact file schema.