How the model’s predictions compare to actual eBird checklist diversity at Wendy Park: 79 days, 2024-04-26 → 2025-06-05.
Pearson r(predicted, actual) = +0.707 (79 days, days with zero checklists excluded)
Each row is one verdict bucket: mean / median / range of actual eBird species counts on the days the model placed there. A useful model walks mean species downward monotonically as the verdict gets gloomier.
| verdict | n | mean spp | median | range |
|---|---|---|---|---|
| DEFINITELY_GO | 15 | 77.6 | 78.0 | 36–115 |
| GO | 33 | 66.1 | 66.0 | 30–106 |
| MARGINAL | 28 | 43.8 | 37.5 | 17–82 |
| SKIP | 3 | 31.7 | 34.0 | 20–41 |

A Pearson r between roughly +0.4 and +0.8 means the model orders days correctly more often than not: useful for ranking days, not for predicting an exact species count.

Verdict bins are honest if mean species walks downward from DEFINITELY_GO to SKIP. A SKIP bucket that outperforms GO is a sign the veto layer is over-eager.

Misses are the right place to look first when something feels off; they are the days a future calibration pass needs to either explain or absorb.

Biggest disagreements between the model and the day’s actual checklist diversity, ranked by standardized residual. Under-predictions first: days where actual diversity landed far above the forecast.

| date | predicted | verdict | actual | amplifiers / vetos |
|---|---|---|---|---|
| 2024-05-18 | 6.74 | GO | 106 | — |
| 2025-04-29 | 5.37 | MARGINAL | 82 | — |
| 2025-05-21 | 4.92 | MARGINAL | 74 | — |
| 2025-05-22 | 4.67 | MARGINAL | 64 | — |
| 2025-05-23 | 4.83 | MARGINAL | 66 | — |
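The bin-honesty check (mean species falling monotonically from DEFINITELY_GO to SKIP) is mechanical enough to script. A minimal sketch, using the means from the calibration table; the verdict ladder ordering is the one the report implies.

```python
# Bin-honesty check: mean actual species should fall monotonically as the
# verdict worsens. Means are taken from the calibration table above.
VERDICT_ORDER = ["DEFINITELY_GO", "GO", "MARGINAL", "SKIP"]
mean_species = {
    "DEFINITELY_GO": 77.6,
    "GO": 66.1,
    "MARGINAL": 43.8,
    "SKIP": 31.7,
}

means = [mean_species[v] for v in VERDICT_ORDER]
monotone = all(a > b for a, b in zip(means, means[1:]))
print("bins walk downward:", monotone)
```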
Over-predictions: days where actual diversity landed far below the forecast.

| date | predicted | verdict | actual | amplifiers / vetos |
|---|---|---|---|---|
| 2024-05-12 | 8.03 | DEFINITELY_GO | 36 | +peak_week_geometry |
| 2025-05-12 | 8.27 | DEFINITELY_GO | 56 | +peak_week_geometry |
| 2025-05-18 | 6.47 | GO | 31 | — |
| 2024-05-16 | 7.84 | DEFINITELY_GO | 52 | +peak_week_geometry |
| 2024-04-26 | 5.36 | MARGINAL | 17 | — |