Backfill College Data
Operational runbook
Sources and cadence
| Source | Convex action | Cron | Update frequency |
|---|---|---|---|
| College Scorecard | features/colleges/actions:scorecardBackfill | Monthly, 1st of month 06:00 UTC | Annual upstream (safe to re-run monthly) |
| Urban EADA (athletics) | features/colleges/actions:fetchAndCacheAthleticsForCollege | None — manual trigger per school, or run per-school loop | Annual upstream (~5yr physical dataset, monthly is safe) |
| EPA Walkability | features/colleges/actions:fetchAndCacheWalkabilityForCollege | None — manual trigger per school | ~5yr upstream release cycle |
| Urban IPEDS (app fee) | features/colleges/actions:fetchAndCacheIpedsForCollege | None — manual trigger per school | Annual IPEDS release |
The Scorecard backfill is the primary dataset. EADA, EPA, and IPEDS patch additional blocks onto existing colleges rows — they require the Scorecard row to exist first.
Initial deploy: run all three in order
Run these in sequence. Steps 2 and 3 patch the colleges rows that Step 1 creates, so Step 1 must finish first.
Step 1: Scorecard backfill (run first)
npx convex run features/colleges/actions:scorecardBackfill
This paginates the full Scorecard dataset (operating institutions, predominant degrees 0–4 post-Unit 29, size 10+), scheduling an upsert mutation per school via ctx.scheduler.runAfter. The upsert uses a content-hash check, so unchanged rows are skipped. Every upsert also calls categorizeFromRawRow (Plan Unit 9) and stamps primaryCategory, categoryVersion, contentVersion, tags, parentUnitId, isBranch, and state (promoted to top level). When the category matches, atomic aggregate fan-out into aggregateWorkforce / aggregateTransferOutcome happens in the same transaction.
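The content-hash skip described above can be sketched as follows. This is a dependency-free illustration, not the project's upsert mutation: the `CollegeRow` shape, the in-memory `hashStore`, and the FNV-1a-style hash are all assumptions made to keep the example self-contained, and the real mutation may hash different fields with a different function.

```typescript
// Hypothetical, trimmed row shape; the real colleges schema has many more fields.
type CollegeRow = { unitId: number; name: string; state: string };
type StoredRow = { row: CollegeRow; contentHash: string };

// Serialize with sorted keys so property order never changes the hash.
function stableStringify(row: CollegeRow): string {
  const record = row as Record<string, unknown>;
  const keys = Object.keys(record).sort();
  return JSON.stringify(keys.map((key) => [key, record[key]]));
}

// FNV-1a-style 32-bit hash over UTF-16 code units (an assumption for this sketch).
function contentHash(row: CollegeRow): string {
  const text = stableStringify(row);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Returns "written" when the row is new or changed, "skipped" when the hash matches.
function upsertCollege(
  store: Map<number, StoredRow>,
  row: CollegeRow,
): "written" | "skipped" {
  const hash = contentHash(row);
  const existing = store.get(row.unitId);
  if (existing !== undefined && existing.contentHash === hash) {
    return "skipped"; // unchanged upstream data: no database write
  }
  store.set(row.unitId, { row, contentHash: hash });
  return "written";
}

const hashStore = new Map<number, StoredRow>();
const first = upsertCollege(hashStore, { unitId: 1, name: "Example College", state: "CA" });
// Same content with a different property order: the hash matches, so it is skipped.
const second = upsertCollege(hashStore, { state: "CA", name: "Example College", unitId: 1 });
// One field changed upstream: the hash differs, so the row is rewritten.
const third = upsertCollege(hashStore, { unitId: 1, name: "Example College", state: "CO" });
console.log(first, second, third);
```

Because a matching hash returns before any write, re-running the backfill against unchanged upstream data stays cheap.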
Expected output after the Unit 29 universe expansion: { scheduled: ~6200, pages: ~62 }. The action itself returns quickly. The scheduled background mutations finish within ~5 minutes on a routine monthly run and take longer (~10 min) on the first universe-expansion run. Watch the Convex dashboard Functions → Scheduled tab to see the queue drain.
Pre-deploy on rehearsal: count category re-assignments. If the existing 4,336 rows produce more than 50 re-categorizations against your last prod snapshot, abort and investigate the categorize tree before letting the prod cron run.
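One way to count re-assignments is to diff two unitId → primaryCategory maps. The sketch below assumes both snapshots have already been exported into memory. The category strings and unit IDs are made-up placeholders, and only the threshold of 50 comes from this runbook.

```typescript
// unitId -> primaryCategory; the category strings used below are placeholders.
type CategorySnapshot = Map<number, string>;

const RECATEGORIZATION_ABORT_THRESHOLD = 50; // the runbook's "more than 50" rule

// Counts schools present in both snapshots whose primaryCategory differs.
// Schools that exist in only one snapshot are new or removed rows, not re-assignments.
function countRecategorizations(
  prod: CategorySnapshot,
  rehearsal: CategorySnapshot,
): number {
  let changed = 0;
  prod.forEach((category, unitId) => {
    const next = rehearsal.get(unitId);
    if (next !== undefined && next !== category) changed++;
  });
  return changed;
}

function shouldAbort(changed: number): boolean {
  return changed > RECATEGORIZATION_ABORT_THRESHOLD;
}

const prodSnapshot: CategorySnapshot = new Map([
  [1, "category-a"],
  [2, "category-a"],
  [3, "category-b"],
]);
const rehearsalSnapshot: CategorySnapshot = new Map([
  [1, "category-a"], // unchanged
  [2, "category-b"], // re-categorized
  [4, "category-b"], // new row, ignored
]);

const changedCount = countRecategorizations(prodSnapshot, rehearsalSnapshot);
console.log(changedCount, shouldAbort(changedCount)); // 1 false
```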
App Store window: per the Plan Risks table, universe expansion (Unit 29) MUST NOT land within 72h of an App Store submission. Reviewer testing during a backfill produces flaky empty states.
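A deploy script could enforce the 72h rule with a small guard like the one below. This is a sketch under assumptions: the submission timestamp is a made-up value, and "within 72h" is read as covering both the hours before and the hours after the submission, which the Plan Risks table may define more narrowly.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const UNIVERSE_EXPANSION_BLACKOUT_HOURS = 72; // from the rule quoted above

// True when `now` is closer than `windowHours` to the submission time, on either side.
function isInsideSubmissionWindow(
  now: Date,
  submission: Date,
  windowHours: number,
): boolean {
  return Math.abs(submission.getTime() - now.getTime()) < windowHours * HOUR_MS;
}

const submissionAt = new Date("2025-03-10T17:00:00Z"); // hypothetical submission time
const twoDaysBefore = new Date("2025-03-08T17:00:00Z"); // 48h before
const fourDaysBefore = new Date("2025-03-06T17:00:00Z"); // 96h before

const blockedAt48h = isInsideSubmissionWindow(
  twoDaysBefore, submissionAt, UNIVERSE_EXPANSION_BLACKOUT_HOURS);
const blockedAt96h = isInsideSubmissionWindow(
  fourDaysBefore, submissionAt, UNIVERSE_EXPANSION_BLACKOUT_HOURS);
console.log(blockedAt48h, blockedAt96h); // true false
```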
Do not proceed to steps 2–3 until the scheduled upserts are done (queue is empty).
Step 2: EADA athletics backfill
EADA does not have a dedicated bulk backfill action yet. Run per-school via the existing fetchAndCacheAthleticsForCollege action, or use the runDerivedBackfill pattern as a model if you need to add a bulk action.
For an initial full backfill, the recommended approach is to add a bulk action (eadaBackfill) in features/colleges/actions.ts following the scorecardBackfill pattern, then run:
# When eadaBackfill action exists:
npx convex run features/colleges/actions:eadaBackfill
Expected wall time for full dataset: ~57 minutes (Urban EADA API is rate-limited; the action cache handles re-runs cheaply after first pass).
Step 3: EPA walkability backfill
Walkability requires lat/lon to be present on the college doc (populated by Scorecard — hence the ordering dependency). Same pattern as EADA:
# When epaBackfill action exists:
npx convex run features/colleges/actions:epaBackfill
Expected wall time for full dataset: ~34 minutes (EPA NatWalkInd API, block-group resolution). The walkabilityCache TTL is 180 days, so re-runs within that window return cached values and are very fast.
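The TTL behaviour can be pictured with a simple freshness check. The sketch below is not the @convex-dev/action-cache implementation. `CacheEntry` and `isFresh` are hypothetical names, and only the 180-day figure comes from this runbook.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const WALKABILITY_TTL_DAYS = 180; // from the runbook

// Hypothetical cache entry shape; a real cache stores more than this.
type CacheEntry = { value: number; fetchedAt: number };

// An entry is reused only while it is younger than the TTL.
function isFresh(entry: CacheEntry | undefined, now: number, ttlDays: number): boolean {
  return entry !== undefined && now - entry.fetchedAt < ttlDays * DAY_MS;
}

const entry: CacheEntry = { value: 12.5, fetchedAt: Date.UTC(2025, 0, 1) };

// 151 days after the fetch: still inside the window, so no upstream call is needed.
const freshInJune = isFresh(entry, Date.UTC(2025, 5, 1), WALKABILITY_TTL_DAYS);
// 195 days after the fetch: expired, so the next run hits the EPA API again.
const freshInMidJuly = isFresh(entry, Date.UTC(2025, 6, 15), WALKABILITY_TTL_DAYS);
console.log(freshInJune, freshInMidJuly); // true false
```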
Monthly cron details
The Scorecard refresh cron is declared in packages/backend/convex/crons.ts:
crons.monthly(
  "scorecard refresh",
  { day: 1, hourUTC: 6, minuteUTC: 0 },
  internal.features.colleges.actions.scorecardBackfill,
  {},
);
Fires at 06:00 UTC on the 1st of each month. This covers the annual Scorecard data refresh (released in late calendar year). The content-hash check on upsert means months with no upstream changes are essentially no-ops (hashes match → no DB write).
To verify the cron ran: open the Convex dashboard → Cron Jobs → "scorecard refresh" → Last Run. Scorecard does not write a backfillMetadata row (see "To check last backfill run per source" under Failure recovery), so the cron's last run time is the signal to check.
To temporarily disable: Comment out the crons.monthly(...) block in crons.ts and deploy. Re-enable by reverting and re-deploying. Do not delete the cron definition — that requires an explicit Convex deployment step to take effect.
Failure recovery
All three backfill actions use the chunked-resumable pattern from features/colleges/internal.ts:backfillDerivedFields as the reference implementation. The pattern:
- Actions paginate via ctx.scheduler.runAfter (Scorecard) or cursor-based loops (derived, EADA, EPA)
- Each chunk is an independent mutation, so partial failures leave already-processed rows intact
- Re-running from the top is safe: the upsert hash check and the per-school athleticsCache / walkabilityCache / ipedsCache (all backed by @convex-dev/action-cache) skip no-op work
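The pattern in the list above can be illustrated with a plain, synchronous cursor loop. This is a sketch of the general technique, not the code in backfillDerivedFields: `processChunk`, `runBackfill`, and the simulated failure are hypothetical, and the real implementation runs each chunk as a Convex mutation.

```typescript
type ChunkResult = { processed: number; nextCursor: number | null };

// Handles ids[cursor, cursor + chunkSize) and reports where to resume.
// nextCursor is null once the end of the list has been reached.
function processChunk(
  ids: number[],
  cursor: number,
  chunkSize: number,
  handle: (unitId: number) => void,
): ChunkResult {
  const chunk = ids.slice(cursor, cursor + chunkSize);
  chunk.forEach(handle);
  const next = cursor + chunk.length;
  return { processed: chunk.length, nextCursor: next < ids.length ? next : null };
}

// Drives chunks until done. A failure stops the run but keeps earlier chunks' work.
function runBackfill(
  ids: number[],
  chunkSize: number,
  handle: (unitId: number) => void,
): { failedAtCursor: number | null } {
  if (chunkSize < 1) throw new RangeError("chunkSize must be at least 1");
  let cursor: number | null = 0;
  while (cursor !== null) {
    try {
      cursor = processChunk(ids, cursor, chunkSize, handle).nextCursor;
    } catch {
      return { failedAtCursor: cursor }; // cursor still points at the failed chunk
    }
  }
  return { failedAtCursor: null };
}

const done = new Set<number>();
let failOnce = true;
const handler = (unitId: number): void => {
  if (unitId === 4 && failOnce) {
    failOnce = false;
    throw new Error("simulated upstream failure");
  }
  done.add(unitId); // idempotent: re-adding an id is a no-op
};

const firstRun = runBackfill([1, 2, 3, 4, 5], 2, handler); // fails in chunk [3, 4]
const doneAfterFirstRun = done.size; // ids 1, 2, 3 survived the failure
const secondRun = runBackfill([1, 2, 3, 4, 5], 2, handler); // re-run from the top
console.log(firstRun.failedAtCursor, doneAfterFirstRun, secondRun.failedAtCursor, done.size);
```

The second run repeats ids 1 through 3, which is harmless only because the handler is idempotent. That is the role the hash check and the per-school caches play in the real backfills.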
To resume a partially failed Scorecard backfill:
Simply re-run. The Scorecard action re-pages from the start, but the content-hash check on upsert skips rows that are already current. Already-cached schools are fast (ActionCache TTL: 24h for Scorecard).
To check if a school was backfilled:
npx convex run features/colleges/internal:byUnitId '{"unitId":110635}'
Look for the updatedAt timestamp and the presence of the athletics.fetchedAt / walkability.fetchedAt / ipeds.fetchedAt fields.
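When checking many schools, the same inspection can be scripted against the returned document. The sketch below assumes a trimmed, hypothetical `CollegeDoc` type that models only the fields named above, and the sample values are invented.

```typescript
// Hypothetical, trimmed view of a colleges document.
type FetchedBlock = { fetchedAt?: number };
type CollegeDoc = {
  unitId: number;
  updatedAt?: number;
  athletics?: FetchedBlock;
  walkability?: FetchedBlock;
  ipeds?: FetchedBlock;
};

type BlockName = "athletics" | "walkability" | "ipeds";
const BLOCK_NAMES: BlockName[] = ["athletics", "walkability", "ipeds"];

// Lists the patched blocks that are absent or have no fetchedAt stamp.
function missingBlocks(doc: CollegeDoc): BlockName[] {
  return BLOCK_NAMES.filter((name) => doc[name]?.fetchedAt === undefined);
}

const sampleDoc: CollegeDoc = {
  unitId: 110635,
  updatedAt: 1700000000000,
  athletics: { fetchedAt: 1700000100000 },
  walkability: {}, // block exists but was never stamped
};

const missing = missingBlocks(sampleDoc);
console.log(missing); // [ 'walkability', 'ipeds' ]
```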
To check last backfill run per source:
Query the backfillMetadata table via the Convex dashboard (Tables → backfillMetadata). Each source writes a row on completion:
- source: "eada" (set by recordBackfillCompletion at the end of the EADA bulk action)
- source: "epa" (set by recordBackfillCompletion at the end of the EPA bulk action)
Scorecard does not write a backfillMetadata row (it schedules mutations rather than running inline) — check the cron's last run time instead.
Manual triggers
Run any action from your terminal with the Convex CLI:
# Full Scorecard refresh (all ~6500 schools)
npx convex run features/colleges/actions:scorecardBackfill
# Single-school fetch and cache (Scorecard only)
npx convex run features/colleges/actions:fetchAndCache '{"unitId":110635}'
# Single-school EADA (athletics)
npx convex run features/colleges/actions:fetchAndCacheAthleticsForCollege '{"unitId":110635}'
# Single-school EPA (walkability)
npx convex run features/colleges/actions:fetchAndCacheWalkabilityForCollege '{"unitId":110635}'
# Single-school IPEDS (application fee + characteristics)
npx convex run features/colleges/actions:fetchAndCacheIpedsForCollege '{"unitId":110635}'
# Recompute derived fields for all colleges (after a derive.ts change)
npx convex run features/colleges/internal:runDerivedBackfill
To target a non-default Convex environment (e.g. production):
npx convex run --prod features/colleges/actions:scorecardBackfill
Smoke checks after backfill
Run these checks after any bulk backfill to confirm data quality before releasing to users.
# Spot-check known schools: MIT (166683), UCLA (110662), Howard (131520)
npx convex run features/colleges/internal:byUnitId '{"unitId":166683}'
npx convex run features/colleges/internal:byUnitId '{"unitId":110662}'
npx convex run features/colleges/internal:byUnitId '{"unitId":131520}'
Manual checks in the Convex dashboard (Tables → colleges):
- Row count should be ~6,500 after a full Scorecard backfill (filter active: true)
- admissions.admitRate should be non-null for the ~3,000 schools that report it
- athletics.division should be non-null for any school after EADA backfill
- walkability.scoreNormalized should be a 0-100 value for schools with lat/lon after EPA backfill
- ipeds.applicationFee should be a whole number or null (never 0 for schools that charge a fee; 0 would indicate a bug in the IPEDS mapper)
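The per-row checks in this list can also be scripted against exported rows. The sketch below covers only the walkability and IPEDS rules. `SmokeRow` is a hypothetical, trimmed type. Because the code cannot know whether a school really charges a fee, it flags every 0 fee for manual review instead of treating it as a definite bug.

```typescript
// Hypothetical, trimmed view of a colleges row; only the checked fields are modeled.
type SmokeRow = {
  unitId: number;
  walkability?: { scoreNormalized: number };
  ipeds?: { applicationFee: number | null };
};

// Returns human-readable problems; an empty array means the row passed.
function smokeCheckRow(row: SmokeRow): string[] {
  const problems: string[] = [];

  const score = row.walkability?.scoreNormalized;
  // Written as a negated range test so NaN is rejected too.
  if (score !== undefined && !(score >= 0 && score <= 100)) {
    problems.push(`walkability.scoreNormalized out of range: ${score}`);
  }

  const fee = row.ipeds?.applicationFee;
  if (fee !== undefined && fee !== null) {
    if (!Number.isInteger(fee)) {
      problems.push(`ipeds.applicationFee is not a whole number: ${fee}`);
    }
    if (fee === 0) {
      problems.push("ipeds.applicationFee is 0: confirm the school charges no fee");
    }
  }
  return problems;
}

const goodRow: SmokeRow = {
  unitId: 1,
  walkability: { scoreNormalized: 73.5 },
  ipeds: { applicationFee: 65 },
};
const badRow: SmokeRow = {
  unitId: 2,
  walkability: { scoreNormalized: 140 },
  ipeds: { applicationFee: 0 },
};

const goodProblems = smokeCheckRow(goodRow);
const badProblems = smokeCheckRow(badRow);
console.log(goodProblems.length, badProblems.length); // 0 2
```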
When NOT to run a backfill
- During App Store review window: Backfills schedule thousands of mutations in rapid succession (one upsert per school). Convex mutation queues can briefly spike latency. Avoid running during the 24h before a scheduled review submission or while a review is active.
- During peak product launch events: Same reason — mutation throughput competes with live user traffic.
- When the @convex-dev/action-cache version is being upgraded: The ActionCache name strings ("scorecard-v1", "walkability-v1", etc.) are version-keyed. A version bump in a cache name invalidates all cached entries and forces re-fetches. Only trigger a full backfill after the cache-version deploy has settled (allow 10 minutes for deployment propagation).
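The effect of a name bump can be shown with a toy name-keyed cache. This is not the @convex-dev/action-cache API. `getOrFetch`, the key format, and the placeholder value are assumptions, chosen only to show why entries stored under "walkability-v1" are invisible to lookups under "walkability-v2".

```typescript
// Toy stand-in for a name-keyed cache; the real component is backed by Convex tables.
const sharedStore = new Map<string, number>();
let upstreamFetches = 0;

// Looks up `${cacheName}:${unitId}` and falls back to a (fake) upstream fetch on a miss.
function getOrFetch(cacheName: string, unitId: number): number {
  const key = `${cacheName}:${unitId}`;
  const cached = sharedStore.get(key);
  if (cached !== undefined) return cached;
  upstreamFetches++;
  const value = 42; // placeholder for an upstream API response
  sharedStore.set(key, value);
  return value;
}

getOrFetch("walkability-v1", 110635); // miss: first fetch
getOrFetch("walkability-v1", 110635); // hit: served from cache
const fetchesBeforeBump = upstreamFetches;

getOrFetch("walkability-v2", 110635); // miss: the new name shares no keys with v1
const fetchesAfterBump = upstreamFetches;
console.log(fetchesBeforeBump, fetchesAfterBump); // 1 2
```

After a bump, every school misses once, so a full backfill run at that moment pays the full upstream cost described in Steps 2 and 3.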