Backfill College Data

Operational runbook

Sources and cadence

| Source | Convex action | Cron | Update frequency |
| --- | --- | --- | --- |
| College Scorecard | features/colleges/actions:scorecardBackfill | Monthly, 1st of month 06:00 UTC | Annual upstream (safe to re-run monthly) |
| Urban EADA (athletics) | features/colleges/actions:fetchAndCacheAthleticsForCollege | None; manual trigger per school, or run per-school loop | Annual upstream (~5yr physical dataset, monthly is safe) |
| EPA Walkability | features/colleges/actions:fetchAndCacheWalkabilityForCollege | None; manual trigger per school | ~5yr upstream release cycle |
| Urban IPEDS (app fee) | features/colleges/actions:fetchAndCacheIpedsForCollege | None; manual trigger per school | Annual IPEDS release |

The Scorecard backfill is the primary dataset. EADA, EPA, and IPEDS patch additional blocks onto existing colleges rows — they require the Scorecard row to exist first.

Initial deploy: run all three in order

Run these in sequence. Steps 2 and 3 patch the colleges rows that step 1 creates, so the Scorecard backfill must finish first.

Step 1: Scorecard backfill (run first)

npx convex run features/colleges/actions:scorecardBackfill

This paginates the full Scorecard dataset (operating institutions, predominant degrees 0–4 post-Unit 29, size 10+) and schedules one upsert mutation per school via ctx.scheduler.runAfter. The upsert uses a content-hash check, so unchanged rows are skipped. Every upsert also calls categorizeFromRawRow (Plan Unit 9) and stamps primaryCategory, categoryVersion, contentVersion, tags, parentUnitId, isBranch, and state (promoted to the top level). When the category matches, the aggregate fan-out into aggregateWorkforce / aggregateTransferOutcome happens atomically in the same transaction.
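The skip-on-unchanged behaviour can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: the runbook does not show the real hashing scheme, so `stableStringify`, `contentHash`, and `needsWrite` are hypothetical names, and FNV-1a is used only as a dependency-free stand-in for whatever hash the real upsert computes.

```typescript
// Hypothetical helpers; the real upsert's hashing scheme is not shown in this runbook.
type Row = Record<string, unknown>;

// Serialize with sorted keys so property order does not change the hash.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value) ?? "null";
  }
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`).join(",")}}`;
}

// FNV-1a (32-bit): a stand-in for a real content hash, chosen to avoid imports.
function contentHash(row: Row): string {
  const text = stableStringify(row);
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16).padStart(8, "0");
}

// A write is needed only when there is no stored hash or the content changed.
function needsWrite(storedHash: string | undefined, incoming: Row): boolean {
  return storedHash !== contentHash(incoming);
}
```

With this shape, a monthly re-run over unchanged upstream data computes the same hash for every row and `needsWrite` returns false, which is why such runs are close to no-ops.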

Expected output post-Unit 29 universe expansion: { scheduled: ~6200, pages: ~62 }. The action itself returns quickly; the scheduled background mutations finish within ~5 minutes for the monthly cron and take longer (~10 min) on the first universe-expansion run. Watch the Functions → Scheduled tab in the Convex dashboard to see the queue drain.

Pre-deploy on rehearsal: count category re-assignments. If the existing 4,336 rows produce more than 50 re-categorizations against your last prod snapshot, abort and investigate the categorize tree before letting the prod cron run.
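A minimal sketch of that rehearsal gate, assuming you have exported unitId → primaryCategory pairs from both the last prod snapshot and the rehearsal deployment. The snapshot format and the function names are hypothetical; only the threshold of 50 comes from this runbook.

```typescript
// Hypothetical snapshot shape: unitId -> primaryCategory.
type CategorySnapshot = Map<number, string>;

const RECATEGORIZATION_LIMIT = 50; // threshold from the runbook

// Count rows present in both snapshots whose category changed.
// Rows that exist only in the rehearsal snapshot are additions, not re-assignments.
function countRecategorizations(prod: CategorySnapshot, rehearsal: CategorySnapshot): number {
  let changed = 0;
  prod.forEach((category, unitId) => {
    const next = rehearsal.get(unitId);
    if (next !== undefined && next !== category) changed++;
  });
  return changed;
}

// True when the rehearsal run should block the prod cron.
function shouldAbort(prod: CategorySnapshot, rehearsal: CategorySnapshot): boolean {
  return countRecategorizations(prod, rehearsal) > RECATEGORIZATION_LIMIT;
}
```

Counting only rows present in both snapshots keeps the universe expansion itself (new rows) from tripping the gate.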

App Store window: universe expansion (Unit 29) MUST NOT land within 72h of an App Store submission per Plan Risks table. Reviewer testing during a backfill produces flaky empty states.

Do not proceed to steps 2–3 until the scheduled upserts are done (queue is empty).

Step 2: EADA athletics backfill

EADA does not have a dedicated bulk backfill action yet. Run per-school via the existing fetchAndCacheAthleticsForCollege action, or use the runDerivedBackfill pattern as a model if you need to add a bulk action.

For an initial full backfill, the recommended approach is to add a bulk action (eadaBackfill) in features/colleges/actions.ts following the scorecardBackfill pattern, then run:

# When eadaBackfill action exists:
npx convex run features/colleges/actions:eadaBackfill

Expected wall time for full dataset: ~57 minutes (Urban EADA API is rate-limited; the action cache handles re-runs cheaply after first pass).

Step 3: EPA walkability backfill

Walkability requires lat/lon to be present on the college doc (populated by Scorecard — hence the ordering dependency). Same pattern as EADA:

# When epaBackfill action exists:
npx convex run features/colleges/actions:epaBackfill

Expected wall time for full dataset: ~34 minutes (EPA NatWalkInd API, block-group resolution). The walkabilityCache TTL is 180 days — re-runs within that window return cached values and are very fast.
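To estimate how much of a re-run will be served from cache, a rough freshness check like the one below can help. This is only a sketch: in the real system expiry is handled by the cache itself, and the entry shape and function names here are hypothetical. The 180-day TTL is the value stated above.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const WALKABILITY_TTL_MS = 180 * DAY_MS; // TTL stated in the runbook

// An entry is fresh while its age is strictly below the TTL.
function isCacheFresh(fetchedAt: number, now: number, ttlMs: number = WALKABILITY_TTL_MS): boolean {
  return now - fetchedAt < ttlMs;
}

// Hypothetical entry shape: when (if ever) walkability was last fetched for a school.
interface WalkabilityEntry {
  unitId: number;
  fetchedAt?: number; // epoch ms
}

// Split a batch into schools that would hit the cache and schools that need a fetch.
function partitionByFreshness(entries: WalkabilityEntry[], now: number) {
  const cached: number[] = [];
  const stale: number[] = [];
  for (const entry of entries) {
    if (entry.fetchedAt !== undefined && isCacheFresh(entry.fetchedAt, now)) {
      cached.push(entry.unitId);
    } else {
      stale.push(entry.unitId);
    }
  }
  return { cached, stale };
}
```

Only the `stale` group contributes meaningfully to wall time on a re-run.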

Monthly cron details

The Scorecard refresh cron is declared in packages/backend/convex/crons.ts:

crons.monthly(
  "scorecard refresh",
  // Convex's monthly schedule takes day, hourUTC, and minuteUTC.
  { day: 1, hourUTC: 6, minuteUTC: 0 },
  internal.features.colleges.actions.scorecardBackfill,
  {},
);

Fires at 06:00 UTC on the 1st of each month. This covers the annual Scorecard data refresh (released in late calendar year). The content-hash check on upsert means months with no upstream changes are essentially no-ops (hashes match → no DB write).
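When checking whether a run is overdue, it can help to compute the next expected fire time. The helper below is a standalone sketch of the schedule's arithmetic only; `nextMonthlyRun` is a hypothetical name and it does not call or model the Convex scheduler.

```typescript
// Next occurrence of "day D at HH:00 UTC", strictly after `now`.
function nextMonthlyRun(now: Date, day: number = 1, hourUTC: number = 6): Date {
  const thisMonth = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), day, hourUTC, 0, 0),
  );
  if (thisMonth.getTime() > now.getTime()) return thisMonth;
  // Date.UTC rolls month 12 over into January of the following year.
  return new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, day, hourUTC, 0, 0),
  );
}

// 2025-12-15 is past December's run, so the next one is in January 2026.
const next = nextMonthlyRun(new Date("2025-12-15T00:00:00Z"));
console.log(next.toISOString()); // 2026-01-01T06:00:00.000Z
```

If the dashboard's Last Run is older than the previous expected fire time, the cron has missed a run.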

To verify the cron ran: open the Convex dashboard → Cron Jobs → "scorecard refresh" → Last Run. If the backfillMetadata table contains a source: "scorecard" row, its lastRunAt field should show the same run.

To temporarily disable: Comment out the crons.monthly(...) block in crons.ts and deploy. Re-enable by reverting and re-deploying. Do not delete the cron definition — that requires an explicit Convex deployment step to take effect.

Failure recovery

All three backfill actions use the chunked-resumable pattern from features/colleges/internal.ts:backfillDerivedFields as the reference implementation. The pattern:

  1. Actions paginate via ctx.scheduler.runAfter (Scorecard) or cursor-based loops (derived, EADA, EPA)
  2. Each chunk is an independent mutation — partial failures leave already-processed rows intact
  3. Re-running from the top is safe — the upsert hash check and per-school athleticsCache / walkabilityCache / ipedsCache (all backed by @convex-dev/action-cache) skip no-op work
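The three properties above can be sketched as a generic cursor loop. This is an illustration under stated assumptions, not the code in features/colleges/internal.ts: `runChunked`, `Page`, and the in-memory store are hypothetical, and the loop is synchronous so it runs standalone, whereas the real actions are async Convex functions.

```typescript
// Hypothetical shapes; the real actions are async Convex functions.
interface Page<T> {
  items: T[];
  nextCursor: string | null;
}
type Outcome = "written" | "skipped";

// Walk every page from the start, processing each item independently.
function runChunked<T>(
  fetchPage: (cursor: string | null) => Page<T>,
  processItem: (item: T) => Outcome,
): { written: number; skipped: number; pages: number } {
  const totals = { written: 0, skipped: 0, pages: 0 };
  let cursor: string | null = null;
  do {
    const page = fetchPage(cursor);
    for (const item of page.items) {
      totals[processItem(item)]++;
    }
    totals.pages++;
    cursor = page.nextCursor;
  } while (cursor !== null);
  return totals;
}

// In-memory demo: two pages of unitIds and a store that remembers what it wrote.
const demoPages: Record<string, Page<number>> = {
  start: { items: [1, 2], nextCursor: "p2" },
  p2: { items: [3, 4], nextCursor: null },
};
const demoStore = new Set<number>();
const demoFetch = (cursor: string | null): Page<number> => demoPages[cursor ?? "start"];
const demoUpsert = (unitId: number): Outcome => {
  if (demoStore.has(unitId)) return "skipped"; // stands in for the hash / cache check
  demoStore.add(unitId);
  return "written";
};

const firstRun = runChunked(demoFetch, demoUpsert);  // writes all four rows
const secondRun = runChunked(demoFetch, demoUpsert); // re-run from the top: all skipped
```

Because each item is processed on its own and the skip check is idempotent, a run that dies partway through can be restarted from the first page without redoing completed work.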

To resume a partially failed Scorecard backfill:

Simply re-run. The Scorecard action re-pages from the start, but the content-hash check on upsert skips rows that are already current. Already-cached schools are fast (ActionCache TTL: 24h for Scorecard).

To check if a school was backfilled:

npx convex run features/colleges/internal:byUnitId '{"unitId":110635}'

Look for updatedAt timestamp and the presence of athletics.fetchedAt / walkability.fetchedAt / ipeds.fetchedAt fields.
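If you paste the byUnitId output into a script, a helper along these lines can summarize which sources have landed. The document shape is an assumption limited to the fields named above, and `backfillStatus` is a hypothetical name.

```typescript
// Hypothetical subset of a colleges document: only the fields named above.
interface CollegeDoc {
  unitId: number;
  updatedAt?: number;
  athletics?: { fetchedAt?: number };
  walkability?: { fetchedAt?: number };
  ipeds?: { fetchedAt?: number };
}

// Report, per source, whether the doc shows evidence of a completed backfill.
function backfillStatus(doc: CollegeDoc) {
  return {
    scorecard: doc.updatedAt !== undefined,
    eada: doc.athletics?.fetchedAt !== undefined,
    epa: doc.walkability?.fetchedAt !== undefined,
    ipeds: doc.ipeds?.fetchedAt !== undefined,
  };
}

// Example: Scorecard and EADA have run for this school; EPA and IPEDS have not.
const status = backfillStatus({
  unitId: 110635,
  updatedAt: 1_700_000_000_000,
  athletics: { fetchedAt: 1_700_000_100_000 },
  walkability: {},
});
```

A block that exists without a fetchedAt (as with `walkability` in the example) is treated as not yet backfilled.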

To check last backfill run per source:

Query the backfillMetadata table via the Convex dashboard (Tables → backfillMetadata). The EADA and EPA bulk actions each write a row on completion:

  • source: "eada" — set by recordBackfillCompletion at end of EADA bulk action
  • source: "epa" — set by recordBackfillCompletion at end of EPA bulk action

Scorecard does not write a backfillMetadata row (it schedules mutations rather than running inline) — check the cron's last run time instead.

Manual triggers

Run any action from your terminal with the Convex CLI:

# Full Scorecard refresh (all ~6500 schools)
npx convex run features/colleges/actions:scorecardBackfill

# Single-school fetch and cache (Scorecard only)
npx convex run features/colleges/actions:fetchAndCache '{"unitId":110635}'

# Single-school EADA (athletics)
npx convex run features/colleges/actions:fetchAndCacheAthleticsForCollege '{"unitId":110635}'

# Single-school EPA (walkability)
npx convex run features/colleges/actions:fetchAndCacheWalkabilityForCollege '{"unitId":110635}'

# Single-school IPEDS (application fee + characteristics)
npx convex run features/colleges/actions:fetchAndCacheIpedsForCollege '{"unitId":110635}'

# Recompute derived fields for all colleges (after a derive.ts change)
npx convex run features/colleges/internal:runDerivedBackfill

To target a non-default Convex environment (e.g. production):

npx convex run --prod features/colleges/actions:scorecardBackfill

Smoke checks after backfill

Run these checks after any bulk backfill to confirm data quality before releasing to users.

# Spot-check known schools: UC Berkeley (110635), UCLA (110662), Howard (131520)
npx convex run features/colleges/internal:byUnitId '{"unitId":110635}'
npx convex run features/colleges/internal:byUnitId '{"unitId":110662}'
npx convex run features/colleges/internal:byUnitId '{"unitId":131520}'

Manual checks in the Convex dashboard (Tables → colleges):

  1. Row count should be ~6,500 after a full Scorecard backfill (filter active: true)
  2. admissions.admitRate should be non-null for the ~3,000 schools that report it
  3. athletics.division should be non-null for any school after EADA backfill
  4. walkability.scoreNormalized should be a 0-100 value for schools with lat/lon after EPA backfill
  5. ipeds.applicationFee should be a whole number or null (never 0 for schools that charge a fee; 0 would indicate a bug in the IPEDS mapper)
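Checks 4 and 5 lend themselves to automation over exported rows. The validator below is a sketch assuming a hypothetical document shape that covers only those two fields; `smokeCheck` is not an existing function in the codebase.

```typescript
// Hypothetical document shape covering only the fields used by checks 4 and 5.
interface CollegeSample {
  unitId: number;
  walkability?: { scoreNormalized?: number | null };
  ipeds?: { applicationFee?: number | null };
}

// Return a list of human-readable problems; an empty list means the doc passed.
function smokeCheck(doc: CollegeSample): string[] {
  const problems: string[] = [];

  const score = doc.walkability?.scoreNormalized;
  // The negated range test also flags NaN.
  if (score !== undefined && score !== null && !(score >= 0 && score <= 100)) {
    problems.push(`${doc.unitId}: walkability.scoreNormalized ${score} is outside 0-100`);
  }

  const fee = doc.ipeds?.applicationFee;
  if (fee !== undefined && fee !== null) {
    if (!Number.isInteger(fee) || fee < 0) {
      problems.push(`${doc.unitId}: ipeds.applicationFee ${fee} is not a whole number`);
    } else if (fee === 0) {
      // Per the runbook, 0 suggests a bug in the IPEDS mapper rather than a free application.
      problems.push(`${doc.unitId}: ipeds.applicationFee is 0 (possible IPEDS mapper bug)`);
    }
  }
  return problems;
}
```

Missing or null values pass on purpose, since both fields are legitimately absent for some schools.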

When NOT to run a backfill

  • During App Store review window: A full Scorecard backfill schedules thousands of mutations (~6,200) in rapid succession. Convex mutation queues can briefly spike latency. Avoid running during the 24h before a scheduled review submission or while a review is active.
  • During peak product launch events: Same reason — mutation throughput competes with live user traffic.
  • When @convex-dev/action-cache version is being upgraded: The ActionCache name strings ("scorecard-v1", "walkability-v1", etc.) are version-keyed. A version bump in a cache name invalidates all cached entries and forces re-fetches. Only trigger a full backfill after the cache-version deploy has settled (allow 10 minutes for deployment propagation).
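The review-window rule can be turned into a pre-flight guard for scripts that trigger backfills. This is a sketch with hypothetical names and inputs; it assumes the window is measured before the submission time, with 24h as the default for routine backfills and 72h passed explicitly for the Unit 29 universe expansion.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical inputs for the guard.
interface FreezeInput {
  now: number;           // epoch ms
  submissionAt?: number; // planned App Store submission, epoch ms
  reviewActive: boolean; // true while a review is in progress
  windowHours?: number;  // 24 for routine backfills; 72 for universe expansion
}

// True when a backfill should not be started.
function backfillBlocked({ now, submissionAt, reviewActive, windowHours = 24 }: FreezeInput): boolean {
  if (reviewActive) return true;
  if (submissionAt === undefined) return false;
  const hoursUntilSubmission = (submissionAt - now) / HOUR_MS;
  return hoursUntilSubmission >= 0 && hoursUntilSubmission <= windowHours;
}
```

For example, with a submission 48 hours away, a routine backfill is allowed but a universe-expansion run (windowHours: 72) is blocked.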
