AI and matching
How the matching engine works, what AI does today, and why V2 replaced V1's RAG
This is the most important architectural section to read. V2 made a deliberate, non-trivial change from V1 here.
TL;DR
- Matching is deterministic math. Reach/Fit/Safety is computed from real numbers (GPA, test scores, admit rates, federal net price, completion rates). Same input → same output → fully explainable.
- AI writes the 2–3 sentence "why this might be for you" blurb on each college card. That's it. AI does not decide the match.
- V1 used RAG (Retrieval-Augmented Generation with vector embeddings). It was replaced because the inputs that actually drive college matching are numeric, not textual.
What V1 did
V1's matching engine ran on @convex-dev/rag with OpenAI's text-embedding-3-small (1536-dimensional embeddings). Every college's profile — name, location, stats, top programs, "strong in" categories, features — was concatenated into a structured text blob and embedded. The vectors were stored in Convex's RAG component under the colleges namespace.
A user's quiz responses were converted into a similar query embedding. Cosine similarity over the vector store returned the nearest N colleges; those were handed to a generation model to produce match explanations.
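The retrieval step described above can be illustrated with a minimal sketch. This is not the `@convex-dev/rag` internals: `cosineSimilarity`, `nearestColleges`, and the toy 3-dimensional vectors are hypothetical stand-ins for the 1536-dimensional embeddings, and the search here is brute force rather than an index.

```typescript
// Hypothetical illustration of V1-style retrieval; not the @convex-dev/rag API.
type EmbeddedCollege = { unitId: number; vector: number[] };

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force nearest-N: score every college, sort descending, keep the top n.
function nearestColleges(query: number[], colleges: EmbeddedCollege[], n: number): number[] {
  return colleges
    .map((c) => ({ unitId: c.unitId, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, n)
    .map((c) => c.unitId);
}

// Toy data: made-up ids and vectors.
const toyColleges: EmbeddedCollege[] = [
  { unitId: 1, vector: [1, 0, 0] },
  { unitId: 2, vector: [0.9, 0.1, 0] },
  { unitId: 3, vector: [0, 1, 0] },
];
console.log(nearestColleges([1, 0, 0], toyColleges, 2)); // [1, 2]
```

Note that the ranking depends only on how close two text-derived vectors are; nothing in it compares an SAT score to an SAT range.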
The V1 ingestion pipeline
V1 didn't read Scorecard live. It downloaded the full Scorecard CSV (MERGED2023_24_PP.csv, ~6,000 institutions), parsed it with a batch processor under packages/backend/scripts/ingestion/, called OpenAI for AI-generated descriptions and feature badges, scraped logos via logo.dev, fetched hero images via Serper, then bulk-inserted the fully-formed college documents.
A single ingestion run cost $0.01/college for content generation + Serper calls ($0.90 per 3,000 schools at $0.30/1K queries) + logo.dev. Full pipeline: 2–3 seconds per college; partial runs an order of magnitude faster.
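The cost arithmetic can be checked with a short sketch. It uses only the per-unit prices quoted above and assumes one Serper query per college; logo.dev pricing is not stated, so it is left out. The function name is hypothetical.

```typescript
// Per-unit prices quoted above; logo.dev pricing is not stated, so it is omitted.
const CONTENT_GEN_USD_PER_COLLEGE = 0.01;
const SERPER_USD_PER_1K_QUERIES = 0.3;

// Estimated cost of one ingestion run, assuming one Serper query per college.
function ingestionCostUsd(collegeCount: number): { contentGen: number; serper: number; total: number } {
  const contentGen = collegeCount * CONTENT_GEN_USD_PER_COLLEGE;
  const serper = (collegeCount / 1000) * SERPER_USD_PER_1K_QUERIES;
  return { contentGen, serper, total: contentGen + serper };
}

const run3k = ingestionCostUsd(3000);
console.log(run3k.serper.toFixed(2)); // "0.90"
console.log(run3k.total.toFixed(2)); // "30.90"
```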
Why V1 didn't work for NXT
- Retrieval was non-deterministic in practice. Identical query embeddings could surface different top-K results across runs due to vector index sharding and tie-breaking. Same student, different day, different list. Impossible to support.
- The signal is numeric, not textual. "Is this school a Reach?" is decided by SAT range + GPA + admit rate, not by how a college's description "sounds like" the student's profile. RAG converts numbers to text, embeds text, then searches by text similarity — three lossy steps where one direct comparison would do.
- Coverage was uneven. Schools without rich text descriptions (most trade schools, many community colleges) had weaker embeddings and got worse matches. The V2 expansion into trade schools — the loudest signal from administrators and teachers — would have been infeasible under V1.
- Cost grew with usage. Every retrieval was a paid embedding + paid generation call. V2 pays for AI once per `(student, school)` pair and caches it.
- Hallucinations on facts. Generation models embedded under RAG would invent specifics (for example, "great mentorship program") when no such program existed. There is no way to fact-check a vector-grounded blurb against a numeric truth.
The original V1 RAG implementation lives at packages/backend/convex/services/rag.ts on the legacy/v1-archive branch. The V1 setup guide is at docs/technical/rag-setup-guide.md on that branch. Both are reference-only.
What V2 does
V2 has two layers: a deterministic personalization engine that picks and ranks schools, and a grounded AI blurb that writes the friendly explanation.
Deterministic personalization
The current code lives at:
- `packages/backend/convex/features/discover/`: the rail composer (Picked For You, Learning Style, Campus Vibe, High Value, Hidden Gems, Program Leader, Test Optional, First Generation, MSI).
- `packages/backend/convex/lib/rfsEngine.ts`: Reach/Fit/Safety verdict computation.
- `packages/backend/convex/features/rfs/`: verdict caching + cleanup.
Every signal is sourced from a federal field:
| Rail / signal | Source field(s) |
|---|---|
| RFS verdict | ADM_RATE, SAT_AVG, ACT* percentiles vs. student GPA/test scores |
| Picked For You | composite score: interests × programs offered, geographic distance, GPA fit |
| Learning Style | personality quiz result mapped to documented school traits |
| Campus Vibe | setting (rural/town/suburb/metro) + size + walkability + politics from quiz |
| High Value | net price ÷ median earnings (10-yr post-grad) |
| Hidden Gems | high quality bucket, low awareness signal |
| Program Leader | top earnings + completions in student's study area (CIP-level) |
| Test Optional | ADMCON7=5 Scorecard flag |
| First Generation | parental education proxy + Pell-eligible cohort outcomes |
| MSI | Scorecard's HBCU / AANAPISI / HSI / TRIBAL / PBI flags |
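As one example of how a row in the table above reduces to arithmetic, the High Value signal can be sketched as a ratio and a sort. The type and function names are hypothetical, and the real rail composer may weight, bucket, or filter differently; the only part taken from the table is the ratio itself.

```typescript
// Hypothetical shape; the real rail composer may weight or bucket differently.
type ValueInput = { unitId: number; netPriceUsd: number | null; medianEarnings10yrUsd: number | null };

// Net price divided by median earnings 10 years out. Lower means better value.
// Returns null when either input is missing or earnings are not positive.
function valueRatio(c: ValueInput): number | null {
  if (c.netPriceUsd === null || c.medianEarnings10yrUsd === null) return null;
  if (c.medianEarnings10yrUsd <= 0) return null;
  return c.netPriceUsd / c.medianEarnings10yrUsd;
}

// Rank by ascending ratio; ties break on unitId so the order is reproducible.
function rankHighValue(colleges: ValueInput[]): number[] {
  return colleges
    .map((c) => ({ unitId: c.unitId, ratio: valueRatio(c) }))
    .filter((r): r is { unitId: number; ratio: number } => r.ratio !== null)
    .sort((a, b) => a.ratio - b.ratio || a.unitId - b.unitId)
    .map((r) => r.unitId);
}
```

The explicit tie-break is the design point: with it, two schools that share a ratio always come back in the same order.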
The Afford peek on every card runs the actual federal net-price formula against the student's financeBracket (one of 5 federal income brackets). "$X for a family in your income range" is a fact pulled from NPT4* fields, not a guess.
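A minimal sketch of the bracket lookup, under stated assumptions: `financeBracket` is taken to be an integer 1 to 5, the row is a plain record keyed by Scorecard column name, and the helper names are hypothetical. Scorecard publishes bracketed net price as `NPT41` through `NPT45` with a `_PUB` or `_PRIV` suffix; how the app actually maps and post-processes those fields is not shown here.

```typescript
// Assumption: financeBracket is 1..5, matching Scorecard's five income bands
// (1 = $0-30,000 ... 5 = $110,001+). The app's real encoding may differ.
type FinanceBracket = 1 | 2 | 3 | 4 | 5;
type ScorecardRow = Record<string, number | null | undefined>;

// Build the column name for a bracket, e.g. NPT42_PRIV for bracket 2, private.
function netPriceField(bracket: FinanceBracket, isPublic: boolean): string {
  return `NPT4${bracket}_${isPublic ? "PUB" : "PRIV"}`;
}

// Returns the reported net price, or null when the field is missing or suppressed.
function affordPeek(row: ScorecardRow, bracket: FinanceBracket, isPublic: boolean): number | null {
  const value = row[netPriceField(bracket, isPublic)];
  return typeof value === "number" ? value : null;
}

const privateCollegeRow: ScorecardRow = { NPT41_PRIV: 9800, NPT42_PRIV: 11200 }; // made-up numbers
console.log(affordPeek(privateCollegeRow, 2, false)); // 11200
console.log(affordPeek(privateCollegeRow, 5, false)); // null
```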
The five meaning-first answers on college detail (Afford, Admit, Outcomes, Community, Finish) each pull from named Scorecard fields. Outcomes shows real median earnings 10 years post-graduation (MD_EARN_WNE_P10). Admit shows the school's real admit rate against the student's real GPA/test scores to produce a Reach/Fit/Safety verdict.
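To make "same input, same output" concrete, here is a sketch of a verdict function. The thresholds and the input shape are invented for illustration and are not the values in `rfsEngine.ts`; the point is only that the verdict is a pure function of numbers.

```typescript
// Illustrative only: the thresholds below are invented for this sketch and
// are not the values used in rfsEngine.ts.
type Verdict = "reach" | "fit" | "safety";
type StudentScores = { sat: number };
type CollegeAdmit = { admitRate: number; sat25: number; sat75: number };

function rfsVerdict(student: StudentScores, college: CollegeAdmit): Verdict {
  // Very selective schools are a reach for everyone in this sketch.
  if (college.admitRate < 0.15) return "reach";
  // Below the school's 25th percentile SAT: reach.
  if (student.sat < college.sat25) return "reach";
  // Above the 75th percentile at a school admitting at least half: safety.
  if (student.sat > college.sat75 && college.admitRate >= 0.5) return "safety";
  return "fit";
}

const openAdmit: CollegeAdmit = { admitRate: 0.6, sat25: 1000, sat75: 1200 };
console.log(rfsVerdict({ sat: 1100 }, openAdmit)); // "fit"
```

Because there is no randomness, index, or model call, each branch can be explained to a student in one sentence.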
The AI blurb (the only AI in the app)
Each college card shows a 2–3 sentence "why this might be for you" blurb. Example:
You're aiming for a small, walkable campus and Bowdoin's setting matches that. Your SAT puts you inside their middle 50% range, and Bowdoin's outcomes for English majors line up with the area you're considering.
Where it lives
- Builder: `packages/backend/convex/lib/openai.ts` (pure prompt builder + HTTP client).
- Action: `packages/backend/convex/features/colleges/actions.ts` → `generateUserReasoning`.
- Cache table: `collegeReasoning` (one row per `(userId, unitId)`).
- Cleanup: weekly cron `collegeReasoning cleanup` evicts rows older than 30 days.
Model + cost
- Model: `gpt-5.4-nano` (cheapest + fastest reasoning-family model, as of 2026-05).
- Input: structured `ReasoningUserContext` + `ReasoningCollegeContext`, limited to the numeric facts the model needs.
- Output: 2–3 sentences, ≤200 tokens.
- Temperature: handled by the API default (reasoning models reject custom `temperature` and reject `max_tokens`; `max_completion_tokens` is used instead).
- Timeout: 15s.
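The settings above might translate into a request like the following sketch. It assumes the Chat Completions endpoint (inferred from the use of `max_completion_tokens`) and that the 200-token output limit is passed directly as that parameter; `buildBlurbRequest` and `postWithTimeout` are hypothetical names, not exports of `openai.ts`.

```typescript
// Sketch of a Chat Completions request for a reasoning-family model.
// Helper names are hypothetical; the endpoint choice is an assumption.
type ChatMessage = { role: "system" | "user"; content: string };

const BLURB_MODEL = "gpt-5.4-nano"; // model named in this document
const BLURB_MAX_TOKENS = 200;
const BLURB_TIMEOUT_MS = 15_000;

function buildBlurbRequest(apiKey: string, messages: ChatMessage[]) {
  return {
    url: "https://api.openai.com/v1/chat/completions",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
      // No `temperature` and no `max_tokens`: both are rejected per the notes above.
      body: JSON.stringify({ model: BLURB_MODEL, messages, max_completion_tokens: BLURB_MAX_TOKENS }),
    },
  };
}

// Abort the request if it runs past the 15s budget.
async function postWithTimeout(url: string, init: RequestInit): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), BLURB_TIMEOUT_MS);
  try {
    return await fetch(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```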
Anti-slop guardrails (locked, enforced in system prompt)
- No superlatives ("amazing", "perfect", "world-class").
- No marketing buzzwords ("nurturing community", "vibrant tapestry").
- No AI-vocab tics ("delve", "robust", "pivotal").
- No em dashes.
- Second-person voice ("you").
- 2–3 short sentences max.
- Every claim grounded in a specific data point handed to the model — no inventing facts.
The prompt is grounded: the model receives the student's GPA, test scores, the school's admit rate, the net-price estimate, matching programs, and is instructed to write 2–3 sentences using only those facts.
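The guardrails lend themselves to a mechanical check, for instance in a test that reviews sample outputs. The sketch below is hypothetical: this document only states that the rules are enforced in the system prompt, and `blurbViolations` is not part of the codebase. Its phrase list mirrors the examples given above and is not exhaustive.

```typescript
// Hypothetical post-generation check; the document only states that the rules
// are enforced in the system prompt. The phrase list mirrors the examples above.
const BANNED_PHRASES = [
  "amazing", "perfect", "world-class", // superlatives
  "nurturing community", "vibrant tapestry", // marketing buzzwords
  "delve", "robust", "pivotal", // AI-vocab tics
];

function blurbViolations(blurb: string): string[] {
  const problems: string[] = [];
  const lower = blurb.toLowerCase();
  for (const phrase of BANNED_PHRASES) {
    if (lower.includes(phrase)) problems.push(`banned phrase: ${phrase}`);
  }
  if (blurb.includes("\u2014")) problems.push("em dash");
  if (!/\byou(r)?\b/i.test(blurb)) problems.push("not second person");
  // Rough sentence count: split on ., !, ? followed by whitespace or end of text.
  const sentences = blurb.split(/[.!?](?:\s+|$)/).filter((s) => s.trim().length > 0);
  if (sentences.length > 3) problems.push("more than 3 sentences");
  return problems;
}
```

A check like this cannot verify the last rule (every claim grounded in supplied data); that still depends on the prompt and on review.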
Failure mode
If OPENAI_API_KEY is missing or OpenAI is down, the call throws; the action layer catches and the school card renders without the blurb. The rest of the card still works.
Cost per user interaction
| Action | Cost (USD, 2026-05) |
|---|---|
| Browsing rails (any number of swipes) | $0 — no AI call |
| Opening a college card (first time) | ~$0.0003 — one gpt-5.4-nano blurb |
| Re-opening the same card within 30 days | $0 — served from collegeReasoning cache |
| Reach/Fit/Safety verdict (computed once per profile change, cached) | $0 — pure math |
| The Afford peek | $0 — federal formula |
Embedding + vector search costs from V1: gone. V2 has no embedding model, no vector store, no cosine search.
Blurb request flow
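The flow, assembled from the pieces described in this section, can be sketched in memory: check the cache for the `(userId, unitId)` pair, call the model on a miss, store the result, and fall back to no blurb on failure. The `Map` stands in for the `collegeReasoning` table and `generate` for the OpenAI call; the read-time age check is an assumption (the document only says the weekly cron evicts rows older than 30 days).

```typescript
// In-memory sketch of the cache-then-generate flow. The Map stands in for the
// collegeReasoning table and `generate` for the OpenAI call; both are stand-ins.
type CachedBlurb = { text: string; createdAt: number };
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;
const blurbCache = new Map<string, CachedBlurb>();

async function getBlurb(
  userId: string,
  unitId: number,
  generate: () => Promise<string>,
  now: number = Date.now(),
): Promise<string | null> {
  const key = `${userId}:${unitId}`; // one entry per (userId, unitId)
  const hit = blurbCache.get(key);
  // Cache hit within 30 days: no AI call. (Read-time check is an assumption.)
  if (hit && now - hit.createdAt < THIRTY_DAYS_MS) return hit.text;
  try {
    const text = await generate(); // cache miss: one paid call
    blurbCache.set(key, { text, createdAt: now });
    return text;
  } catch {
    return null; // missing key or provider outage: card renders without a blurb
  }
}
```

Returning `null` rather than throwing keeps the failure local to the blurb, so the rest of the card is unaffected.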
Operational details
- One env var. `OPENAI_API_KEY` on the Convex production deployment. Set via `npx convex env set`.
- Two weekly cleanup crons. `rfsVerdicts cleanup` (Sunday 9:00 UTC) and `collegeReasoning cleanup` (Sunday 10:00 UTC) evict rows older than 30 days. Both self-recurse via the scheduler until backlog clears.
- Backfill safety. New college fields can be added to Scorecard mappers without re-running V1-style ingestion. The monthly `scorecard refresh` cron picks them up on the 1st of each month.
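The "self-recurse until backlog clears" pattern in the cleanup crons can be sketched without Convex. Here an array stands in for the table and a loop stands in for the scheduler re-invoking the job; the batch size and all names are hypothetical.

```typescript
// In-memory sketch; an array stands in for a Convex table and the while loop
// stands in for the scheduler re-invoking the job. Names are hypothetical.
type StoredRow = { id: number; createdAt: number };
const MAX_ROW_AGE_MS = 30 * 24 * 60 * 60 * 1000;

// Delete up to batchSize expired rows. Returns true when expired rows remain.
function evictBatch(table: StoredRow[], now: number, batchSize: number): boolean {
  let deleted = 0;
  for (let i = table.length - 1; i >= 0 && deleted < batchSize; i--) {
    if (now - table[i].createdAt > MAX_ROW_AGE_MS) {
      table.splice(i, 1);
      deleted++;
    }
  }
  return table.some((row) => now - row.createdAt > MAX_ROW_AGE_MS);
}

// Run batches until nothing expired is left; returns the number of passes.
function runCleanup(table: StoredRow[], now: number, batchSize = 100): number {
  let passes = 1;
  while (evictBatch(table, now, batchSize)) passes++; // "reschedule" until the backlog clears
  return passes;
}
```

Bounding each pass keeps any single invocation small, which is the usual reason to split a large delete across scheduled runs.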
Technical detail
Where to read the prompt
packages/backend/convex/lib/openai.ts exports buildReasoningMessages(user, college). The system prompt is the first message; the user message is the structured fact block. Read both before touching either — anti-slop rules are encoded as terse imperatives, easy to weaken accidentally.
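For orientation before opening the file, a builder of this kind might look like the sketch below. Everything in it is hypothetical: the field names, the prompt wording, and the fact-block format are illustrations, and the real `ReasoningUserContext`, `ReasoningCollegeContext`, and system prompt in `openai.ts` may differ.

```typescript
// Hypothetical field names and prompt text; the real types and system prompt
// in openai.ts may differ.
type UserContextSketch = { gpa: number; sat?: number };
type CollegeContextSketch = {
  collegeName: string;
  admitRate: number; // fraction, e.g. 0.09
  netPriceUsd: number;
  matchingPrograms: string[];
};
type PromptMessage = { role: "system" | "user"; content: string };

const SYSTEM_PROMPT_SKETCH = [
  "Write 2-3 short sentences in second person.",
  "Use only the facts provided. Do not invent facts.",
  "No superlatives. No marketing buzzwords. No em dashes.",
].join("\n");

function buildMessagesSketch(user: UserContextSketch, college: CollegeContextSketch): PromptMessage[] {
  // The user message is a plain fact block; omitted facts produce no line.
  const facts = [
    `student_gpa: ${user.gpa}`,
    user.sat !== undefined ? `student_sat: ${user.sat}` : null,
    `college: ${college.collegeName}`,
    `admit_rate: ${(college.admitRate * 100).toFixed(0)}%`,
    `net_price_usd: ${college.netPriceUsd}`,
    `matching_programs: ${college.matchingPrograms.join(", ")}`,
  ].filter((line): line is string => line !== null);
  return [
    { role: "system", content: SYSTEM_PROMPT_SKETCH },
    { role: "user", content: facts.join("\n") },
  ];
}
```

Keeping the builder pure (no I/O, no clock) means the exact prompt for any student and school can be reproduced and diffed in a test.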
Why not Claude or Gemini
Both are viable. gpt-5.4-nano was chosen for (a) lowest current $/M-input-token at acceptable quality, (b) low p99 latency on short outputs, (c) the existing OpenAI account already had org-level cost controls configured. Switching providers is a one-file change in openai.ts (HTTP client + auth header + response shape). The blurb prompt would need re-tuning for the new model's defaults — superlative + em-dash bans are model-agnostic but each model has its own slop fingerprint.
Why not fine-tune
The blurb's job is to summarize structured data the model is already given. There is no domain-specific vocabulary to teach. Fine-tuning would add operating burden (training pipeline, model versioning, eval set) without measurable quality gain over a good system prompt with strict guardrails.
Why not on-device
expo-router + Hermes does not run an LLM on-device at acceptable latency on mid-range Android. The blurb is short enough that a 300ms cloud round-trip is below the threshold a user notices.
What was deliberately not built
- Re-running the V1 ingestion pipeline anywhere in V2. The `scripts/ingestion/*` files exist only on `legacy/v1-archive`.
- A vector store. No Convex RAG component, no Pinecone, no pgvector.
- A "find similar schools" semantic search. Browse uses a Convex `searchIndex` on `identity.name` + structured filters (`active`, `primaryCategory`, `state`, `ownership`). See `colleges.searchIndex("search_colleges_v2", ...)` in `packages/backend/convex/schema.ts`.
If a future product decision adds back semantic search, do it as an additive layer on top of the current deterministic engine. Do not unwind the determinism.