A design review of the architecture decisions behind an AI system that runs the first three meetings of a Solutions Engineering engagement — adaptive discovery, deterministic gap analysis, and vendor-grounded recommendations.
Most enterprise security and infrastructure purchases don't fail because the vendor's product is weak. They fail because the discovery conversation never surfaced the real problem. A customer says "we want Zero Trust" or "we need better IAM" — and a junior SE takes that statement at face value, runs a demo of relevant features, and loses the deal to a competitor who asked one more question.
The operational pain isn't technical. It's that discovery quality is inconsistent and person-dependent. A senior SE with ten years of pattern-matching experience asks sharper follow-up questions than someone six months into the role, and that gap directly determines whether the resulting recommendation addresses the customer's actual constraint or just their stated one. Organisations experience this because discovery skill is tacit knowledge — it lives in individual SEs' heads, not in a repeatable process, which means deal quality varies with headcount turnover and ramp time rather than with the strength of the product being sold.
The harder version of this problem: even when discovery is done well, the output is usually a set of notes in a CRM, not a structured artefact that can be scored, compared, or handed to a second person without re-explaining the whole conversation. That's the gap this system targets — not replacing the SE's judgement, but giving the judgement a consistent structure to operate inside.
The intended customer profile is a mid-market to enterprise organisation — typically 500 to 10,000 employees — that has grown its identity and security posture organically rather than by design. Four scenarios are modelled directly in the system, each reflecting a distinct shape of this problem: a 500-person SaaS company with Okta deployed but no SCIM automation; a 2,000-person FinTech with hybrid on-prem and cloud identity and a VPN-dependent access model; a 750-person healthcare provider with PHI exposure and no identity governance; and a 10,000-person enterprise mid-acquisition, integrating a second company's identity estate.
What these four have in common is the operational profile that makes discovery hard: a small identity or security team (often two to four people) responsible for an estate that has outgrown manual processes, genuine compliance pressure with a fixed deadline (an audit, a certification renewal, a board mandate), and at least one existing tooling investment that any recommendation has to account for rather than ignore. The business driver is rarely "we read about Zero Trust and want it". It is usually a near-miss incident, an upcoming audit, or a new executive (often a CISO) who inherited a posture they didn't build and needs to defend it credibly within months, not years.
Five engines sit behind one FastAPI backend, each responsible for one stage of the engagement. The frontend is a single static HTML page — no build step, no framework — that calls these engines through versioned REST routes and renders streamed responses directly.
The shape that matters most here is the split between the discovery/recommendation/executive engines (which call an LLM) and the gap analysis engine (which doesn't call an LLM at all — it's plain Python pattern-matching over extracted entities). That split is the subject of the next section, because it was the single decision that shaped everything downstream.
"no scim" against a list that actually contained "no scim automation" — a list-membership check, not a substring check — so the detector silently never fired despite the right words being present. Fixed by rewriting the matcher to do genuine substring search across the full transcript text, and broadening the keyword set to match how people actually talk ("it took us seventeen days to find and kill their access") rather than only security-jargon phrasing.vendor_capabilities table with embeddings, retrieved by similarity search before the LLM ever reasons about which vendor fits.choices key and threw an unhelpful KeyError when a rate-limited response didn't. Fixed by checking the response shape explicitly and surfacing the provider's actual error text, which materially shortened debugging time the next time it happened.top_3_priorities is meant to be a list of gap ID strings; the in-memory object always had it computed correctly, but the RDS-reload function originally sliced the first three full gap objects instead of extracting their IDs — invisible until a restart actually forced that code path to run for the first time. The same shape mismatch recurred independently in the recommendations loader and the generic advanced-analysis loader, each only surfacing the moment a restart exercised that specific path. The fix in each case was the same principle: make the RDS-reload path reconstruct the exact same typed object the in-memory path produces, not just return compatible-looking data.Walking through what happens when a customer answers a discovery question, end to end.
The current architecture comfortably handles a single user running one demo session at a time — which is its actual purpose. A few concrete changes would be needed to take it further, and the reasoning for each is specific rather than generic.
Connection pooling. Each database call currently opens and closes its own connection rather than drawing from a pool. At low concurrency this is invisible; under genuine multi-user load it would become the first bottleneck, well before the LLM calls themselves.
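As a rough illustration of what drawing from a pool means, here is a minimal sketch. It assumes a fixed-size pool and uses the standard library's sqlite3 as a stand-in for the real PostgreSQL driver; the `ConnectionPool` class and its parameters are hypothetical, not the system's code.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Tiny fixed-size pool. sqlite3 is only a stand-in for the PostgreSQL driver."""

    def __init__(self, size: int, dsn: str = ":memory:") -> None:
        self._idle: queue.Queue = queue.Queue(maxsize=size)
        self.opened = 0  # how many connections were ever opened
        for _ in range(size):
            # check_same_thread=False lets a pooled connection be used by any worker thread.
            self._idle.put(sqlite3.connect(dsn, check_same_thread=False))
            self.opened += 1

    @contextmanager
    def connection(self, timeout: float = 5.0):
        # Blocks while the pool is exhausted; raises queue.Empty after `timeout` seconds.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        except Exception:
            conn.rollback()  # do not hand a half-finished transaction to the next caller
            raise
        finally:
            self._idle.put(conn)  # return to the pool instead of closing


pool = ConnectionPool(size=2)
for _ in range(10):
    with pool.connection() as conn:
        conn.execute("SELECT 1").fetchone()
```

Ten queries run here against two connections opened once at startup. In a real deployment the driver's own pool implementation would be the more likely choice than a hand-rolled one.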
Vendor card growth. The RAG layer currently covers eight vendors with hand-curated capability cards. Scaling that to fifty or a hundred vendors is mechanically straightforward — more rows, more embeddings — but the curation work to keep each card accurate as vendor products change is a real, ongoing cost that doesn't scale automatically with the data.
Caching the embedding step. Every recommendation call currently re-embeds the gap context fresh. Gap phrasing is drawn from a fixed catalogue, so the embedding for a given gap ID is effectively static and could be cached rather than recomputed on every request.
Multi-tenant session isolation. Sessions are addressable by UUID today with no ownership boundary. Supporting multiple simultaneous SE users without one being able to enumerate another's session IDs would require an actual authentication and authorisation layer in front of the session API — not a large change, but a necessary one before this could be anything other than a single-operator demo.
Observability first. Right now, diagnosing a failure means reading raw journalctl output and tracing a stack trace by hand — which is exactly what happened repeatedly during development. A production version needs structured logging with request IDs that tie a failure to a specific session and engine call, plus basic dashboards on LLM provider success rate per fallback tier, since the four-provider chain currently has no visibility into which provider is actually serving traffic at any given moment.
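One way this could look, sketched with only the standard library's logging module. The field names (`request_id`, `session_id`, `engine`, `provider`) and the logger name are assumptions chosen for illustration, and the log is written to an in-memory buffer here only so the result can be inspected.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying the correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed through `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "session_id": getattr(record, "session_id", None),
            "engine": getattr(record, "engine", None),
            "provider": getattr(record, "provider", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # a real service would log to stderr or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("engagement")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

request_id = str(uuid.uuid4())
logger.info(
    "llm call served",
    extra={
        "request_id": request_id,
        "session_id": "demo-session",
        "engine": "discovery",
        "provider": "fallback-2",  # which tier of the fallback chain answered
    },
)

entry = json.loads(stream.getvalue())
```

With the provider recorded on every call, a success rate per fallback tier becomes a simple aggregation over log lines rather than a manual read of journalctl output.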
Multi-tenancy and RBAC. Today there's no concept of "which SE owns this session". Adding that means an actual users table, session ownership, and route-level authorisation checks before any session data is returned, rather than relying on the assumption that nobody will guess another session's UUID.
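The core of such a check is small. The sketch below is framework-free and entirely hypothetical: the `Session` shape, the in-memory `SESSIONS` store, and `get_session_for` are illustrative names, and the authenticated user ID is assumed to come from whatever authentication layer sits in front.

```python
from dataclasses import dataclass


class NotFound(Exception):
    """Raised for both missing sessions and sessions owned by someone else."""


@dataclass(frozen=True)
class Session:
    session_id: str
    owner_id: str
    transcript: str


# Stand-in for the sessions table; a real version would query the database.
SESSIONS = {
    "s-1": Session("s-1", "se-alice", "discovery transcript"),
}


def get_session_for(user_id: str, session_id: str) -> Session:
    session = SESSIONS.get(session_id)
    # Treat "missing" and "not yours" identically so callers cannot probe for valid IDs.
    if session is None or session.owner_id != user_id:
        raise NotFound(session_id)
    return session


own = get_session_for("se-alice", "s-1")

try:
    get_session_for("se-bob", "s-1")
    other_user_blocked = False
except NotFound:
    other_user_blocked = True
```

In a route handler the `NotFound` case would typically map to a 404 response, which avoids confirming to the caller that the session ID exists at all.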
Reliability beyond the LLM fallback chain. The four-provider LLM fallback is real resilience, but the database is still a single RDS instance with no read replica and no tested automated failover. A production deployment needs to ask the same "what happens when this fails" question about PostgreSQL that's already been answered for the LLM layer.
CI/CD. Deployment today is a manual git pull and systemctl restart on the EC2 host — which is exactly the kind of manual step that caused real problems during development, including a service restart that briefly served stale frontend JavaScript against fixed backend code. A real pipeline would build, test, and deploy automatically, with a health check gating the cutover.
Audit logging. Gap analysis is deterministic and reproducible, which is a strong foundation — but there's no record today of who ran an analysis, when, or what the system prompt looked like at that point in time. For anything resembling a compliance artefact, that provenance matters as much as the determinism itself.
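A sketch of what one provenance record might hold, under the assumption that a hash of the system prompt is enough to identify the prompt version (with the full text stored elsewhere). The `audit_record` function and its field names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from typing import Iterable, Optional


def audit_record(
    user_id: str,
    session_id: str,
    system_prompt: str,
    gap_ids: Iterable[str],
    now: Optional[datetime] = None,
) -> dict:
    """Build one audit entry: who ran the analysis, when, and against which prompt."""
    now = now or datetime.now(timezone.utc)
    return {
        "user_id": user_id,
        "session_id": session_id,
        "ran_at": now.isoformat(),
        # The digest pins the exact prompt text without storing it in every row.
        "system_prompt_sha256": hashlib.sha256(system_prompt.encode("utf-8")).hexdigest(),
        "gap_ids": sorted(gap_ids),
    }


fixed_time = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary timestamp for the example
record_a = audit_record("se-alice", "s-1", "prompt v1", ["G2", "G1"], now=fixed_time)
record_b = audit_record("se-alice", "s-1", "prompt v2", ["G2", "G1"], now=fixed_time)
```

Two runs that differ only in prompt text produce different digests, so a later reviewer can tell which prompt version was in force for a given analysis.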
Deciding that gap scoring should be deterministic code, not LLM output, was the easy part of the design. The hard part was that "deterministic" still depends on correctly extracted entities feeding into it — and entity extraction is itself an AI step. The bug where keyword detection silently failed due to a substring-matching error wasn't a failure of the deterministic-vs-AI principle; it was a reminder that even the deterministic half of the system has real bugs that look exactly like AI flakiness until you actually read the code.
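The difference between the two checks is easy to miss in Python because both use the `in` operator. The sketch below reproduces the shape of the bug with made-up data; the phrase list, transcript, and `matched_keywords` helper are illustrative, not the system's code.

```python
# Hypothetical extracted phrases, as the buggy detector saw them.
extracted_phrases = ["no scim automation", "okta deployed"]

# List membership compares whole elements, so a shorter keyword never matches.
buggy_hit = "no scim" in extracted_phrases

transcript = "We have Okta, but there's no SCIM automation, so offboarding is manual."
KEYWORDS = ("no scim", "offboarding is manual")  # hypothetical keyword set


def matched_keywords(text: str, keywords: tuple) -> list:
    """Substring search over the full transcript text, case-insensitive."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]


fixed_hits = matched_keywords(transcript, KEYWORDS)
```

Here `buggy_hit` is `False` even though the words are present, while the substring version finds both keywords. Nothing raises and nothing is logged in the buggy case, which is why it read as model flakiness rather than a code defect.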
An early version of the discovery prompt used a numbered list to teach the model a response shape — "1. reflect, 2. explain why it matters, 3. ask a question." A smaller fallback model, when it served a request, occasionally treated those labels as content to narrate rather than structure to follow, producing answers that literally said "why that matters for this conversation is..." as visible output. Replacing the numbered instructions with worked examples — showing the shape rather than labelling it — fixed this completely. That's now a standing rule for every prompt in the system.
The shape-mismatch bugs between the in-memory cache and the RDS-reload path were each invisible for as long as the same process stayed running — which, during active development, was most of the time. Every one of them surfaced only after a deploy or restart forced the previously-untested reload path to actually execute. The lesson that stuck: if a code path only runs after a restart, it needs to be tested by actually restarting, not inferred to be correct because the in-memory version worked.
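Short of restarting the service, a round-trip test can exercise the reload path directly: build the object the way the in-memory path does, serialise it, reload it, and require equality. The sketch below assumes a JSON row format and uses hypothetical `Gap` and `GapAnalysis` dataclasses; the real types are not shown in this article.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Gap:
    gap_id: str
    severity: int


@dataclass
class GapAnalysis:
    gaps: list[Gap]
    top_3_priorities: list[str]  # gap ID strings, not full gap objects


def build_analysis(gaps: list[Gap]) -> GapAnalysis:
    """In-memory path: rank by severity and keep the IDs of the top three."""
    ranked = sorted(gaps, key=lambda g: g.severity, reverse=True)
    return GapAnalysis(gaps=ranked, top_3_priorities=[g.gap_id for g in ranked[:3]])


def to_row(analysis: GapAnalysis) -> str:
    return json.dumps(asdict(analysis))


def from_row(row: str) -> GapAnalysis:
    """Reload path: rebuild the same typed object, not just similar-looking dicts."""
    data = json.loads(row)
    return GapAnalysis(
        gaps=[Gap(**g) for g in data["gaps"]],
        top_3_priorities=list(data["top_3_priorities"]),
    )


original = build_analysis([Gap("G1", 2), Gap("G2", 5), Gap("G3", 4), Gap("G4", 1)])
reloaded = from_row(to_row(original))
assert reloaded == original  # dataclass equality compares every field
```

A reload function that returned the first three gap objects in place of their IDs would fail this equality check immediately, without waiting for a deploy to expose it.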
The generic _load() function shared across four different analysis types (objections, architecture options, stakeholders, deal risk) was meant to reduce duplication, but it also meant a single shape bug there silently affected all four call sites at once, surfacing one at a time as each was exercised. A version with explicit per-type loaders, even at the cost of some repeated code, would have caught this category of bug in one pass instead of four separate debugging sessions across a single afternoon.