A design review of the architecture decisions behind a natural-language interface over live AWS findings — built so every answer is grounded in retrieved evidence, never in the model's own assumptions about an account it has never seen.
A platform or security team operating real AWS infrastructure almost never lacks data. IAM has a credential report. GuardDuty has findings. Cost Explorer has daily spend by service. CloudTrail has every API call. The operational pain isn't missing visibility — it's that the answer to a specific question ("why did our bill spike Tuesday?", "which admin accounts haven't logged in this quarter?") requires manually cross-referencing three or four different consoles, each with its own query language and its own mental model.
This is why the same question keeps becoming a support ticket or a Slack thread instead of something a team member can answer themselves in thirty seconds. Organisations experience this because each AWS service was designed to answer questions about itself, not questions that span services — and building a custom dashboard that joins IAM, cost, and threat data is a real engineering project most platform teams never get time to prioritise, even though the underlying data was always there.
The intended customer is a platform engineering or cloud operations team, typically three to fifteen engineers, responsible for a non-trivial AWS footprint that has grown faster than the team's tooling has matured. The scale that makes this problem real is somewhere between a few dozen and a few hundred AWS resources across IAM, S3, EC2, and related services: small enough that a dedicated security operations tool feels like overkill, large enough that manual console-checking has stopped scaling.
The operational challenge is almost always the same shape: Finance asks why a cost line moved, Security asks about dormant privileged accounts, and the platform team has all of the underlying data but no single interface that connects it. The business driver is rarely a formal security mandate — it's usually a specific recurring frustration (a cost anomaly that took two days to root-cause manually, a security review that required pulling data from four different places) that makes someone ask "why can't we just ask the account directly?"
Collectors poll real AWS services on a schedule and write structured findings into PostgreSQL. A chat interface then runs semantic search over those stored findings before ever involving an LLM. The model only ever reasons over retrieved evidence; it never has open-ended access to the AWS account itself.
The collector-then-retrieve-then-reason pipeline is a deliberate three-stage separation, and the reasoning behind keeping these stages distinct — rather than letting the LLM query AWS directly — is the first entry in the next section.
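The retrieve-then-reason half of that pipeline can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the Finding class, the three-dimensional toy embeddings, the sample findings, and the prompt wording are all hypothetical, and an in-memory cosine ranking stands in for the pgvector query the real system would run against PostgreSQL.

```python
import math
from dataclasses import dataclass


@dataclass
class Finding:
    service: str            # which collector produced it (hypothetical field)
    text: str               # human-readable finding
    embedding: list         # toy vector; a real one comes from an embedding model


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, findings, k=2):
    """Stage 2: rank stored findings by similarity and keep the top k."""
    ranked = sorted(findings, key=lambda f: cosine(query_embedding, f.embedding), reverse=True)
    return ranked[:k]


def build_prompt(question, evidence):
    """Stage 3 input: the model sees only retrieved findings, never AWS itself."""
    lines = [f"- [{f.service}] {f.text}" for f in evidence]
    return (
        "Answer using only the findings below. If they are insufficient, say so.\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )


# Made-up findings standing in for rows written by the collectors (stage 1).
FINDINGS = [
    Finding("iam", "User alice has no MFA device enrolled", [1.0, 0.0, 0.0]),
    Finding("cost", "EC2 spend increased on Tuesday", [0.0, 1.0, 0.0]),
    Finding("iam", "User bob has console access and no MFA", [0.9, 0.1, 0.0]),
]

evidence = retrieve([1.0, 0.0, 0.0], FINDINGS, k=2)
prompt = build_prompt("Which IAM users have no MFA enrolled?", evidence)
```

The point of the shape is that build_prompt is the only path into the model, so anything the model says can be traced back to a specific stored finding.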
INVALID_PAYMENT_INSTRUMENT error despite a verified payment method on file, eventually traced to the specific model version being deprecated and requiring an inference-profile ARN rather than a direct model ID. That single incident justified the fallback chain on its own.
Walking through what happens when a user asks "which IAM users have no MFA enrolled?"
Horizontal collectors. The six collectors currently run sequentially on a single schedule. Splitting each into its own independently scheduled job, with its own retry and back-off logic, would stop a slow GuardDuty poll from blocking a fast IAM poll, and would make it straightforward to add new collectors without touching existing ones.
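One way to picture the isolation this would buy, as a sketch only: the collector functions below are hypothetical stand-ins, and a thread pool is used here purely for illustration where a real deployment might prefer separate scheduled jobs or processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_collectors(collectors):
    """Run each collector in its own worker.

    Returns (name, findings, error) tuples in completion order, so a slow or
    failing collector never delays or hides the results of the others.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max(1, len(collectors))) as pool:
        futures = {pool.submit(fn): name for name, fn in collectors.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results.append((name, future.result(), None))
            except Exception as exc:  # isolate the failure to this collector
                results.append((name, [], repr(exc)))
    return results


# Hypothetical collectors: one slow, one fast, one failing.
def slow_guardduty():
    time.sleep(0.2)
    return ["guardduty finding"]


def fast_iam():
    return ["iam finding"]


def broken_cost():
    raise RuntimeError("throttled")


results = run_collectors(
    {"guardduty": slow_guardduty, "iam": fast_iam, "cost": broken_cost}
)
```

Adding a new collector in this shape means adding one entry to the dictionary; none of the existing functions change.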
Event-driven ingestion. Polling on a fixed schedule means findings are only ever as fresh as the last run. For services that support it (CloudTrail, GuardDuty), moving to EventBridge-triggered ingestion would mean findings appear in near real time rather than on a 15-minute polling cycle.
Multiple AWS accounts. The current design assumes a single AWS account. Supporting an AWS Organization with many accounts would mean either a per-account collector deployment or a centralised collector assuming cross-account roles — a meaningfully different IAM model than the single-account read-only role used today.
Caching repeated questions. Common questions ("what's our risk posture right now") re-embed and re-search on every ask, even when the underlying findings haven't changed since the last collection run. A cache keyed on findings-version plus question would avoid redundant retrieval and LLM calls for genuinely repeated questions.
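A minimal sketch of such a cache, assuming the collectors expose some monotonically changing findings version (a run ID or timestamp would do). The AnswerCache class and the answer_fn callable are hypothetical names for illustration; answer_fn stands in for the full embed, search, and LLM path.

```python
import hashlib


def cache_key(findings_version, question):
    """Key on the findings version plus a whitespace- and case-normalised question."""
    normalised = " ".join(question.lower().split())
    return hashlib.sha256(f"{findings_version}:{normalised}".encode()).hexdigest()


class AnswerCache:
    def __init__(self, answer_fn):
        self._answer_fn = answer_fn   # the expensive retrieval + LLM call
        self._store = {}
        self.misses = 0

    def ask(self, findings_version, question):
        key = cache_key(findings_version, question)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._answer_fn(question)
        return self._store[key]


# Usage: the second identical question is served from the cache; a new
# collection run (version 2) invalidates it simply by changing the key.
cache = AnswerCache(lambda q: f"answer to: {q}")
cache.ask(1, "What's our risk posture right now?")
cache.ask(1, "what's our  risk posture right now?")
cache.ask(2, "What's our risk posture right now?")
```

Because the version is part of the key, no explicit invalidation is needed, though a real implementation would also want to evict entries for old versions.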
Observability on the fallback chain itself. The four-provider LLM fallback is genuine resilience, but there's currently no dashboard showing which provider actually served the last hundred requests, or how often the chain falls past Bedrock. That visibility is the first thing a production on-call engineer would need and the first thing currently missing.
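The raw numbers for that dashboard could come from the chain itself. The sketch below is an assumption about shape rather than the project's code: FallbackChain, the provider names, and the callables are hypothetical, and in production the counters would be exported to a metrics system rather than held in memory.

```python
from collections import Counter


class FallbackChain:
    def __init__(self, providers):
        self.providers = providers    # ordered list of (name, callable)
        self.served_by = Counter()    # which provider answered each request
        self.failures = Counter()     # how often each provider failed

    def complete(self, prompt):
        errors = []
        for name, call in self.providers:
            try:
                answer = call(prompt)
            except Exception as exc:
                self.failures[name] += 1
                errors.append(f"{name}: {exc}")
                continue              # fall through to the next provider
            self.served_by[name] += 1
            return name, answer
        raise RuntimeError("all providers failed: " + "; ".join(errors))

    def fell_past(self, name):
        """Fraction of served requests answered by a provider other than `name`."""
        total = sum(self.served_by.values())
        return 0.0 if total == 0 else 1 - self.served_by[name] / total


# Hypothetical providers: the primary is unavailable, the secondary works.
def bedrock_down(prompt):
    raise RuntimeError("access denied")


def secondary(prompt):
    return f"echo: {prompt}"


chain = FallbackChain([("bedrock", bedrock_down), ("secondary", secondary)])
for _ in range(3):
    chain.complete("ping")
```

With counters like these, "how often does the chain fall past Bedrock" becomes a single call to fell_past("bedrock") instead of a log search.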
Multi-account, multi-tenant architecture. Moving from one AWS account to supporting many customers' accounts means re-architecting the IAM model entirely — cross-account roles, per-tenant data isolation in the findings table, and a customer-facing concept of "which account is this finding from" that doesn't exist today because there's only ever been one.
Collector reliability. Each collector currently runs as a best-effort scheduled job with print-statement logging. Production needs structured logging, retry with back-off on transient AWS API failures, and alerting when a collector silently stops producing findings — distinct from alerting on the findings themselves.
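As a rough sketch of the first two items, the wrapper below adds exponential back-off and one JSON log line per attempt. The names with_retry and log_event are hypothetical, and for brevity it retries on any exception; a real version would retry only errors known to be transient, such as throttling.

```python
import json
import logging
import time

log = logging.getLogger("collector")


def log_event(event, **fields):
    """Emit one machine-parseable JSON line instead of a print statement."""
    log.info(json.dumps({"event": event, **fields}, sort_keys=True))


def with_retry(name, collect, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call collect(), retrying with delays of base_delay * 1, 2, 4, ... seconds."""
    for attempt in range(1, attempts + 1):
        try:
            findings = collect()
        except Exception as exc:
            log_event("collector_error", collector=name, attempt=attempt, error=repr(exc))
            if attempt == attempts:
                raise                 # give up and let alerting see the failure
            sleep(base_delay * 2 ** (attempt - 1))
        else:
            log_event("collector_ok", collector=name, attempt=attempt, findings=len(findings))
            return findings
```

The collector_ok event carries a findings count, which is what an alert for a collector that has silently stopped producing findings could be built on.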
CI/CD and audit trail. Deployment today follows the same manual git pull and systemctl restart pattern as the other two portfolio projects sharing this EC2 host — the same gap, the same fix needed: an automated pipeline with health-check gating before any production rollout.
Performance under real findings volume. The current pgvector setup is untested against the kind of findings volume a genuinely large AWS estate (thousands of resources, months of CloudTrail history) would produce. Index tuning, and potentially a different retrieval strategy, would need real load testing before operating at that scale, rather than an assumption that what works today keeps working unchanged.
The most consequential realisation in this project's history: the seeded findings data was always specific and well-written, but the chat answers still came out flattened and generic until the system prompt was rewritten to explicitly require mechanism-of-risk reasoning, relative prioritisation, business translation, and remediation sequencing. Good data is necessary but not sufficient — the model has to be told what to do with it, in detail, not just told to "be helpful."
Twelve-plus hours of Bedrock being unavailable due to a payment-instrument verification issue, on an account with a genuinely valid payment method, is the kind of failure that doesn't show up in a model's own documentation — it's an artefact of how a specific cloud provider gates access at the account level. Building real resilience meant treating "the AWS account itself might not cooperate" as a first-class failure mode, not just "the model might be down."
One fallback provider's client code originally assumed every response contained a choices key, and threw an unhelpful KeyError when a rate-limited response didn't — completely obscuring the actual cause until the response shape was inspected directly. The fix, checking the response shape explicitly and surfacing the provider's real error text, is a small code change that materially shortened every subsequent debugging session involving that provider.
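A sketch of what that explicit check might look like. The response layout here is an assumption: a chat-completions-style body with choices[0]["message"]["content"] on success and an error object carrying a message on failure. The ProviderError and extract_text names are hypothetical.

```python
class ProviderError(RuntimeError):
    """Raised with the provider's own error text instead of a bare KeyError."""


def extract_text(response):
    """Return the completion text, or raise with the provider's real error."""
    if not isinstance(response, dict):
        raise ProviderError(f"unexpected response type: {type(response).__name__}")
    choices = response.get("choices")
    if not choices:
        error = response.get("error")
        # Assumed shape: {"error": {"message": "..."}} or {"error": "..."}
        detail = error.get("message", error) if isinstance(error, dict) else error
        raise ProviderError(f"provider returned no choices: {detail or response}")
    return choices[0]["message"]["content"]


ok = {"choices": [{"message": {"content": "Two users lack MFA."}}]}
rate_limited = {"error": {"message": "Rate limit exceeded", "code": 429}}

text = extract_text(ok)
try:
    extract_text(rate_limited)
except ProviderError as exc:
    surfaced = str(exc)
```

Here surfaced contains the provider's "Rate limit exceeded" text, which is the information the original KeyError was hiding.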
The intent-detection step that scopes semantic search to a specific service (IAM, cost, etc.) is currently simple keyword matching written directly into the chat handler. A cleaner design would separate intent classification into its own testable component with its own test cases, rather than inline logic that's easy to silently break while changing something else in the same function.
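Pulled out of the chat handler, the classifier could be as small as the sketch below. The keyword table and the classify_intent name are hypothetical, chosen only to show the shape; the real keyword lists would come from the existing inline logic.

```python
# Hypothetical keyword table; each service maps to phrases that suggest it.
INTENT_KEYWORDS = {
    "iam": ("iam", "mfa", "user", "role", "access key", "login"),
    "cost": ("cost", "bill", "spend", "budget"),
    "threat": ("guardduty", "threat", "attack", "malicious"),
}


def classify_intent(question):
    """Return the service with the most keyword hits, or None to search everything.

    Matching is plain substring search, so it shares the weaknesses of any
    keyword approach (for example "iam" also matches inside "diamond").
    """
    text = question.lower()
    scores = {
        service: sum(keyword in text for keyword in keywords)
        for service, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Test cases that can now live next to the classifier, independent of the handler.
assert classify_intent("Which IAM users have no MFA enrolled?") == "iam"
assert classify_intent("Why did our bill spike Tuesday?") == "cost"
assert classify_intent("What's the weather?") is None
```

Because the function takes a string and returns a label, a change elsewhere in the chat handler can no longer alter its behaviour without a test failing.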