Engineering Notes

AWS Infrastructure & Cost Intelligence Copilot

A design review of the architecture decisions behind a natural-language interface over live AWS findings — built so every answer is grounded in retrieved evidence, never in the model's own assumptions about an account it has never seen.

~13 min read Architecture & design decisions

The answer usually exists, just not in one place

A platform or security team operating real AWS infrastructure almost never lacks data. IAM has a credential report. GuardDuty has findings. Cost Explorer has daily spend by service. CloudTrail has every API call. The operational pain isn't missing visibility — it's that the answer to a specific question ("why did our bill spike Tuesday?", "which admin accounts haven't logged in this quarter?") requires manually cross-referencing three or four different consoles, each with its own query language and its own mental model.

This is why the same question keeps becoming a support ticket or a Slack thread instead of something a team member can answer themselves in thirty seconds. Organisations experience this because each AWS service was designed to answer questions about itself, not questions that span services — and building a custom dashboard that joins IAM, cost, and threat data is a real engineering project most platform teams never get time to prioritise, even though the underlying data was always there.

Who this is built for

The intended customer is a platform engineering or cloud operations team, typically three to fifteen engineers, responsible for a non-trivial AWS footprint that has grown faster than the team's tooling has matured. The problem becomes real somewhere between a few dozen and a few hundred AWS resources across IAM, S3, EC2, and related services: small enough that a dedicated security operations tool feels like overkill, large enough that manual console-checking has stopped scaling.

The operational challenge is almost always the same shape: Finance asks why a cost line moved, Security asks about dormant privileged accounts, and the platform team has all of the underlying data but no single interface that connects it. The business driver is rarely a formal security mandate — it's usually a specific recurring frustration (a cost anomaly that took two days to root-cause manually, a security review that required pulling data from four different places) that makes someone ask "why can't we just ask the account directly?"

How a request moves through the system

Collectors poll real AWS services on a schedule and write structured findings into PostgreSQL. A chat interface then runs semantic search over those stored findings before ever involving an LLM. The model only ever reasons over retrieved evidence; it never has open-ended access to the AWS account itself.

AWS account (real, not simulated)
IAM · S3 · EC2 · GuardDuty · CloudTrail · Cost Explorer
↓ scheduled collection
Collectors (Python, boto3)
6 services polled, findings normalised into a common schema
↓
PostgreSQL + pgvector (RDS)
findings table, embedded via Titan Embeddings
↓ user question
Semantic search (pgvector)
cosine similarity retrieval over findings, scoped by detected intent
↓ retrieved findings as context
LLM fallback chain
Bedrock (Claude) → Groq → Gemini → OpenRouter
↓
Streamed, cited answer
references specific resource IDs, never a generic summary

The collector-then-retrieve-then-reason pipeline is a deliberate three-stage separation, and the reasoning behind keeping these stages distinct — rather than letting the LLM query AWS directly — is the first entry in the next section.

What was chosen, and why

Why collectors poll AWS on a schedule, rather than the LLM querying AWS directly
Decision
Six Python collectors using boto3 poll IAM, S3, EC2, GuardDuty, CloudTrail, and Cost Explorer on a schedule and write normalised findings into PostgreSQL. The chat layer never queries those services directly in response to a user question.
Why
Giving an LLM direct, live AWS API access in response to user questions is both a security risk and a latency problem — every question would require live API calls with their own auth, rate limits, and failure modes. Separating collection from question-answering means the chat path is fast and the blast radius of anything going wrong in collection is contained to a background job, not a live user-facing request.
Alternatives considered
An agent-style architecture where the LLM calls AWS SDK functions as tools mid-conversation. Considered and rejected for this system — it would have made answers slower, harder to make deterministic, and introduced a much larger surface for the model to make a live, possibly destructive AWS API call by mistake.
Trade-off
Findings are only as fresh as the last collection run — a genuinely new IAM change won't appear until the next scheduled poll, not instantly. For a portfolio demo this is invisible; in production it would need to be an explicit, communicated freshness guarantee.
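A minimal sketch of what "normalised into a common schema" could look like for one collector. The Finding dataclass, its field names, the "high" severity, and the normalise_credential_report helper are all assumptions made for illustration, not the project's actual schema. The CSV column names (user, arn, password_enabled, mfa_active) are those of the IAM credential report.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Finding:
    # Hypothetical common schema; the real findings table may differ.
    service: str
    resource_id: str
    severity: str
    title: str


def normalise_credential_report(report_csv: str) -> Iterator[Finding]:
    """Turn IAM credential report rows into findings for console users without MFA."""
    for row in csv.DictReader(io.StringIO(report_csv)):
        if row["password_enabled"] == "true" and row["mfa_active"] == "false":
            yield Finding(
                service="iam",
                resource_id=row["arn"],
                severity="high",  # illustrative label, not a sourced rule
                title=f"{row['user']} has console access without MFA",
            )


# In a real collector the CSV would come from boto3, roughly:
#   iam = boto3.client("iam")
#   iam.generate_credential_report()
#   report_csv = iam.get_credential_report()["Content"].decode("utf-8")
SAMPLE = (
    "user,arn,password_enabled,mfa_active\n"
    "alice,arn:aws:iam::111122223333:user/alice,true,false\n"
    "bob,arn:aws:iam::111122223333:user/bob,true,true\n"
)
findings = list(normalise_credential_report(SAMPLE))
```

Because the normaliser takes plain CSV text, it can be tested without AWS credentials, and a failure in the boto3 call stays inside the background job rather than the chat path.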
Why pgvector for retrieval, not full-text search or a vector-native database
Decision
Findings are embedded using Titan Embeddings and stored as vectors directly in the same PostgreSQL instance that holds the structured findings data, queried via cosine similarity.
Why
A question like "which accounts represent the biggest risk" doesn't share exact keywords with a finding titled "svc-legacy-backup has AdministratorAccess, 187 days inactive" — semantic retrieval finds the connection that keyword search would miss. Keeping vectors in the same Postgres instance as the structured data, rather than a separate vector-native database, avoids a second system to operate and keeps findings and embeddings transactionally consistent.
Alternatives considered
A dedicated vector database (Pinecone, Weaviate). Rejected for this scale — the findings volume here is small enough that a separate specialised vector store would add operational complexity without a corresponding retrieval-quality benefit.
Trade-off
If findings volume grew by orders of magnitude, pgvector's indexing characteristics would need real evaluation against a purpose-built vector store — this is a decision that's correct at current scale and worth revisiting if that scale changed materially.
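The retrieval step can be sketched as a single SQL query, assuming a hypothetical findings table with service, title, and embedding columns and a driver that accepts %(name)s parameters (psycopg2 style). The <=> operator is pgvector's cosine distance, so similarity is one minus that value. The pure-Python cosine_similarity function is only a reference for what the database computes.

```python
import math
from typing import Sequence

# Hypothetical table and column names: findings(id, service, title, embedding).
SEARCH_SQL = """
SELECT id, title, 1 - (embedding <=> %(query_vec)s::vector) AS similarity
FROM findings
WHERE service = %(service)s
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(limit)s
"""


def to_pgvector_literal(vec: Sequence[float]) -> str:
    """Format a Python vector as the bracketed text literal pgvector accepts."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Pure-Python reference for the ranking the query performs in Postgres."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Parameters a chat handler might pass to cursor.execute(SEARCH_SQL, params).
params = {"query_vec": to_pgvector_literal([0.1, 0.9]), "service": "iam", "limit": 5}
```

Keeping the service filter and the vector ordering in one query is what the single-instance design buys: no second system has to be consulted to join a similarity score back to its structured record.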
Why a four-provider LLM fallback chain, not Bedrock alone
Decision
Every chat response attempts Bedrock (Claude) first, then Groq, then Gemini, then OpenRouter, with each failure's exact reason logged before falling through.
Why
Bedrock model access in a fresh AWS account is genuinely gated in ways outside this system's control — Marketplace subscription state, regional model availability, and payment-instrument verification all sit between "code is correct" and "Bedrock actually responds." A single-provider architecture means any one of those account-level issues takes the whole system down with no workaround.
Alternatives considered
Bedrock-only, accepting downtime during account provisioning issues. Rejected directly after living through it — a Bedrock invocation blocked for over twelve hours on an INVALID_PAYMENT_INSTRUMENT error despite a verified payment method on file, eventually traced to the specific model version being deprecated and requiring an inference-profile ARN rather than a direct model ID. That single incident justified the fallback chain on its own.
Trade-off
Each provider has a different response shape and different failure modes, which means the client code carries real complexity to normalise across all four — and a fallback to a weaker model (Groq's smaller models, in particular) measurably changes answer quality, which is acceptable for demo continuity but is a real, visible trade-off, not a free one.
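A minimal sketch of the fall-through logic, assuming each provider is wrapped as a plain callable that returns text or raises. The ask_with_fallback name and the two stand-in providers are hypothetical; the real clients would carry the per-provider normalisation described above.

```python
import logging
from typing import Callable, Sequence, Tuple

logger = logging.getLogger("llm_fallback")

# A provider is a (name, callable) pair; the callable returns text or raises.
Provider = Tuple[str, Callable[[str], str]]


def ask_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, answer) from the first success."""
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Log the exact reason before falling through to the next provider.
            logger.warning("provider %s failed: %r", name, exc)
            failures.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(failures))


def broken_bedrock(prompt: str) -> str:
    # Stand-in that simulates the account-level failure described above.
    raise PermissionError("INVALID_PAYMENT_INSTRUMENT")


def fake_groq(prompt: str) -> str:
    return f"answer to: {prompt}"


served_by, answer = ask_with_fallback(
    "which IAM users have no MFA?",
    [("bedrock", broken_bedrock), ("groq", fake_groq)],
)
```

Returning the provider name alongside the answer keeps the quality trade-off visible to callers, since an answer served by a weaker model can be labelled as such.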
Why the system prompt explicitly separates priority reasoning from severity labels
Decision
The chat system prompt requires the model to explain the mechanism of risk, rank findings against each other in context, translate technical findings into business language, sequence remediation with rationale, and note ownership — rather than simply listing findings with their stored severity label.
Why
The underlying collector data was already specific and well-written — real usernames, real inactivity periods, real framework references. The first version of the system prompt still produced flattened, generic-sounding answers, because nothing in the prompt asked the model to reason about relationships between findings or translate severity into consequence. The data was never the bottleneck; the instructions were.
Alternatives considered
Adding more example findings to the prompt, assuming richer few-shot examples would fix flattened output. Rejected after closer reading of the actual prompt — the real issue was that "be direct and concise" was actively rewarding the model for listing rather than connecting, regardless of how rich the underlying findings were.
Trade-off
Trading brevity for connected reasoning means responses are longer. That's an intentional choice for this use case — a security team reading four well-reasoned sentences gets more value than ten generic bullet points — but it's a real trade-off against raw response speed and token cost.
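One way the five requirements could be combined with retrieved findings is sketched here. The wording of each instruction is illustrative only and does not reproduce the project's actual system prompt; build_prompt, the finding field names, and the "high" severity on the sample record are assumptions.

```python
from typing import Mapping, Sequence

# Illustrative wording; the real system prompt is not reproduced here.
REASONING_REQUIREMENTS = (
    "Explain the mechanism of risk for each finding.",
    "Rank the findings against each other in context.",
    "Translate technical findings into business language.",
    "Sequence remediation steps and give the rationale for the order.",
    "Note who owns each finding.",
)


def build_prompt(question: str, findings: Sequence[Mapping[str, str]]) -> str:
    """Combine reasoning requirements, retrieved findings, and the question."""
    rules = "\n".join(f"- {rule}" for rule in REASONING_REQUIREMENTS)
    evidence = "\n".join(
        f"[{f['resource_id']}] ({f['severity']}) {f['title']}" for f in findings
    )
    return f"Instructions:\n{rules}\n\nFindings:\n{evidence}\n\nQuestion: {question}"


prompt = build_prompt(
    "Which accounts represent the biggest risk?",
    [
        {
            "resource_id": "svc-legacy-backup",
            "severity": "high",  # assumed label for the example
            "title": "svc-legacy-backup has AdministratorAccess, 187 days inactive",
        }
    ],
)
```

Note what is absent: there is no "be direct and concise" line, because that instruction was what rewarded listing over connecting.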

One request, start to finish

Walking through what happens when a user asks "which IAM users have no MFA enrolled?"

Question arrives at the chat endpoint
The question and recent conversation history are posted to the backend along with a session identifier, which is generated on the first message of a new conversation and persisted for the rest of the session.
Intent detection narrows the search
Simple keyword matching in code (not an LLM call) detects that this question relates to IAM specifically, which scopes the subsequent semantic search to IAM findings rather than searching the entire findings table indiscriminately.
The question is embedded and findings are retrieved
Titan Embeddings converts the question into a vector; pgvector finds the most semantically similar stored findings within the IAM-scoped subset, returning the specific records — usernames, inactivity periods, severity, framework references — as structured context.
Retrieved findings are assembled into the LLM prompt
The system prompt (covering mechanism of risk, relative priority, business translation, remediation sequencing, and ownership) is combined with the retrieved findings and sent to the LLM fallback chain, starting with Bedrock.
The answer streams back token by token
Server-sent events stream the response as it generates, and the full exchange — question and answer — is persisted to the session's chat history table immediately after streaming completes, so a page refresh recovers the conversation rather than losing it.
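The final step, streaming and then persisting, can be sketched as a generator. The stream_as_sse name, the persist callback, and the "done" event name are assumptions; the real endpoint would wrap a generator like this in its web framework's streaming response.

```python
from typing import Callable, Iterable, Iterator


def stream_as_sse(
    question: str,
    tokens: Iterable[str],
    persist: Callable[[str, str], None],
) -> Iterator[str]:
    """Yield each token as a server-sent event, then persist the full exchange."""
    parts = []
    for token in tokens:
        parts.append(token)
        # Assumes tokens contain no newlines; multi-line text would need
        # one "data:" line per line of text.
        yield f"data: {token}\n\n"
    # Persist only after streaming completes, so history holds whole answers.
    persist(question, "".join(parts))
    yield "event: done\ndata: end\n\n"


history = []
events = list(
    stream_as_sse(
        "which IAM users have no MFA enrolled?",
        ["No ", "MFA: ", "alice"],
        lambda q, a: history.append((q, a)),
    )
)
```

If the client disconnects mid-stream the generator is never exhausted and nothing is persisted, which matches "immediately after streaming completes" but means partial answers are not recoverable in this sketch.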

What's actually implemented

Least privilege IAM role
The EC2 instance role grants read-only access to exactly the services the collectors poll, plus a narrowly scoped Bedrock InvokeModel permission limited to specific model ARNs — not a broad Bedrock policy.
Prompt grounding
The system prompt instructs the model to ground every claim in retrieved findings and explicitly say when the findings don't contain enough information to answer — rather than filling gaps with plausible-sounding assumptions.
Hallucination mitigation via retrieval
Because the model only ever reasons over retrieved findings, not live AWS access or its own training knowledge, the most direct hallucination risk — inventing a resource or finding that doesn't exist — is structurally constrained by what retrieval actually returns.
No read-only enforcement on the database credential
The application's database user has full read/write access to all tables, not scoped per-table — acceptable for the current single-application design, a real gap if this database were ever shared more broadly.
Secrets in environment variables
API keys and database credentials live in an environment file on the EC2 host, not in a dedicated secrets manager — the same pattern, and the same disclosed limitation, as the other two portfolio projects sharing this infrastructure.
No write access to AWS from the chat path
The chat interface is strictly read-and-reason. There is no code path by which a user's question can trigger a write or delete operation against the underlying AWS account. Collection is the only component that calls the account's service APIs, and it's read-only by IAM policy; the chat path's only AWS calls are Bedrock invocations for embeddings and generation.
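A sketch of what a least-privilege instance role along these lines might contain, written as a Python dict so it can be serialised to policy JSON. REGION and MODEL_ID are placeholders, and the action list is an illustrative read-only subset chosen for this example, not the deployed policy.

```python
import json

# Placeholders throughout: REGION and MODEL_ID are not the project's real values.
INSTANCE_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadOnlyCollection",
            "Effect": "Allow",
            "Action": [
                "iam:GetCredentialReport",
                "iam:ListUsers",
                "s3:ListAllMyBuckets",
                "s3:GetBucketPolicyStatus",
                "ec2:DescribeInstances",
                "guardduty:ListFindings",
                "guardduty:GetFindings",
                "cloudtrail:LookupEvents",
                "ce:GetCostAndUsage",
            ],
            "Resource": "*",
        },
        {
            "Sid": "InvokeSpecificModelsOnly",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            # Scoped to named model ARNs rather than a broad Bedrock grant.
            "Resource": ["arn:aws:bedrock:REGION::foundation-model/MODEL_ID"],
        },
    ],
}

policy_json = json.dumps(INSTANCE_ROLE_POLICY, indent=2)
```

Given the inference-profile requirement described in the fallback-chain decision, the Bedrock statement would likely also need an inference-profile ARN; that is left out here rather than guessed.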

Where this would need to change

Horizontal collectors. The six collectors currently run sequentially on a single schedule. Splitting each into its own independently scheduled job, with its own retry and back-off logic, would stop a slow GuardDuty poll from blocking a fast IAM poll, and would make it straightforward to add new collectors without touching existing ones.
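The per-collector retry logic could be as small as the following sketch. The run_with_backoff name and its defaults are assumptions, and a real version would probably retry only on transient errors (throttling, timeouts) rather than on every exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_backoff(
    collect: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run one collector, retrying failures with exponential back-off."""
    for attempt in range(1, max_attempts + 1):
        try:
            return collect()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; let the scheduler see the failure
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


attempts = {"n": 0}
delays = []


def flaky_guardduty_poll():
    # Stand-in collector that fails twice before succeeding.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("throttled")
    return ["finding-1"]


result = run_with_backoff(flaky_guardduty_poll, sleep=delays.append)
```

Injecting sleep keeps the function testable without waiting, and because each collector would call it independently, a GuardDuty poll stuck in back-off never delays the IAM job.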

Event-driven ingestion. Polling on a fixed schedule means findings are only ever as fresh as the last run. For services that support it (CloudTrail, GuardDuty), moving to EventBridge-triggered ingestion would mean findings appear near-real-time rather than on a 15-minute polling cycle.

Multiple AWS accounts. The current design assumes a single AWS account. Supporting an AWS Organization with many accounts would mean either a per-account collector deployment or a centralised collector assuming cross-account roles — a meaningfully different IAM model than the single-account read-only role used today.

Caching repeated questions. Common questions ("what's our risk posture right now") re-embed and re-search on every ask, even when the underlying findings haven't changed since the last collection run. A cache keyed on findings-version plus question would avoid redundant retrieval and LLM calls for genuinely repeated questions.
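A sketch of the cache key described above, assuming the collection run exposes some version identifier (here a hypothetical string such as "run-42"). The AnswerCache class and the normalisation rule (lower-casing and collapsing whitespace) are illustrative choices, not part of the current system.

```python
import hashlib
from typing import Callable, Dict


def cache_key(findings_version: str, question: str) -> str:
    """Key on the findings version plus a case- and whitespace-normalised question."""
    normalised = " ".join(question.lower().split())
    return hashlib.sha256(f"{findings_version}\n{normalised}".encode("utf-8")).hexdigest()


class AnswerCache:
    def __init__(self) -> None:
        self._answers: Dict[str, str] = {}

    def get_or_compute(
        self, findings_version: str, question: str, compute: Callable[[], str]
    ) -> str:
        key = cache_key(findings_version, question)
        if key not in self._answers:
            # Retrieval and the LLM call happen only on a cache miss.
            self._answers[key] = compute()
        return self._answers[key]


cache = AnswerCache()
llm_calls = []


def expensive_answer() -> str:
    llm_calls.append(1)
    return "answer"


cache.get_or_compute("run-42", "What's our risk posture right now", expensive_answer)
cache.get_or_compute("run-42", "what's our  risk posture right NOW", expensive_answer)
cache.get_or_compute("run-43", "What's our risk posture right now", expensive_answer)
```

Including the findings version in the key means a new collection run invalidates old answers automatically, without any explicit eviction step.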

What this honestly doesn't do yet

Findings are seeded for demo purposes alongside real collection
The system supports both live collection from a real AWS account and a seed script that populates realistic demo findings. For portfolio demonstration purposes, seeded data is used so the chat has rich, specific findings to reason over without depending on a particular AWS account's actual state at demo time.
Single-account scope
Collectors are written against one AWS account with one IAM role. Multi-account support, as noted under "Where this would need to change", would require real architectural changes, not just configuration.
No automated remediation
The system recommends remediation steps in its answers but does not execute them — there's no code path that, for example, actually rotates a credential or removes a public S3 bucket policy. Every action remains a human decision.
Polling interval is fixed, not adaptive
All six collectors run on the same schedule regardless of how frequently each underlying service's data actually changes — Cost Explorer data changes daily; GuardDuty findings can appear at any time. There's no per-service polling cadence today.

If this were going to production tomorrow

Observability on the fallback chain itself. The four-provider LLM fallback is genuine resilience, but there's currently no dashboard showing which provider actually served the last hundred requests, or how often the chain falls past Bedrock. That visibility is the first thing a production on-call engineer would need and the first thing currently missing.
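The missing visibility could start as an in-process counter before it becomes a dashboard. This is a sketch under that assumption; ProviderStats and its method names are hypothetical, and a production version would export these numbers to a metrics system rather than hold them in memory.

```python
from collections import Counter, deque
from typing import Deque, Dict


class ProviderStats:
    """Track which provider served each of the most recent requests."""

    def __init__(self, window: int = 100) -> None:
        # A bounded deque drops the oldest entry once the window is full.
        self._recent: Deque[str] = deque(maxlen=window)

    def record(self, provider: str) -> None:
        self._recent.append(provider)

    def counts(self) -> Dict[str, int]:
        return dict(Counter(self._recent))

    def fallback_rate(self, primary: str = "bedrock") -> float:
        """Share of recent requests served by anything other than the primary."""
        if not self._recent:
            return 0.0
        return sum(p != primary for p in self._recent) / len(self._recent)


stats = ProviderStats()
for provider in ["bedrock", "bedrock", "groq", "bedrock"]:
    stats.record(provider)
```

With the sample data, stats.fallback_rate() is 0.25: one request in four fell past Bedrock.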

Multi-account, multi-tenant architecture. Moving from one AWS account to supporting many customers' accounts means re-architecting the IAM model entirely — cross-account roles, per-tenant data isolation in the findings table, and a customer-facing concept of "which account is this finding from" that doesn't exist today because there's only ever been one.

Collector reliability. Each collector currently runs as a best-effort scheduled job with print-statement logging. Production needs structured logging, retry with back-off on transient AWS API failures, and alerting when a collector silently stops producing findings — distinct from alerting on the findings themselves.

CI/CD and audit trail. Deployment today follows the same manual git pull and systemctl restart pattern as the other two portfolio projects sharing this EC2 host — the same gap, the same fix needed: an automated pipeline with health-check gating before any production rollout.

Performance under real findings volume. The current pgvector setup is untested against the kind of findings volume a genuinely large AWS estate (thousands of resources, months of CloudTrail history) would produce. Index tuning and potentially a different retrieval strategy would need real load testing before that scale, not an assumption that what works today keeps working unchanged.

What this actually taught

Rich data doesn't guarantee rich answers — the prompt has to ask for the reasoning explicitly

The most consequential realisation in this project's history: the seeded findings data was always specific and well-written, but the chat answers still came out flattened and generic until the system prompt was rewritten to explicitly require mechanism-of-risk reasoning, relative prioritisation, business translation, and remediation sequencing. Good data is necessary but not sufficient — the model has to be told what to do with it, in detail, not just told to "be helpful."

Cloud provider account-level gating is a real production concern, not an edge case

Twelve-plus hours of Bedrock being unavailable due to a payment-instrument verification issue, on an account with a genuinely valid payment method, is the kind of failure that doesn't show up in a model's own documentation — it's an artefact of how a specific cloud provider gates access at the account level. Building real resilience meant treating "the AWS account itself might not cooperate" as a first-class failure mode, not just "the model might be down."

Each LLM provider's failure mode is different, and silent failures are the expensive ones

One fallback provider's client code originally assumed every response contained a choices key, and threw an unhelpful KeyError when a rate-limited response didn't — completely obscuring the actual cause until the response shape was inspected directly. The fix, checking the response shape explicitly and surfacing the provider's real error text, is a small code change that materially shortened every subsequent debugging session involving that provider.
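The explicit shape check might look like the sketch below. It assumes an OpenAI-style response body (choices[0].message.content on success, an error object on failure); only the choices key is confirmed by the account above, so the rest of the shape, the ProviderError class, and the extract_text name are assumptions.

```python
from typing import Any, Mapping


class ProviderError(RuntimeError):
    """Raised with the provider's own error text instead of a bare KeyError."""


def extract_text(provider: str, response: Mapping[str, Any]) -> str:
    """Return the first choice's text, or raise with whatever the provider said."""
    choices = response.get("choices")
    if not choices:
        # Surface the real cause (for example a rate-limit message).
        detail = response.get("error", response)
        raise ProviderError(f"{provider} returned no choices: {detail!r}")
    return choices[0]["message"]["content"]


ok = extract_text("example-provider", {"choices": [{"message": {"content": "hi"}}]})

try:
    extract_text("example-provider", {"error": {"message": "rate limited", "code": 429}})
except ProviderError as exc:
    error_text = str(exc)
```

The error message now names the provider and carries its original text, so a rate limit reads as a rate limit rather than as a missing dictionary key.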

What would be redesigned

The intent-detection step that scopes semantic search to a specific service (IAM, cost, etc.) is currently simple keyword matching written directly into the chat handler. A cleaner design would separate intent classification into its own testable component with its own test cases, rather than inline logic that's easy to silently break while changing something else in the same function.
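A sketch of what the separated component could look like: a pure function over a keyword table, testable without the chat handler. The INTENT_KEYWORDS table and the detect_intent name are hypothetical; the real handler's keywords are not reproduced here.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical keyword table, for illustration only.
INTENT_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "iam": ("iam", "mfa", "user", "users", "role", "credential", "login"),
    "cost": ("cost", "bill", "spend", "budget"),
    "guardduty": ("guardduty", "threat"),
    "s3": ("s3", "bucket", "buckets"),
}


def detect_intent(question: str) -> Optional[str]:
    """Return the service with the most keyword hits, or None to search everything."""
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    best, best_hits = None, 0
    for service, keywords in INTENT_KEYWORDS.items():
        hits = sum(1 for kw in keywords if kw in words)
        if hits > best_hits:  # ties keep the earlier service in the table
            best, best_hits = service, hits
    return best
```

Each row of the table becomes an obvious test case, for example asserting that "which IAM users have no MFA enrolled?" maps to "iam" and that an unrelated greeting maps to None.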