Executives usually ask for a “quick” LLM proof-of-concept and then load it with enterprise-grade expectations: perfect data lineage, brand voice controls, SOC 2, the works. The trick is to isolate a thin slice that proves value fast while laying the rails for the real rollout.
Here is the playbook we default to:
- Capture the narrative, not the feature list. We spend the first workshop mapping the job stories (“When a field engineer files a fault, they need…”) and the failure modes (“Hallucinations about part numbers break trust instantly”). This is the raw material for prompt design and evaluation criteria.
- Design the sandbox. Before writing a single prompt we answer: Which data sources are in scope? What’s the maximum acceptable latency? Who is allowed to try the pilot? The constraints drive architecture choices far more than favourite model families.
- Codify evaluation on day one. Every pilot ships with a rubric, even if it is a lightweight scoring spreadsheet or a set of Playwright assertions with golden answers. If we cannot measure improvement, we cannot justify the next phase.
- Instrument everything. Logs, prompt/response pairs, timing, user feedback buttons—whatever it takes to understand behaviour without manual digging.
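To make the "codify evaluation on day one" step concrete, here is a minimal sketch of a golden-answer check. Names like `GoldenCase`, `score`, and the stubbed `fake_generate` are illustrative assumptions, not a real framework—the point is that even a dozen cases like these give you a pass rate to report from the first demo onward.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    prompt: str
    must_contain: str  # fragment the answer must include, e.g. an exact part number

def score(generate: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Return the fraction of golden cases whose answer contains the required fragment."""
    passed = 0
    for case in cases:
        answer = generate(case.prompt)
        if case.must_contain.lower() in answer.lower():
            passed += 1
    return passed / len(cases)

if __name__ == "__main__":
    # Hypothetical golden set targeting the "hallucinated part numbers" failure mode.
    cases = [
        GoldenCase("Which part fixes fault code E42?", "PN-1138"),
        GoldenCase("What is the max operating temp of PN-1138?", "85"),
    ]
    # Stub model; swap in the real pilot endpoint.
    fake_generate = lambda p: "Use part PN-1138 (rated to 85 C)."
    print(f"pass rate: {score(fake_generate, cases):.0%}")
```

Substring matching is deliberately crude—it is enough to catch the worst regressions on day one, and the rubric can graduate to LLM-graded or Playwright-based checks once the pilot stabilises.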
Within 5–7 working days we normally have:
- A secured, single-purpose UI (often SvelteKit + mdsvex for docs like this) deployed on Cloudflare.
- A prompt and retrieval pipeline that consistently pulls from the approved, in-scope data sources—no silent fallbacks to anything out of bounds.
- A confidence dashboard stakeholders can open without asking the team for screenshots.
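The instrumentation that feeds that dashboard can start as something very small: a wrapper that records every prompt/response pair with timing as JSON lines. The file path, field names, and `logged_call` helper below are assumptions for illustration—in practice the events would land wherever your dashboard reads from.

```python
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("pilot_events.jsonl")  # hypothetical sink; the dashboard tails this file

def logged_call(generate, prompt: str, user: str = "anonymous") -> str:
    """Call the model and append a structured event for later aggregation."""
    start = time.perf_counter()
    response = generate(prompt)
    event = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "user": user,
        "prompt": prompt,
        "response": response,
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(event) + "\n")
    return response
```

One append-only JSONL file is usually enough for a one-week pilot; user feedback buttons can write the same shape of event with a `rating` field, so the dashboard aggregates both without a schema migration.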
The pilot either unlocks budget for the production build or proves the idea is not worth the organisational lift. Both outcomes are wins—you either accelerate or you stop pouring time into the wrong bet.