PERSPECTIVES · NAVITEC
Field Note

Retrieval Without Infrastructure

A field note on browser-resident AI, written after spending a weekend putting Chrome's local model to work.

Late one evening I noticed that Chrome had downloaded a model. It hadn't asked my permission and it wasn't running anything. It was just sitting on my disk. The model is Gemini Nano. It is small, on-device, browser-managed, and increasingly available to any web page that asks for it through the Prompt API.

I had been reading the announcement notes for the third or fourth time and a question landed: if Chrome is going to host a local large language model on my machine, the model is not going to live there rent free. What can I put it to work doing?

The answer turned out to be more interesting than I expected. Over the course of a weekend I built a working local-first retrieval-augmented generation system that runs entirely inside the browser. No servers. No vector database. No embeddings API. No cloud inference call. The full pipeline, from indexing pages I was reading to generating cited answers across them, runs in Chrome. It is called ChromeRAG. The code is on GitHub.

This is a field note about what I built, what I learned, and why I think the broader pattern matters for the kind of work I do.

SECTION ONE · THE OBSERVATION

The observation.

Most current AI deployments in regulated organisations assume cloud inference as a default. The architecture diagram has a box labelled "the model", and that box sits in a vendor's data centre. Sometimes the model is OpenAI, sometimes Anthropic, sometimes Azure OpenAI, sometimes a sovereign deployment of an open-weights model in a controlled region. The substance changes. The shape does not.

For most use cases this is fine. For some use cases it is the only viable option. But the assumption has become so total that an alternative pattern has slipped under the radar.

The alternative is this. The browsers most knowledge workers are already using are quietly shipping production-grade local inference. Chrome's built-in Prompt API exposes Gemini Nano directly to web pages. Safari is moving in the same direction. Firefox has its own equivalents in development. The local model is small, but it is sufficient for a meaningful class of work. And it is sitting on hundreds of millions of devices, downloaded but largely unused.

That is a strange situation to leave alone.

SECTION TWO · THE EXPERIMENT

The experiment.

ChromeRAG is built in two forms. The first is a small Node helper plus a static web page, run from localhost. The second, the one that matters, is a Chrome extension that opens as a side panel and indexes pages as you browse them.

The flow is straightforward. You enable the side panel. You click "index current tab" on a page you want to reference later, or you toggle auto-index and let it run silently as you read. The extension's content script extracts the readable text of each page, chunks it, and writes it to chrome.storage.local. Storage is per-profile and browser-managed.
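The chunk-and-store step can be sketched roughly as follows. This is a minimal illustration, not ChromeRAG's actual code: the function names, the word-window chunk size, the overlap, and the `page:` key prefix are all assumptions made for the example. The only browser API used is `chrome.storage.local.set`, which only exists inside an extension context.

```javascript
// Split text into overlapping word windows.
// maxWords and overlap are illustrative defaults, not values taken from ChromeRAG.
function chunkText(text, maxWords = 200, overlap = 40) {
  if (overlap >= maxWords) {
    throw new RangeError("overlap must be smaller than maxWords");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = maxWords - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxWords).join(" "));
    // Stop once the current window has reached the end of the text.
    if (start + maxWords >= words.length) break;
  }
  return chunks;
}

// Hypothetical storage layout: one record per page, keyed by URL.
// Only callable inside an extension, where the `chrome` global is defined.
async function indexPage(url, title, text) {
  const chunks = chunkText(text).map((body, i) => ({
    id: `${url}#${i}`,
    url,
    title,
    text: body,
  }));
  await chrome.storage.local.set({
    [`page:${url}`]: { url, title, indexedAt: Date.now(), chunks },
  });
  return chunks.length;
}
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.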

When you ask a question, a keyword retriever scores every stored chunk against the question terms and returns the top eight matches. Those matches are packed into a prompt and sent to Chrome's Prompt API. Gemini Nano runs the synthesis locally and streams the answer back, and the side panel renders it next to the cited evidence chunks. The original pages are one click away.

There are four prompt modes. Direct question and answer. Executive briefing. Risk extraction. Source comparison. Same retrieval, different framing instructions to the model. A user picks the workflow that fits the question they are asking.
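One plausible way to structure the modes is a lookup table of framing instructions, with retrieval left untouched. The mode keys and the instruction wording below are invented for illustration; only the four mode names come from the description above.

```javascript
// Hypothetical framing instructions, one per mode described above.
const PROMPT_MODES = {
  qa: "Answer the question directly and concisely.",
  briefing: "Write a short executive briefing: a headline, then three to five key points.",
  risks: "List the risks described in the evidence, each with its likely impact.",
  compare: "Compare what the sources say, noting agreements and contradictions.",
};

// Return the framing instruction for a mode, rejecting unknown modes
// rather than silently falling back to a default.
function framingFor(mode) {
  if (!Object.prototype.hasOwnProperty.call(PROMPT_MODES, mode)) {
    throw new Error(`Unknown prompt mode: ${mode}`);
  }
  return PROMPT_MODES[mode];
}
```

Because the retriever never sees the mode, adding a fifth workflow under this design would be a one-line change to the table.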

That is the whole system. The Chrome extension is approximately three hundred lines of JavaScript. There is no backend. There is no API key. The user owns every byte of indexed data and every byte of generated output.

Figure: ChromeRAG v6 side panel open alongside the Navitec article, showing the Prompt API status, indexed source counts, the indexing controls, and a query with cited evidence rendered as the local model synthesises the answer. Local indexing, local retrieval, local synthesis. The browser is the entire stack.

SECTION THREE · THE ARCHITECTURE

Three layers, each one the simplest thing that works.

Extraction. A content script reads the rendered DOM of the active tab, strips navigation and form elements, and extracts the article-shaped text. This is the layer most cloud RAG systems struggle with, because they have to negotiate API access to whatever system holds the source content. ChromeRAG lives inside the user's authenticated browser session, so any page the user can see in Chrome is indexable. SharePoint behind single sign-on. Internal Notion workspaces. Subscription research sites. Customer portals. Government tenant pages. The browser has already done the authentication work. The extension reads what is there.
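A rough sketch of what such an extraction step might look like, assuming a plain selector-based strip rather than a full readability algorithm. The selector list and function name are assumptions for the example, not taken from ChromeRAG.

```javascript
// Hypothetical list of elements treated as page chrome rather than content.
const NOISE_SELECTOR = "nav, header, footer, aside, form, script, style, noscript";

// Takes a Document (in a content script, the global `document`) and returns
// the remaining text with whitespace collapsed.
function extractReadableText(doc) {
  // Work on a detached copy so the live page is never modified.
  const root = doc.body.cloneNode(true);
  for (const el of root.querySelectorAll(NOISE_SELECTOR)) {
    el.remove();
  }
  // Collapse the whitespace runs left behind by the removed markup.
  return (root.textContent || "").replace(/\s+/g, " ").trim();
}
```

In a content script this would be called as `extractReadableText(document)`. Note that `textContent` on a detached clone also includes text hidden by CSS, so a real implementation would probably need a more careful pass for article-shaped pages.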

Storage and retrieval. Indexed chunks are written to chrome.storage.local, a per-profile key-value store. Retrieval is done with a keyword-scoring function: stop words removed, term frequency counted, top results returned. This is not state-of-the-art retrieval. It is BM25's much simpler cousin. For a corpus of fifty to a few hundred pages, which is what a real knowledge worker actually accumulates while researching a problem, keyword retrieval is sufficient. The decision not to use embeddings is deliberate. Embeddings need a model. The model needs to be loaded or called. The call costs something. The decision to skip a layer is often more interesting than the decision to add one.
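The scoring described above can be sketched in a few lines. This is an illustrative version under stated assumptions: the stop-word list is a small invented sample, the chunk shape (`{ text }`) is hypothetical, and the score is a raw count of question-term occurrences with no length normalisation.

```javascript
// Small illustrative stop-word list; a real one would be longer.
const STOP_WORDS = new Set([
  "a", "an", "the", "and", "or", "of", "to", "in", "on", "for", "with", "as", "by",
  "is", "are", "was", "be", "it", "this", "that",
  "what", "which", "how", "does", "do",
]);

// Lowercase, split on non-alphanumerics, drop stop words.
function tokenize(text) {
  return (text.toLowerCase().match(/[a-z0-9]+/g) || []).filter(
    (token) => !STOP_WORDS.has(token)
  );
}

// Score each chunk by how often the question's terms occur in it,
// then return the topK highest-scoring chunks (eight by default).
function retrieve(question, chunks, topK = 8) {
  const terms = new Set(tokenize(question));
  return chunks
    .map((chunk) => {
      let score = 0;
      for (const token of tokenize(chunk.text)) {
        if (terms.has(token)) score += 1;
      }
      return { chunk, score };
    })
    .filter((result) => result.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Raw term counts favour longer chunks and treat every term as equally informative. That is the gap BM25 closes with length normalisation and inverse document frequency, and the simplification is reasonable only because the corpus is small and the chunks are roughly the same size.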

Synthesis. The Prompt API exposes Gemini Nano through a small JavaScript surface. The extension creates a session with a system prompt that tells the model what kind of answer to produce, packs the retrieved evidence into the user prompt with explicit source labels, and streams the result back into the UI. The system prompt enforces three behaviours: answer only from the provided evidence, cite every claim using source labels, admit when the evidence is weak. Those three rules do most of the work of keeping the synthesis honest.
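A sketch of the synthesis step, assuming the Prompt API surface documented for recent Chrome versions: a global `LanguageModel` with `create()`, and sessions with `promptStreaming()` and `destroy()`. That surface has changed between Chrome releases, so treat the call shapes as assumptions to check against the version in use. The prompt wording, the `[S1]` label format, and the function names are invented for the example.

```javascript
// The three behaviours described above, phrased as hypothetical system-prompt rules.
const SYSTEM_PROMPT = [
  "Answer only from the evidence provided.",
  "Cite every claim with its source label, for example [S1].",
  "If the evidence is weak or missing, say so plainly.",
].join("\n");

// Pack retrieved chunks into the user prompt with explicit source labels.
// Each hit is assumed to look like { title, url, text }.
function buildUserPrompt(question, hits) {
  const evidence = hits
    .map((hit, i) => `[S${i + 1}] ${hit.title} (${hit.url})\n${hit.text}`)
    .join("\n\n");
  return `Evidence:\n${evidence}\n\nQuestion: ${question}`;
}

// Stream an answer from the local model, passing each piece to onPiece.
// Assumes the stream yields incremental pieces, as in recent Chrome versions;
// older builds returned the cumulative text instead.
async function synthesise(question, hits, onPiece) {
  if (typeof LanguageModel === "undefined") {
    throw new Error("Prompt API is not available in this browser");
  }
  const session = await LanguageModel.create({
    initialPrompts: [{ role: "system", content: SYSTEM_PROMPT }],
  });
  try {
    let answer = "";
    for await (const piece of session.promptStreaming(buildUserPrompt(question, hits))) {
      answer += piece;
      onPiece(piece);
    }
    return answer;
  } finally {
    // Release the session's resources whether or not streaming succeeded.
    session.destroy();
  }
}
```

Numbering the sources in the prompt is what makes the citations checkable: the UI can map each `[S1]`-style label in the answer back to the chunk and page it came from.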

SECTION FOUR · THE IMPLICATION

The implication.

For the kind of advisory work I do, this matters more than it might appear.

A senior AI advisory engagement is, very often, a research-heavy piece of work. A client sends a deal room. A regulator publishes a consultation document. A board pack is circulated for a meeting in three weeks. The work is to read all of it, find the load-bearing claims, identify what is missing, and write something useful in response.

The dominant industry pitch for tooling this kind of work is a cloud RAG product that connects to the client's SharePoint or Confluence or Notion. The local-first browser pattern is a different shape of solution. The data never leaves the browser. There is no vendor to vet. There is no contract to negotiate. There is no audit log to subpoena. For a financial services client running an internal investigation, or a healthcare organisation reviewing a regulator's draft guidance, the value of "this never left the laptop" is not abstract. It is a different operating model for a class of research-shaped work.

This is not a claim that local-first beats cloud RAG for every use case. Frontier models do things Gemini Nano cannot. Enterprise-scale corpora do not fit in a browser. The Microsoft Graph connector is the right answer when the goal is to make the entire estate searchable. The point is narrower. There is a meaningful slice of high-trust, research-shaped, advisor-style work where the local pattern is operationally different in ways that matter, and the industry default has stopped considering it.

SECTION FIVE · THE RECOMMENDATION

The recommendation.

When scoping AI work for a regulated client, the first question to ask is not "which model" or "which vendor". It is "where does the inference need to happen, on what data, with what compliance path to production".

The honest failure mode is rarely that compliance blocks the pilot. The pilot ships fine. It runs on a sandbox, or a synthetic dataset, or non-sensitive sample documents. The ROI numbers from the pilot look good. The team writes the business case.

Then production needs the real data. And the work to get the real data through the cloud vendor's processing path turns out to require a significant refactor of how the system is configured, what scopes it has, where the data sits, who has access, and how the audit log is maintained. The compliance team is doing their job correctly. They were not consulted at the pilot stage because the pilot did not touch their data. Now they are being asked to approve production deployment of a system whose design decisions were made without their input. The refactor is real. The ROI from the pilot evaporates. The case for the project weakens. The pilot gets shelved or stays in pilot purgatory indefinitely.

This is a pattern I have seen often enough to consider it the default outcome rather than the exception.

The local-first browser pattern does not have this failure mode in the same shape. Not because it is more compliant by nature, but because the production data path and the pilot data path are the same path. There is no refactor between proving the idea and running the idea on real documents. The browser session that indexed the regulator's draft consultation in the pilot is the same browser session that indexes it in production. The compliance posture does not change between the two.

This is not a claim that local-first is the right answer for every advisory engagement. It is a claim that when you scope AI work, the question to hold in mind is not just "does this work technically" but "does the path from this working to this working on real client data require a refactor that will kill the ROI". For cloud RAG that path is often longer than the pilot's optimism allows. For local-first browser RAG, the path is often the same path.

A useful framing for a board conversation: when your team brings you a pilot ROI number, ask what changes when this moves from pilot to production. If the answer involves OAuth scopes, data residency negotiations, DPA reviews, or any meaningful infrastructure work, the pilot ROI is not the production ROI. Adjust expectations accordingly. Or scope the work in a way that does not introduce that gap in the first place.

SECTION SIX · LIMITS

Limits worth naming.

I want to be specific about what ChromeRAG does not do, because every honest write-up of an experiment should name its own limits before someone else does.

Gemini Nano is a small model. It is genuinely capable for summarisation, citation, comparison, and structured extraction across small corpora. It is not capable of frontier-model reasoning, long-context synthesis across thousands of pages, or production-grade drafting of contractual language. The decision to use a local model means accepting the ceiling that comes with it.

The keyword retriever is good enough for the small corpora a research session produces. It is not good enough for organisational-scale knowledge bases. At scale, embeddings or hybrid retrieval would matter.

The Prompt API itself is still in origin trial in many configurations. Availability depends on Chrome version, device capability, and enterprise policy. Building production tooling on it today is premature. Building experiments on it today is not: this is the right time to start.
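Because availability varies this much, an experiment would typically feature-detect before doing anything else. A minimal sketch, assuming the `LanguageModel.availability()` call and its status strings as documented for recent Chrome versions; the `"unsupported"` value and both function names are additions made for this example.

```javascript
// Resolves to "unsupported" when the API is absent, otherwise to the status
// Chrome reports: "unavailable", "downloadable", "downloading" or "available".
async function promptApiStatus() {
  if (typeof LanguageModel === "undefined") return "unsupported";
  return LanguageModel.availability();
}

// Only a fully downloaded model can answer immediately.
function canSynthesise(status) {
  return status === "available";
}
```

A side panel could show this status on load and disable the question box until `canSynthesise` returns true.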

Auto-indexing as you browse is convenient and slightly dangerous. The extension cannot tell the difference between a research article and your online banking. A serious version of this for client work would need a domain allow-list or a manual-only mode.
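A domain allow-list of the kind suggested above could be as small as the sketch below. The domain names are placeholders and the function name is hypothetical. The check matches a listed domain and its subdomains, and rejects anything that is not an http or https page.

```javascript
// Placeholder domains; a real list would come from user or admin configuration.
const ALLOWED_DOMAINS = ["example.org", "research.example.com"];

// True only for http(s) pages on an allowed domain or one of its subdomains.
function isIndexable(pageUrl, allowedDomains = ALLOWED_DOMAINS) {
  let host;
  try {
    const url = new URL(pageUrl);
    if (url.protocol !== "https:" && url.protocol !== "http:") return false;
    host = url.hostname.toLowerCase();
  } catch {
    // Not a parseable URL, so never index it.
    return false;
  }
  return allowedDomains.some(
    (domain) => host === domain || host.endsWith(`.${domain}`)
  );
}
```

Matching on `.${domain}` rather than a bare suffix matters: it keeps a lookalike host such as `evilexample.org` from passing as `example.org`.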

NEXT

The model is on the disk.

This is not a product. It is a working prototype that demonstrates a pattern. The pattern is that the components needed for high-quality retrieval-augmented generation are already on the user's device, in the browser they are already using, with no infrastructure required. The model is already there. The storage is already there. The retrieval is a hundred lines of JavaScript.

The interesting question is not whether ChromeRAG specifically is the right tool. It is what becomes possible when the marginal cost of inference falls to zero, the storage moves into the user's profile, and the deployment ceremony shrinks to "load unpacked extension". That is a different operating model for a class of AI work that has, until recently, been impossible to deploy without significant infrastructure.

The model is on the disk. It is not going to live there rent free.

ChromeRAG v6 is available on GitHub. It is a working prototype, not a supported product.