How to Build a RAG Assistant That Actually Answers From Your Company Docs

A practical, vendor-neutral guide to building an AI assistant grounded in your own handbook, policies, and support history, plus the failure modes to avoid.

Most internal AI assistants fail the same way. Someone wires a chatbot to a language model, points it at a pile of documents, and ships it. Two weeks later the support team stops using it because it confidently invents a refund policy that does not exist. The model was never the problem. The pipeline around it was.

The technique that fixes this is retrieval before generation, usually called retrieval-augmented generation (RAG). The idea is simple. Before the model answers, you go find the relevant passages from your own documents and hand them to the model as context. Done well, you get an assistant that answers from your handbook instead of from the open internet. Done poorly, you get a faster way to be wrong. Here is how the pipeline actually works, and where teams lose the thread.

Start With Ingestion and Cleaning

Everything downstream depends on what you feed in. If your source documents are messy, your answers will be messy, and no amount of clever prompting will save you.

Ingestion is the unglamorous part that decides everything. You pull content from wherever it lives: a policy handbook in PDF, product docs in a wiki, support tickets in a help desk, onboarding guides in shared drives. Each format brings its own mess. PDFs scramble reading order and turn tables into nonsense. HTML drags in navigation menus and cookie banners. Exported tickets carry signatures and disclaimers that add noise.

The cleaning step is where you strip that out:

  • Remove boilerplate like headers, footers, and repeated legal text
  • Preserve real structure such as headings, lists, and tables
  • Capture metadata you will need later: source, author, last-updated date, and access level
  • Drop duplicates and near-duplicates so the same answer does not crowd out everything else

Skip this and you poison the well. Garbage passages retrieved at query time become garbage in the answer.
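To make the cleaning step concrete, here is a minimal sketch of two of those bullets: stripping lines that repeat across pages and dropping duplicates. The function names, the 60 percent threshold, and the `text` key are our own assumptions for illustration, not part of any particular tool. Note that the fingerprint only catches copies that differ in case or whitespace; true near-duplicate detection needs a fuzzier comparison.

```python
import hashlib
import re
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that appear on most pages, which are likely headers or footers."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # Require at least two pages so a single-page document is left alone.
    threshold = max(2, int(len(pages) * min_fraction))
    boilerplate = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in boilerplate]
        cleaned.append("\n".join(kept).strip())
    return cleaned


def fingerprint(text):
    """Hash of case- and whitespace-normalized text."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(docs):
    """Keep the first document for each fingerprint; docs are dicts with a 'text' key."""
    seen = set()
    unique = []
    for doc in docs:
        fp = fingerprint(doc["text"])
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique
```

In practice you would run something like this per source, since a line that is boilerplate in exported tickets may be real content in a handbook.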

Chunking Is Where Most Projects Quietly Break

Language models read context in pieces, so you split each document into chunks. This sounds trivial. It is the single most common place we see projects go wrong.

Chunk too large and retrieval gets vague. A several-thousand-word chunk might technically contain the answer, but it buries one useful sentence in ten paragraphs of unrelated text, and the model loses the thread. Chunk too small and you sever the context. A two-sentence fragment about "the 30 day window" is useless if the surrounding text explaining what that window refers to got cut away.

The practical middle ground is to chunk along the document's natural structure: by section, by heading, by logical unit. A few hundred words per chunk is a reasonable starting point for prose, with some overlap between neighbors so a thought that spans a boundary is not lost. Tables, code, and policy clauses often deserve their own handling. There is no universal number here, which is exactly why you test it rather than guess.
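As a starting point, overlapping chunks inside one section can be sketched like this. It assumes you have already split the document into sections by heading. The defaults of 200 words and 40 words of overlap are placeholders to tune against your own test questions, not recommendations.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into word-based chunks where neighbors share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def chunk_section(heading, body, max_words=200, overlap=40):
    """Chunk one section and keep its heading attached to every chunk."""
    return [
        {"heading": heading, "text": chunk}
        for chunk in chunk_words(body, max_words, overlap)
    ]
```

Carrying the heading with each chunk is one cheap way to keep a fragment about "the 30 day window" attached to the section that says which window it is.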

Embeddings and the Vector Store

Once you have clean chunks, you need a way to find the right ones fast. That is what embeddings and a vector store do.

An embedding turns text into a list of numbers that captures meaning. Two passages about the same topic land near each other in that numeric space, even if they share no exact words. So when a user asks "how long do I have to return something," the system can find a chunk that says "items may be sent back within 30 days," because the meaning matches even though the wording does not.
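"Near each other" is usually measured with cosine similarity. The sketch below uses tiny hand-written vectors purely for illustration. Real embeddings come from an embedding model and have hundreds or thousands of dimensions, and the three vectors here are made up, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vectors standing in for what an embedding model would produce.
query_vec = [0.9, 0.1, 0.0]      # "how long do I have to return something"
returns_vec = [0.8, 0.2, 0.1]    # "items may be sent back within 30 days"
parking_vec = [0.0, 0.1, 0.9]    # an unrelated passage about parking

returns_score = cosine_similarity(query_vec, returns_vec)
parking_score = cosine_similarity(query_vec, parking_vec)
```

With these made-up numbers the returns passage scores far higher than the parking one, which is the behavior a real embedding model is meant to give you for paraphrases.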

You store those embeddings in a vector store, which is a database built to answer one question quickly: given this query, which chunks are closest in meaning? A few things to settle early:

  • Pick one embedding model and stay consistent. Queries and documents must be embedded the same way.
  • Keep metadata next to each vector so you can filter by source, date, or permission.
  • Plan for re-embedding. If you change the embedding model later, you re-embed everything.

The vector store is not a magic box. It returns whatever is closest, relevant or not, which is why the next steps matter.
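To show what those three decisions look like in code, here is a toy in-memory store. It is a sketch for intuition, not a substitute for a real vector database: the class, the `where` filter, and the model name `example-embedder-v1` are all hypothetical, and a production store would use an approximate nearest-neighbor index rather than scoring every entry.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


@dataclass
class Entry:
    chunk_id: str
    vector: list
    metadata: dict = field(default_factory=dict)


class InMemoryVectorStore:
    def __init__(self, embedding_model):
        # Record which model produced the vectors so mixed models are caught early.
        self.embedding_model = embedding_model
        self.entries = []

    def add(self, chunk_id, vector, metadata, embedding_model):
        if embedding_model != self.embedding_model:
            raise ValueError("chunk was embedded with a different model; re-embed it")
        self.entries.append(Entry(chunk_id, vector, metadata))

    def search(self, query_vector, k=3, where=None):
        """Return the ids of the k closest chunks whose metadata matches `where`."""
        where = where or {}
        candidates = [
            e for e in self.entries
            if all(e.metadata.get(key) == value for key, value in where.items())
        ]
        ranked = sorted(candidates, key=lambda e: cosine(query_vector, e.vector), reverse=True)
        return [e.chunk_id for e in ranked[:k]]


store = InMemoryVectorStore(embedding_model="example-embedder-v1")
store.add("returns-1", [0.8, 0.2, 0.1], {"source": "handbook"}, "example-embedder-v1")
store.add("parking-1", [0.0, 0.1, 0.9], {"source": "handbook"}, "example-embedder-v1")
store.add("ticket-7", [0.9, 0.1, 0.0], {"source": "tickets"}, "example-embedder-v1")

handbook_hits = store.search([0.9, 0.1, 0.0], k=2, where={"source": "handbook"})
```

Notice that `handbook_hits` still includes the parking chunk. The store was asked for two results and returned the two closest, whether or not the second one is relevant.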

Retrieval, Grounding, and Citations

This is the part users actually feel. Get it right and the assistant feels trustworthy. Get it wrong and it feels like a confident intern who skimmed the manual.

Retrieval runs before the model writes a word. The user's question gets embedded, the vector store returns the closest chunks, and only then does the model generate an answer using those chunks as its source material. The instruction to the model is blunt: answer from the provided passages, and if they do not contain the answer, say so.
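That blunt instruction can be as plain as the sketch below. The wording of the prompt and the `source` and `text` keys are our own illustration; the exact phrasing that works best depends on the model you use, so treat this as a template to test rather than a recipe.

```python
def build_grounded_prompt(question, passages):
    """Assemble a prompt that confines the model to the retrieved passages.

    passages: list of dicts with hypothetical 'source' and 'text' keys.
    """
    numbered = "\n\n".join(
        f"[{i}] (source: {p['source']})\n{p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered passages below.\n"
        "Cite the passage number, like [1], after each claim.\n"
        "If the passages do not contain the answer, reply exactly: I do not know.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\n"
    )


prompt = build_grounded_prompt(
    "How long do I have to return something?",
    [{"source": "handbook/returns", "text": "Items may be sent back within 30 days."}],
)
```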

Grounding means the answer is tied to real text, not the model's memory. The strongest signal that grounding is working is citations. Every claim should point back to the chunk it came from, ideally with a link to the source document. This does two things at once. It lets users verify the answer themselves, and it gives you a fast way to debug. When an answer is wrong, you look at what was retrieved and usually find the problem in seconds.
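Citations are also easy to check mechanically. The sketch below assumes the model was asked to cite passages with bracketed numbers such as [1], placed before the sentence's final punctuation. That is a convention we are assuming here, not something every model follows. It flags citations that point at passages that were never retrieved, and sentences that carry no citation at all.

```python
import re


def check_citations(answer, num_passages):
    """Report cited passage numbers, out-of-range citations, and uncited sentences."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    invalid = sorted(n for n in cited if not 1 <= n <= num_passages)
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not re.search(r"\[\d+\]", s)]
    return {"cited": sorted(cited), "invalid": invalid, "uncited_sentences": uncited}


report = check_citations(
    "Items may be returned within 30 days [1]. Shipping is free [4]. Refunds are instant.",
    num_passages=2,
)
```

This only checks that a citation exists and points at a retrieved passage. Whether the passage actually supports the claim still needs a human or a separate evaluation step.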

Be wary of over-retrieval. Stuffing a couple dozen marginally related chunks into the prompt does not make answers better. It dilutes the good passages, confuses the model, and runs up cost. Retrieving fewer, sharper chunks almost always beats retrieving more.
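One simple guard against over-retrieval is to cap the number of passages and also drop anything below a similarity floor. The values `k=4` and `min_score=0.3` below are illustrative only. Useful thresholds depend on your embedding model and should come from testing.

```python
def select_passages(scored, k=4, min_score=0.3):
    """Keep at most k passages, best first, and none scoring below min_score.

    scored: list of (similarity, passage) pairs from the vector store.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [(score, passage) for score, passage in ranked if score >= min_score][:k]


scored = [(0.31, "c"), (0.82, "a"), (0.05, "e"), (0.79, "b"), (0.12, "d")]
selected = select_passages(scored, k=3, min_score=0.3)
```

The floor matters as much as the cap: if nothing clears it, the function returns an empty list, and the assistant has a clear signal to say it does not know.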

Access Control, Freshness, and Evals

A working demo is not a working system. Three things separate the two, and all three get skipped under deadline pressure.

Access control has to live in retrieval, not just the interface. If a contractor asks about executive compensation, the fix is not a polite refusal in the chat window. The salary documents should never enter the candidate set in the first place. That means tagging every chunk with a permission level and filtering on the user's identity before retrieval runs. Bolt this on at the end and you will leak something you should not.
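Here is a minimal sketch of filtering before ranking. It models permissions as an `allowed_groups` list on each chunk, which is our assumption for the example. Your system might use permission levels, ACLs from the source system, or something else. The point is the order of operations: filter first, rank second.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_for_user(query_vector, chunks, user_groups, k=3):
    """Rank only the chunks this user is allowed to read."""
    groups = set(user_groups)
    # Filter first: chunks the user cannot read never become candidates.
    # A chunk with no "allowed_groups" tag is treated as unreadable (fail closed).
    candidates = [c for c in chunks if groups & set(c.get("allowed_groups", ()))]
    ranked = sorted(candidates, key=lambda c: dot(query_vector, c["vector"]), reverse=True)
    return ranked[:k]


chunks = [
    {"id": "exec-comp", "vector": [1.0, 0.0], "allowed_groups": ["hr", "executives"]},
    {"id": "expense-policy", "vector": [0.6, 0.4], "allowed_groups": ["employees", "contractors"]},
    {"id": "untagged-draft", "vector": [0.9, 0.1]},
]

# The query vector matches the compensation chunk best, but a contractor never sees it.
contractor_results = retrieve_for_user([1.0, 0.0], chunks, ["contractors"])
```

Failing closed on untagged chunks is a deliberate choice in this sketch: a missing permission tag is more likely an ingestion bug than a signal that the content is public.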

Stale data quietly erodes trust. A policy changes, the document gets updated, but the index still holds last quarter's version, so the assistant cheerfully cites a rule that no longer applies. You need a refresh process: re-ingest on a schedule, or better, update the index when source documents change. Track the last updated date so you can show it and so old chunks can be retired.
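One way to drive a refresh is to compare a content hash of each source document against the hash stored when it was last indexed. The sketch below assumes you keep a mapping from document id to hash alongside the index. The function names and document ids are hypothetical.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(indexed_hashes, current_docs):
    """Decide which documents to re-embed and which to retire.

    indexed_hashes: doc_id -> hash recorded at last indexing.
    current_docs:   doc_id -> current text from the source system.
    """
    current_hashes = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    # New or changed documents need their chunks re-embedded.
    to_upsert = sorted(d for d, h in current_hashes.items() if indexed_hashes.get(d) != h)
    # Documents that vanished from the source should have their chunks removed.
    to_delete = sorted(d for d in indexed_hashes if d not in current_hashes)
    return to_upsert, to_delete


indexed = {
    "returns": content_hash("Returns accepted within 30 days."),
    "parking": content_hash("Park in lot B."),
    "old-policy": content_hash("Superseded."),
}
current = {
    "returns": "Returns accepted within 45 days.",
    "parking": "Park in lot B.",
    "travel": "Book travel through the portal.",
}
to_upsert, to_delete = plan_refresh(indexed, current)
```

Unchanged documents are skipped entirely, which keeps a scheduled refresh cheap enough to run often.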

Evals are how you earn the right to trust it. Before launch and after every meaningful change, run the assistant against a set of real questions with known good answers. Measure whether it retrieved the right passages, whether the answer was grounded, and whether it correctly said "I do not know" when the documents were silent. Without this, you are shipping on vibes, and you will not notice when an update quietly makes things worse. A modest test set, even a few dozen representative questions maintained by hand, catches most regressions.
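A hand-maintained test set can be run with very little machinery. The sketch below scores two of the three measurements: whether the expected passage was retrieved, and whether the assistant abstained when the documents were silent. It assumes your assistant can be called as a function returning the retrieved chunk ids and an abstention flag. That interface, the field names, and the stub assistant are hypothetical. Scoring groundedness is left out because it usually needs human review or a separate judging step.

```python
def run_evals(cases, ask):
    """Score retrieval hits on answerable cases and abstention on unanswerable ones.

    cases: dicts with 'question' and 'expected_chunk_id' (None if docs are silent).
    ask:   callable returning {'retrieved_ids': [...], 'abstained': bool}.
    """
    hits = answerable = 0
    correct_abstentions = unanswerable = 0
    for case in cases:
        result = ask(case["question"])
        expected = case["expected_chunk_id"]
        if expected is None:
            unanswerable += 1
            correct_abstentions += int(result["abstained"])
        else:
            answerable += 1
            hits += int(expected in result["retrieved_ids"])
    return {
        "retrieval_hit_rate": hits / answerable if answerable else None,
        "abstention_accuracy": correct_abstentions / unanswerable if unanswerable else None,
    }


def stub_assistant(question):
    """Stand-in for the real pipeline so the harness can be shown end to end."""
    if "return" in question.lower():
        return {"retrieved_ids": ["returns-1"], "abstained": False}
    return {"retrieved_ids": [], "abstained": True}


cases = [
    {"question": "How long do I have to return something?", "expected_chunk_id": "returns-1"},
    {"question": "What is the parking policy?", "expected_chunk_id": "parking-1"},
    {"question": "Do we reimburse pet insurance?", "expected_chunk_id": None},
]
report = run_evals(cases, stub_assistant)
```

Store the report from each run. A hit rate that drops after a chunking or embedding change is exactly the kind of quiet regression this is meant to catch.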

The honest summary: the model is the easy part. The durable systems are the ones with clean ingestion, sane chunking, citations, real access control, fresh data, and evals you actually run.

At 1 Degree Solutions, we build and ship custom AI products, chatbots, and Alexa skills, including RAG assistants grounded in a company's own documents. If you are weighing one of these for your team, we are happy to talk through what it would take.

Aarav Patel

Engineering notes from a boutique studio.