Blog
Prompt Injection Defenses for On-Premises RAG: Hardening Retrieval-Augmented Generation
How to layer defenses against direct and indirect prompt injection when documents are retrieved and passed to private LLMs, without relying on cloud-only controls.
Why RAG changes the threat model
Retrieval-augmented generation looks like a clean architecture: keep the model on private infrastructure, ground answers in approved corpora, and avoid sending sensitive prompts to external services. In practice, the retrieved text becomes part of the prompt. If an attacker can influence what gets retrieved, or can smuggle instructions into source documents, they can steer the model without ever touching your API gateway.
On-premises deployment removes some cloud supply-chain concerns, but it does not remove application-layer abuse. Teams that only harden network paths while treating retrieved content as trusted data often discover indirect prompt injection during internal red-team exercises or the first serious pilot with untrusted document sources.
Direct versus indirect injection
Direct prompt injection targets the user message or system prompt: a user tries to override policies, extract system instructions, or trigger disallowed tool calls. Standard mitigations include strict role separation, policy prompts, output filtering, and tool allowlists.
Indirect prompt injection hides instructions inside documents, tickets, emails, or web pages that later appear in the retrieval set. The model obediently follows those instructions because, from its perspective, they are just more context. This is especially relevant when RAG pipelines ingest wikis, support threads, or customer uploads where content is not uniformly authored by trusted staff.
Defense requires assuming that any chunk passed to the model might contain adversarial text. That assumption should shape retrieval design, chunk boundaries, metadata handling, and how tools are exposed to the LLM.
Layer one: constrain what retrieval can return
Start by reducing the attack surface at the index. Use explicit document classes and trust tiers: for example, policy manuals in a high-trust index, user-generated content in a segregated index with stricter downstream rules. Vector search alone does not understand trust; your application must pass tier metadata into the prompt assembly step.
Implement chunk hygiene: strip HTML and embedded scripts from web captures, normalize encodings, and avoid concatenating unrelated sources into a single opaque blob. Smaller, well-attributed chunks make it easier to log which document influenced an answer and to apply tier-specific policies.
Where possible, add retrieval-time scoring thresholds and diversity constraints so a single poisoned document cannot dominate the context window through repeated near-duplicate chunks.
Layer two: separate instructions from untrusted evidence
Prompt assembly should make the model’s role unambiguous. A practical pattern is to wrap retrieved material in clearly delimited blocks labeled as untrusted evidence and to state that instructions inside those blocks must not be followed. This is not foolproof, but it measurably reduces successful jailbreaks in internal testing when combined with other controls.
Use structured system prompts maintained as versioned artifacts, not editable strings scattered through services. Pair them with output schemas or constrained decoding where the task allows it, so the model is nudged toward machine-parseable responses that downstream validators can check.
Layer three: tools, identity, and data exfiltration
RAG often sits upstream of agents that call SQL, APIs, or ticketing systems. If the model can be convinced to emit a tool call, indirect injection can pivot from text generation to action. Mitigations include:
Scoped credentials: the runtime identity for tool calls should have the minimum privileges required for the workflow, not broad user impersonation.
Human approval for sensitive tools: align with patterns described in operational agent governance: high-impact actions route through explicit approval queues.
Outbound filtering: block or alert when generated content attempts to include secrets, internal URLs, or attachment patterns that match exfiltration attempts.
Logging should correlate retrieved document IDs, tool invocations, and user sessions so security teams can reconstruct an incident without relying solely on raw prompts.
Testing, monitoring, and ownership
Run periodic adversarial regression suites against staging environments: curated poison documents, benign-looking instructions embedded in tables, and multilingual obfuscation. Track whether policies and filters still hold after model or embedding upgrades.
Product and security ownership should be explicit: who approves new corpora, who can change retrieval parameters, and who signs off when the model or embedding model version changes. On-premises RAG fails less often from missing GPUs and more often from unclear accountability when behavior shifts after a silent configuration change.
Operational metrics worth dashboarding include spikes in refusals, unusual tool-call patterns, retrieval clusters dominated by low-trust sources, and sudden changes in average context length. Pair technical signals with periodic review of representative transcripts so teams notice qualitative drift before users report it.
Putting it together
Effective defense is cumulative. Network segmentation and private hosting establish where inference runs; RAG-specific controls establish whether retrieved text can hijack behavior. Treat document corpora as potentially hostile, separate policy from evidence in prompts, minimize tool blast radius, and prove your posture with repeatable tests whenever the stack changes.