Vector Database Architecture for On-Premises RAG Pipelines
How to select, deploy, and operate a vector database inside your own infrastructure to power retrieval-augmented generation without sending data to the cloud.
Why vector database choice is a first-class infrastructure decision
Retrieval-augmented generation lives or dies on the quality and speed of its retrieval layer. When you run that layer entirely on-premises, the vector database is not just an index—it is a long-lived operational component that must integrate with your backup routines, security controls, and capacity planning processes. Organizations that treat it as a configuration detail rather than an infrastructure commitment regularly hit problems six months into production: unexpected memory pressure, opaque replication semantics, or index formats that cannot be upgraded without downtime.
This post walks through what distinguishes the major self-hosted options, how to design the storage and serving topology, and which operational habits prevent the problems that quietly erode RAG reliability over time.
Mapping the self-hosted landscape
Four vector databases dominate on-premises RAG deployments today: pgvector (a PostgreSQL extension), Qdrant, Milvus, and Weaviate. Each occupies a different position on the complexity-capability curve.
pgvector is the pragmatic default for organizations already running PostgreSQL. You get vector similarity queries alongside relational joins, a single backup target, and familiar tooling. The ceiling is real—pgvector's IVFFlat and HNSW indexes saturate at tens of millions of vectors on a single node before query latency climbs—but most enterprise knowledge bases stay well inside that envelope. If your team already understands PostgreSQL operations, starting here avoids introducing a second storage system prematurely.
Qdrant is written in Rust and built around the HNSW algorithm with a segmented storage model. It handles hundreds of millions of vectors per node credibly, exposes a clean gRPC and REST API, and supports named payload fields for metadata filtering. Its on-disk mode stores vectors outside RAM, enabling large indexes on modest hardware at the cost of slightly higher tail latency.
Milvus separates storage, coordination, and query execution into distinct components. That architecture scales horizontally across many nodes but adds operational complexity: you need etcd for coordination and MinIO or an S3-compatible store for segment persistence. Milvus is the right choice when your organization's retrieval workload is genuinely distributed—multiple teams, multiple collections, high concurrent write throughput—and when you have the platform engineering capacity to operate it.
Weaviate bundles a module system that can call embedding models and rerankers as part of the query pipeline. For teams that want a tighter loop between retrieval and model inference, that integration reduces boilerplate. The tradeoff is that Weaviate's module runtime introduces a dependency on its own container images and release cycle.
Storage and memory topology
Vector indexes are memory-hungry. Flat float32 storage takes 4 bytes per dimension per vector, and an HNSW index adds graph connectivity overhead on top of that. A collection of five million 1536-dimensional embeddings therefore needs roughly 31 GB of RAM for the raw vectors alone (5,000,000 × 1,536 × 4 bytes ≈ 30.7 GB) if it is kept fully in memory for low-latency serving. Plan your node sizing before collections grow, not after.
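A back-of-the-envelope sizing calculation can be sketched as follows. It assumes float32 components (4 bytes each) and approximates HNSW link overhead as about 2 × M four-byte neighbour ids per vector on the base layer, with M = 16 as an illustrative default. The function name and the overhead model are assumptions for illustration, not any particular database's accounting.

```python
def estimate_hnsw_ram_bytes(num_vectors: int, dims: int,
                            bytes_per_component: int = 4, m: int = 16) -> int:
    """Rough RAM estimate: raw vector storage plus base-layer graph links."""
    vector_bytes = num_vectors * dims * bytes_per_component
    # Assumption: ~2*M neighbour ids of 4 bytes each per vector; upper layers ignored.
    link_bytes = num_vectors * 2 * m * 4
    return vector_bytes + link_bytes


total = estimate_hnsw_ram_bytes(5_000_000, 1536)
print(f"{total / 1e9:.1f} GB")  # prints "31.4 GB"
```

Real deployments also need headroom for payload indexes, query buffers, and segment merges, so treat the result as a floor rather than a target.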
Two strategies mitigate memory pressure. First, quantization: scalar quantization to int8 cuts vector memory by about 4x relative to float32, and product quantization can compress further, in each case at the cost of some recall. Both Qdrant and Milvus support this natively; pgvector added half-precision (halfvec) storage and binary quantization in version 0.7.0. Measure recall against your actual queries before assuming the tradeoff is acceptable for your use case.
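One way to measure that tradeoff is to compare exact top-k results against results computed on quantized vectors. The sketch below uses a simple symmetric int8 quantizer and brute-force dot-product search in pure Python. The data is synthetic and the helper names are hypothetical; a real measurement would use your own embeddings, your own queries, and the database's quantized index.

```python
import random


def quantize_int8(vec, max_abs):
    """Symmetric scalar quantization: floats in [-max_abs, max_abs] -> ints in [-127, 127]."""
    scale = 127.0 / max_abs
    return [max(-127, min(127, round(x * scale))) for x in vec]


def top_k(query, vectors, k):
    """Brute-force top-k ids by dot product (exact search, no index)."""
    scored = sorted(
        ((sum(q * v for q, v in zip(query, vec)), i) for i, vec in enumerate(vectors)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]


def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


# Synthetic stand-in data; replace with real embeddings and real queries.
rng = random.Random(0)
vectors = [[rng.uniform(-1, 1) for _ in range(32)] for _ in range(200)]
queries = [[rng.uniform(-1, 1) for _ in range(32)] for _ in range(10)]
quantized = [quantize_int8(v, 1.0) for v in vectors]

recalls = []
for q in queries:
    exact = top_k(q, vectors, 10)
    approx = top_k(quantize_int8(q, 1.0), quantized, 10)
    recalls.append(recall_at_k(exact, approx))

mean_recall = sum(recalls) / len(recalls)
print(f"mean recall@10 with int8: {mean_recall:.2f}")
```

Running the same harness against a held-out set of production queries gives a number you can track across quantization settings.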
Second, tiered storage: keep hot collections (recently indexed documents, high-query-frequency namespaces) in RAM-backed indexes, and move cold collections to disk-backed storage. Qdrant's memmap storage and Milvus's memory-mapped segment storage support this pattern. Automate the tier promotion and demotion logic based on access frequency rather than managing it manually.
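The promotion and demotion decision can be sketched as a small planning function. The tier names, thresholds, and function below are hypothetical. The gap between the two thresholds is a deliberate assumption that keeps a collection hovering near one boundary from flapping between tiers.

```python
def plan_tier_moves(query_counts: dict[str, int],
                    current_tier: dict[str, str],
                    promote_at: int,
                    demote_at: int) -> dict[str, str]:
    """Return {collection: new_tier} for collections that should change tier.

    query_counts holds queries per collection over the last observation window.
    """
    moves = {}
    for name, tier in current_tier.items():
        count = query_counts.get(name, 0)
        if tier == "disk" and count >= promote_at:
            moves[name] = "ram"
        elif tier == "ram" and count <= demote_at:
            moves[name] = "disk"
    return moves


moves = plan_tier_moves(
    query_counts={"handbook": 500, "archive-2019": 2, "wiki": 50},
    current_tier={"handbook": "disk", "archive-2019": "ram", "wiki": "ram"},
    promote_at=100,
    demote_at=10,
)
print(moves)  # {'handbook': 'ram', 'archive-2019': 'disk'}
```

Applying the moves is database-specific and is left out here; the planner only decides what should change.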
For high-availability deployments, replicate across at least two nodes and test failover behavior under realistic load before certifying the topology for production. Vector indexes rebuild slowly after a cold restart; replica lag during write-heavy indexing bursts can cause stale reads if your consistency model is not tuned correctly.
Embedding pipeline integration
The vector database is only one piece of the retrieval pipeline. On-premises RAG also requires an embedding model running locally, a document ingestion service, and a query-time embedding step. The embedding model choice is coupled to the vector database schema: changing the model means re-embedding and re-indexing every document, which can take hours or days for large corpora.
Design your ingestion pipeline around explicit model versioning metadata. Store the embedding model name and revision alongside every vector. When you upgrade the embedding model, run the new and old versions in parallel, re-index incrementally, and switch retrieval traffic only after validating recall on a held-out query set. Systems that skip this discipline end up with mixed-generation indexes that are difficult to diagnose and expensive to repair.
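A minimal sketch of that versioning metadata, assuming each stored record carries the model name and revision next to the vector. The record shape and the model identifiers are hypothetical; in practice these fields would live in the payload or in columns of whichever database you run.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorRecord:
    doc_id: str
    vector: tuple[float, ...]
    model_name: str       # hypothetical identifier, e.g. "example-embedder"
    model_revision: str   # hypothetical revision tag, e.g. "r2"


def generation_counts(records):
    """Count records per (model_name, model_revision) pair."""
    return Counter((r.model_name, r.model_revision) for r in records)


def is_mixed_generation(records):
    """True if the index holds vectors from more than one model generation."""
    return len(generation_counts(records)) > 1


records = [
    VectorRecord("doc-1", (0.1, 0.2), "example-embedder", "r1"),
    VectorRecord("doc-2", (0.3, 0.4), "example-embedder", "r2"),
    VectorRecord("doc-3", (0.5, 0.6), "example-embedder", "r2"),
]
print(generation_counts(records))
print(is_mixed_generation(records))  # True
```

Running a check like this before switching retrieval traffic confirms that the incremental re-index has actually finished.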
Inference throughput for embedding is often the bottleneck, not the vector database itself. Batch ingestion jobs should group documents into large batches (32 to 128 documents per call, depending on your model's maximum sequence length and the memory available on the worker) and use a dedicated GPU or CPU worker pool separate from the interactive inference path. This prevents document ingestion from competing with user-facing queries for compute.
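The batching step itself is small. The generator below is a sketch with a hypothetical name; the embedding call is left out because it depends on the model server you run.

```python
from typing import Iterable, Iterator


def batch_documents(docs: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


docs = [f"document {i}" for i in range(100)]
sizes = [len(b) for b in batch_documents(docs, 32)]
print(sizes)  # [32, 32, 32, 4]
```

Because it is a generator, the ingestion worker never holds more than one batch of raw text in memory at a time.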
Multi-tenancy and access control
Enterprise RAG deployments almost always need to isolate data between teams, products, or customer segments. Vector databases handle this differently, and the choice has security implications.
The strongest isolation is a separate collection per tenant. A tenant's documents are physically separated; a misconfigured query cannot cross boundaries. The operational cost is proportional to the number of collections: index memory, backup jobs, and schema migrations multiply. This model fits well when tenants number in the dozens and have meaningfully different access patterns.
Shared collections with payload-based filtering reduce operational overhead but require that your application layer enforces tenant identifiers on every query. If a caller omits a tenant filter or constructs one incorrectly, cross-tenant data leakage becomes possible. Audit your application code carefully, and add a gateway layer that injects tenant context from authenticated session data rather than relying on clients to supply it.
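A gateway of that kind can be sketched as a function that rewrites every incoming query before it reaches the database. The session and query shapes here are hypothetical plain dictionaries, not any specific database's filter syntax. The point is that the tenant identifier comes only from the authenticated session and overrides anything the client sent.

```python
def scope_query_to_tenant(session: dict, client_query: dict) -> dict:
    """Return a copy of client_query whose filter is pinned to the session's tenant."""
    tenant_id = session.get("tenant_id")
    if not tenant_id:
        # Fail closed: no tenant context means no query.
        raise PermissionError("authenticated session has no tenant_id")
    scoped = dict(client_query)
    client_filter = client_query.get("filter", {})
    # The session value is written last, so a client-supplied tenant_id is overwritten.
    scoped["filter"] = {**client_filter, "tenant_id": tenant_id}
    return scoped


scoped = scope_query_to_tenant(
    session={"user": "alice", "tenant_id": "team-a"},
    client_query={"top_k": 5, "filter": {"tenant_id": "team-b", "lang": "en"}},
)
print(scoped)  # {'top_k': 5, 'filter': {'tenant_id': 'team-a', 'lang': 'en'}}
```

Failing closed when the session lacks a tenant is a design choice: an unscoped query against a shared collection is exactly the leak this layer exists to prevent.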
Qdrant supports named payload fields as first-class filter expressions. Milvus partition keys map to storage partitions and can improve filter performance when cardinality is bounded. pgvector relies on standard PostgreSQL row-level security policies, which integrate naturally with existing database authentication controls.
Observability and operational hygiene
Vector databases emit metrics that most infrastructure monitoring stacks do not collect by default. At minimum, instrument query latency by collection, recall estimation via synthetic benchmark queries on a schedule, index size and segment count, write queue depth, and replica lag. These metrics tell you whether the retrieval layer is healthy long before users report degraded answer quality.
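As one example of the first metric, per-collection tail latency can be computed from raw query timings with a nearest-rank percentile. The sample format and function name are assumptions for illustration; in production these numbers would normally come from your metrics system rather than an in-process list.

```python
from collections import defaultdict


def latency_percentile_by_collection(samples, percentile=95):
    """samples: iterable of (collection_name, latency_ms) pairs.

    Returns {collection_name: latency at the given percentile}, using nearest rank.
    """
    grouped = defaultdict(list)
    for collection, latency_ms in samples:
        grouped[collection].append(latency_ms)
    result = {}
    for collection, values in grouped.items():
        values.sort()
        # Integer ceiling of percentile * n / 100, with a floor of rank 1.
        rank = max(1, (percentile * len(values) + 99) // 100)
        result[collection] = values[rank - 1]
    return result


samples = [("docs", float(ms)) for ms in range(1, 101)] + [("wiki", 12.0), ("wiki", 40.0)]
p95 = latency_percentile_by_collection(samples)
print(p95)  # {'docs': 95.0, 'wiki': 40.0}
```

Tracking the percentile per collection rather than globally keeps one slow, rarely queried collection from hiding behind a fast, busy one.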
Segment compaction and background index optimization are common sources of latency spikes in Milvus and Qdrant. Schedule heavy compaction jobs during low-traffic windows and configure resource limits so they cannot starve query serving threads. In pgvector, VACUUM and index maintenance interact with the normal PostgreSQL autovacuum schedule; review autovacuum settings for tables that receive high insert rates during document ingestion bursts.
Include the vector database in your standard backup and recovery testing cadence. A cold restore of a large collection takes longer than most teams expect. Measure it before you need it, and document the recovery time objective alongside your other infrastructure SLAs.