Giving Your AI a Memory: Building a Local Knowledge Base with pgvector and MCP
I use Claude Code for most of my development work these days, especially investigations and quick-and-dirty spikes. For the most part, it's a huge help – until you realise it has no idea what you talked about yesterday, and Claude.ai has no idea what I did in Claude Code. Every session starts from zero, or with me throwing lots of reference material at it and filling the context.
This is not a new complaint. Everyone building on LLMs hits the same wall: the context window is finite, and there's no persistence between sessions. My friend and collaborator Kristoffer Nordström had been building exactly this kind of system (he's since written about it here) and hearing him talk about it while it was still in progress was the immediate kick I needed to stop thinking about it and start building. His approach is different from mine (org-mode flat files, LanceDB, local GPU embedding), but the core frustration is identical: having to dump random stuff into the context to make the LLM helpful.
Where he optimised for his own workflow and automation/assistant functions, I went the other direction and focused on packaging and doing one thing well. I wanted something anyone could pipx install and have running in five minutes. That meant PostgreSQL instead of bespoke file formats, a plugin architecture for data sources, a CLI that handles the Docker container for you, and an MCP server that works out of the box. While this means anyone can pick it up and use it, the real motive is that I'll be able to continue development even after a one-year hiatus, because there are docs, tests, and conventions to come back to.
So I built OKB, the "Owned Knowledge Base" (ok, it's a lame backronym, but I needed a free package name). It's a local-first semantic search system backed by PostgreSQL with pgvector, exposed via MCP over stdio or HTTP (with token auth).
MCP
Model Context Protocol is an open standard for giving LLMs access to external tools. The LLM decides when to call a tool and what to search for. You provide the tools and their descriptions; the model does the rest.
The interesting bit for a knowledge base: because MCP is model-agnostic, the same server acts as shared memory across different tools. I use the same OKB instance from Claude Code, Claude.ai, Le Chat, and local models, and I could point Cursor and Copilot at it too. With the HTTP MCP server, I share a knowledge base with my wife (topic: farming).
The architecture
The core is extremely simple; you just need three things: a way to turn documents into vectors, a place to store those vectors and an index to search them, and a way to expose that search to an LLM.
PostgreSQL + pgvector
I went with PostgreSQL and the pgvector extension rather than a dedicated vector database. This was a pragmatic choice - I already know Postgres, I already run Postgres, and pgvector is good enough for personal-scale data. We're talking thousands to tens of thousands of documents, not hundreds of millions. At that scale, Postgres handles it fine and you get the bonus of being able to do regular SQL queries alongside vector search.
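To make that concrete, here's a minimal sketch of what a pgvector-backed chunk table and similarity query can look like, using the pgvector Python package with psycopg2. The table layout, database name, and 384-dimension embedding are illustrative assumptions, not OKB's actual schema:

# Minimal sketch of a pgvector chunk store; layout and dimension are
# illustrative assumptions, not OKB's actual schema.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=okb")  # hypothetical database name
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)  # lets us pass numpy arrays as vector parameters

with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id        bigserial PRIMARY KEY,
            document  text,
            content   text,
            embedding vector(384)  -- dimension depends on the embedding model
        )
    """)
conn.commit()

def search(query_embedding: np.ndarray, limit: int = 5):
    # <=> is pgvector's cosine-distance operator; smaller means more similar
    with conn.cursor() as cur:
        cur.execute(
            "SELECT document, content FROM chunks"
            " ORDER BY embedding <=> %s LIMIT %s",
            (query_embedding, limit),
        )
        return cur.fetchall()

The "regular SQL alongside vector search" bonus is exactly this: the chunks table is a plain Postgres table you can join, filter, and back up like any other.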
OKB manages its own Docker container with pgvector pre-installed. okb db start and you're running. The schema supports multiple named databases, which matters when you're a consultant who doesn't want client notes leaking into personal queries, or when you don't want your development work leaking into your farming projects.
Contextual/metadata-enriched chunking
This took me a while to get right, because the tutorials I found assumed knowledge a poor generalist like me just doesn't have. Apparently you can't just dump an entire document into one vector, since embedding models have token limits, and longer texts produce worse embeddings. So you chunk. The naive approach is to split on a fixed character count, but that loses all structure.
OKB does contextual chunking: each chunk carries metadata about where it came from.
Document: Django Performance Notes
Project: student-app # inferred from path or frontmatter, or set in an MCP call
Section: Query Optimization # extracted from e.g. org/markdown headers
Topics: django, performance # from frontmatter tags
Content: Use select_related() to avoid N+1 queries...
The project and section context means that when you search for "N+1 queries in the student
app", the chunk about select_related() ranks higher than a generic Django tutorial note
because the metadata matches, not just the content. Frontmatter tags in your markdown files
feed directly into this, so the more structured your notes are, the better the retrieval gets.
It also means that normal READMEs for projects are perfect for feeding in, as long as they're
up to date…
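As a sketch of the idea (the exact header format here is my illustration, not necessarily OKB's):

# Sketch of metadata-enriched chunk text; the header format is illustrative.
def contextual_text(document: str, project: str, section: str,
                    topics: list[str], content: str) -> str:
    header = (
        f"Document: {document}\n"
        f"Project: {project}\n"
        f"Section: {section}\n"
        f"Topics: {', '.join(topics)}\n"
    )
    # The combined header + content string is what gets embedded, so the
    # metadata influences the vector, not just the prose.
    return header + content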
I'm using a default chunk size of 512 tokens with 64-token overlap. Those numbers are tunable, and I arrived at them through (very brief) experimentation rather than science. Smaller chunks give more precise retrieval but lose context; bigger chunks are more self-contained but dilute the embedding. 512 is largeish, and seems to work for me.
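The windowing itself is simple; a sketch, assuming the text has already been split into tokens by some tokenizer (a plain word list would do for illustration):

# Sketch of fixed-size chunking with overlap, using the defaults above.
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64):
    step = size - overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + size]
        if start + size >= len(tokens):  # last window reached the end
            break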
Embeddings
To turn text into vectors, you need an embedding model. Running one locally works but it's slow for bulk operations, and gets painful for ingesting thousands of documents or source files or my entire Slack history. Some buy a cabinet to put more GPUs in, but that's both expensive and impractical when spending half the year in a van.
My solution was to split the workload: batch ingestion runs on a GPU via Modal, while query-time embedding runs locally. Modal spins up a T4 GPU (configurable) on demand, processes the batch, and shuts down. Cost is roughly $0.02 per thousand chunks. For the kind of volume I'm dealing with, that's negligible.
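For illustration, a Modal batch-embedding function might look roughly like this; the app name and model choice are my assumptions, not OKB's actual pipeline:

# Sketch of on-demand GPU embedding with Modal; app name and model are
# illustrative, not OKB's actual code.
import modal

image = modal.Image.debian_slim().pip_install("sentence-transformers")
app = modal.App("embed-sketch", image=image)

@app.function(gpu="T4")  # container spins up on demand, shuts down after
def embed_batch(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts).tolist()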
# Batch ingest with Modal GPU
okb-admin ingest ~/notes ~/docs
# Or force local embedding (uses CUDA if available, falls back to CPU)
okb-admin ingest ~/notes --local
The cold start on Modal is the main annoyance: the first call after an idle period takes 10-20 seconds while the container spins up, but after that it's fast. For the query path this doesn't matter, because local embedding of a single short query is near-instant anyway.
I could have skipped Modal entirely: --local uses your GPU via CUDA if you have one, falling back to CPU if you don't. But I work exclusively from a 14-inch laptop, and while it is a fast laptop it's not, you know, fast fast, so for me Modal is almost necessary unless I want to spend my time waiting for embeddings on every bulk import. If you've got a proper workstation or server with a CUDA card, configuring Modal is probably overkill for embeddings.
It's worth noting that Kristoffer landed on the same hybrid approach independently, and his system combines BM25 keyword matching with vector similarity via LanceDB.
Indexing
pgvector supports two index types for approximate nearest-neighbour search: IVFFlat and HNSW. I use HNSW, because it doesn't need rebuilding when new documents are added. It seems to Just Work, so I don't think about the index at all – I'll admit I don't know much more about these than some quick DuckDuckGo-ing gave me.
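The index itself is one DDL statement; a sketch, continuing the chunks table from earlier, with pgvector's default HNSW parameters spelled out:

# Sketch: HNSW index on the chunks table; m and ef_construction shown
# are pgvector's documented defaults.
import psycopg2

conn = psycopg2.connect("dbname=okb")  # hypothetical database name
with conn.cursor() as cur:
    cur.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """)
conn.commit()

Note that vector_cosine_ops matches the <=> cosine-distance operator used at query time; unlike IVFFlat, the graph is built incrementally as rows are inserted.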
Connecting it via MCP
OKB exposes itself as an MCP server either over stdio (for local tools like Claude Code) or over HTTP with token auth (for browser-based tools or remote setups).
The local configuration (Claude used as example):
{
  "mcpServers": {
    "knowledge-base": {
      "command": "okb-admin",
      "args": ["serve"]
    }
  }
}
That's it. Any MCP-capable tool now has access to search_knowledge, keyword_search,
hybrid_search, save_knowledge, and a growing list of others. The tool descriptions tell the LLM
what each one does, and the model decides when to use them.
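To give a flavour of what declaring such a tool involves, here's a sketch using the official MCP Python SDK's FastMCP helper. This is illustrative only – OKB's actual tool code will look different, and run_vector_search is a hypothetical stand-in for the real backend:

# Sketch of an MCP tool declaration with the official Python SDK's FastMCP.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("knowledge-base")

def run_vector_search(query: str, limit: int) -> str:
    # Hypothetical placeholder for the actual pgvector query
    return f"(results for {query!r}, top {limit})"

@mcp.tool()
def search_knowledge(query: str, limit: int = 5) -> str:
    """Semantic search over the user's personal notes and documents."""
    # The docstring above is the tool description the LLM reads when
    # deciding whether (and how) to call this tool.
    return run_vector_search(query, limit)

if __name__ == "__main__":
    mcp.run()  # stdio transport by default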
For browser-based or remote access, OKB runs as an HTTP server:
okb token create --db personal -d "browser"
# → okb_personal_rw_a1b2c3d4...
okb serve --http --host 0.0.0.0 --port 8080
This is how my wife and I share a farming KB on the same server, each with our own (RW) tokens and our own LLM accounts connecting to it. It's also how the same knowledge base serves both my local coding sessions and my browser-based conversations.
The save_knowledge tool was the killer feature for me: when an LLM learns something useful during a conversation – a decision, a summary, a synthesised insight – it can save it back to the knowledge base, so that next session the information is searchable.
I also added trigger_sync so the LLM can pull in fresh data from external sources – GitHub issues, Todoist tasks, Slack messages, Dropbox Paper documents – without me having to run a cron job or remember to re-ingest. All of those syncs are incremental, so it's reasonably cheap to just run a full resync once in a while, at least at my volumes.
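The incremental part is just the standard watermark pattern; a generic sketch of the idea, not OKB's actual plugin interface:

# Generic watermark-based incremental sync; a sketch of the pattern only.
from datetime import datetime, timezone
from typing import Any, Callable, Iterable

def incremental_sync(fetch_changed_since: Callable[[datetime], Iterable[Any]],
                     ingest: Callable[[Any], None],
                     last_run: datetime) -> datetime:
    now = datetime.now(timezone.utc)
    for item in fetch_changed_since(last_run):  # only items changed since last run
        ingest(item)
    return now  # persist as the watermark for the next sync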
What it's actually like to use
It works, and the rate of 'minor improvements' I add to remove warts and support my workflows has slowed to a bare trickle. Having the LLM automatically search my notes when I ask about something we discussed previously makes it genuinely more useful, and I no longer maintain a separate document of "things to explain at the start of each session." The MCP integration means it happens with no interruption: I just ask a question, the model searches, and I get an answer grounded in my actual notes. At the amount of data I have, the semantic search is surprisingly reliable – I can't recall a case where it failed to surface a relevant document that existed.
The multi-database setup turned out to matter more than I expected, but not for the reason I originally built it. Yes, keeping personal and work separate is good hygiene. But the real win is that my wife and I share a farming database – we're running a small operation, and she has her own Claude.ai setup pointed at the same KB via the HTTP server. She might add notes about crop planning and tools, I add notes about our apple tree growing project, and both our Claudes can search across all of it. Shared memory for a shared project, without either of us having to forward emails or maintain a document.
Some of the most valuable documents in my knowledge base are the syntheses the LLM created – summaries that connect dots across multiple source documents. OKB has a synthesis pipeline where the LLM proposes connections across your notes, and you approve or reject them before they enter the knowledge base. LLMs are quite good at noticing patterns you've never explicitly connected, and the save_knowledge tool means those insights persist. Synthesis is less reliably 'right' than plain store-and-retrieve, but it's a good way to distil more focused knowledge from a large corpus. I started out trying unguided LLM-driven "knowledge enrichment" and "entity extraction", but I could never get the hallucinations down to an acceptable level.
Would I recommend building this?
If you're using an MCP-capable LLM regularly and you have a meaningful collection of notes, documents, or project knowledge, then yep! The core (pgvector + chunking + MCP server) is a fun project; only the quality-of-life features (sync plugins, synthesis, multi-database) took weeks of using-building-testing-discarding-rebuilding and started to feel like work.
OKB itself is open source and installable via pipx if you'd rather just use it than build your own. But the interesting part isn't my specific implementation; it's the pattern: a local database, a chunking strategy that preserves context, an embedding pipeline that works with or without a permanent GPU, and an MCP server that lets the LLM drive the retrieval.
I can't really work with an LLM anymore without the KB - with it, the LLM moves from smart autocomplete with hallucinations and amnesia towards something more like an over-eager PA with perfect memory (and hallucinations). Even if it can't bring me coffee (Kristoffer's probably can soon), it's still a huge help.
The code is at github.com/haard/okb. Install with pipx install okb. If you build something similar or have questions, I'm @motmanniska@mastodon.nu.