Giving Your AI a Memory: Building a Local Knowledge Base with pgvector and MCP

I use Claude Code for most of my development work these days, especially investigations and quick-and-dirty spikes. For the most part it's a huge help – until you realise it has no idea what you talked about yesterday, and Claude.ai has no idea what I did in Claude Code. Every session starts from zero, or with me throwing lots of reference material at it and filling the context window.

This is not a new complaint. Everyone building on LLMs hits the same wall: the context window is finite, and there's no persistence between sessions. My friend and collaborator Kristoffer Nordström had been building exactly this kind of system – he's since written about it here – and hearing him talk about it while it was still in progress was the immediate kick I needed to stop thinking about it and start building. His approach is different from mine (org-mode flat files, LanceDB, local GPU embedding – read his post, it's good), but the core frustration is identical: every new session starts from zero, and the cost of re-loading context into an AI is death by a thousand paper cuts.

Where he optimised for his own workflow and automation/assistant functions, I went the other direction and focused on packaging and doing one thing well. I wanted something anyone could pipx install and have running in five minutes. That meant PostgreSQL instead of bespoke file formats, a plugin architecture for data sources, a CLI that handles the Docker container for you, and an MCP server that works out of the box. While this means anyone can pick it up and use it, the real motive is that I'll be able to pick development back up after a one-year hiatus, because there are docs, tests, and conventions to fall back on.

So I built OKB – Owned Knowledge Base (ok, it's a lame backronym). It's a local-first semantic search system backed by PostgreSQL with pgvector, exposed via MCP – either locally over stdio or over HTTP with token auth.

MCP

MCP – Model Context Protocol – is an open standard for giving LLMs access to external tools. The LLM decides when to call a tool and what to search for. You provide the tools and their descriptions; the model does the rest. Most major LLMs support it now.

The interesting bit for a knowledge base: because MCP is model-agnostic, the same server acts as shared memory across different tools. I use the same OKB instance from Claude Code, Claude.ai, and there's nothing stopping me from pointing Cursor or Copilot at it too. With the HTTP MCP server, I share a knowledge base with my wife (topic: farming).

The architecture

The core is extremely simple. You need three things: a way to turn documents into vectors, a place to store and search those vectors, and a way to expose that search to an LLM.

PostgreSQL + pgvector

I went with PostgreSQL and the pgvector extension rather than a dedicated vector database. This was a pragmatic choice – I already know Postgres, I already run Postgres, and pgvector is good enough for personal-scale data. We're talking thousands to tens of thousands of documents, not hundreds of millions. At that scale, Postgres handles it fine and you get the bonus of being able to do regular SQL queries alongside vector search.
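For intuition, pgvector's `<=>` operator computes cosine distance, and a similarity search is just an ORDER BY over that distance. Here's a pure-Python sketch of the same ranking over a toy "table" of embeddings – illustrative only, not OKB code:

```python
import math

def cosine_distance(a, b):
    # What pgvector's <=> operator computes: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Toy chunk embeddings; in SQL this would be roughly
#   SELECT content FROM chunks ORDER BY embedding <=> $query LIMIT 5
chunks = {
    "django select_related note": [0.9, 0.1, 0.0],
    "apple tree pruning note": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
ranked = sorted(chunks, key=lambda name: cosine_distance(chunks[name], query))
```

Because the vectors live in an ordinary table, filtering by project or date is just a WHERE clause on the same query – that's the "regular SQL alongside vector search" bonus.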

OKB manages its own Docker container with pgvector pre-installed. okb db start and you're running. The schema supports multiple named databases – I keep personal and work separate – which matters when you're a consultant who doesn't want client notes leaking into personal queries, or don't want your development work to leak into your farming projects.

Contextual chunking

This took me a while to get right, because the tutorials I found assumed knowledge a poor generalist like me just doesn't have. Apparently you can't just dump an entire document into one vector – embedding models have token limits, and longer texts produce worse embeddings. So you chunk. The naive approach is to split on a fixed character count, but that loses all structure.

OKB does contextual chunking: each chunk carries metadata about where it came from.

Document: Django Performance Notes
Project: student-app          # inferred from path or frontmatter, or set in an MCP call
Section: Query Optimization   # extracted from e.g. org/markdown headers
Topics: django, performance   # from frontmatter tags
Content: Use select_related() to avoid N+1 queries...

The project and section context means that when you search for "N+1 queries in the student app", the chunk about select_related() ranks higher than a generic Django tutorial note – because the metadata matches, not just the content. Frontmatter tags in your markdown files feed directly into this, so the more structured your notes are, the better the retrieval gets. It also means that normal READMEs for projects are perfect for feeding in, as long as they're up to date…
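The mechanics are simple: prepend the metadata to the chunk text before embedding, so the context is baked into the vector itself. A sketch of the idea – function and parameter names here are illustrative, not OKB's actual internals:

```python
def contextualize(chunk: str, doc: str, project: str,
                  section: str, topics: list[str]) -> str:
    # Bake document context into the text that gets embedded, so a query
    # mentioning the project or topic pulls this chunk up the ranking
    header = (
        f"Document: {doc}\n"
        f"Project: {project}\n"
        f"Section: {section}\n"
        f"Topics: {', '.join(topics)}\n"
    )
    return header + chunk

text = contextualize(
    "Use select_related() to avoid N+1 queries...",
    doc="Django Performance Notes",
    project="student-app",
    section="Query Optimization",
    topics=["django", "performance"],
)
```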

I'm using a default chunk size of 512 tokens with 64-token overlap. Those numbers are tunable, and honestly I arrived at them through experimentation rather than science. Smaller chunks give more precise retrieval but lose context; bigger chunks are more self-contained but dilute the embedding. 512 felt right for the kind of notes I write, and because I mainly use expensive LLMs with (comparatively) large contexts.
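In token terms, that sliding window looks roughly like this – a sketch of the strategy, not OKB's implementation:

```python
def chunk_tokens(tokens: list, size: int = 512, overlap: int = 64) -> list:
    # Step by size - overlap so consecutive chunks share `overlap` tokens,
    # keeping content that straddles a boundary retrievable from both sides
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

chunks = chunk_tokens(list(range(1000)))
# the last 64 tokens of each chunk reappear at the start of the next
```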

Embeddings: the GPU problem

To turn text into vectors, you need an embedding model. Running one locally works but it's slow for bulk operations – fine for searching (one query at a time), painful for ingesting thousands of documents or source files. Some buy a cabinet to put more GPUs in, but that's both expensive and impractical when spending half the year in a van.

My solution was to split the workload: batch ingestion runs on a GPU via Modal, while query-time embedding runs locally. Modal spins up a T4 GPU (configurable) on demand, processes the batch, and shuts down. Cost is roughly $0.02 per thousand chunks. For the kind of volume I'm dealing with – re-ingesting my notes after a schema change, or bulk-importing a new source – that's negligible.

# Batch ingest with Modal GPU
okb-admin ingest ~/notes ~/docs

# Or force local embedding (uses CUDA if available, falls back to CPU)
okb-admin ingest ~/notes --local

The cold start on Modal is the main annoyance – first call after a while takes 10-20 seconds while the container spins up. After that it's fast. For the query path this doesn't matter since local embedding of a single short query is near-instant anyway.

Could I skip Modal entirely? Yes, absolutely. --local uses your GPU via CUDA if you have one, and falls back to CPU if you don't. I work exclusively from a 14-inch laptop – a fast laptop, but you know, not fast fast – so for me Modal is almost necessary unless I want to spend my time waiting for embeddings. If you've got a proper workstation with a CUDA card, --local might be all you need (or just don't configure Modal, it'll fall back to local).
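The routing logic boils down to something like this – stand-in functions only; OKB's actual Modal integration is more involved:

```python
def embed_local(texts):
    # Stand-in for a local embedding model (CUDA if available, else CPU)
    return [[float(len(t)), 0.0] for t in texts]

def embed_remote(texts):
    # Stand-in for a Modal function running the same model on a T4
    return embed_local(texts)

def embed(texts, force_local=False, batch_threshold=100):
    # Interactive queries are one short string: always cheap locally.
    # Bulk ingestion is thousands of chunks: worth a remote GPU spin-up.
    if force_local or len(texts) < batch_threshold:
        return embed_local(texts)
    return embed_remote(texts)

vectors = embed(["N+1 queries in the student app"])
```

The threshold is the only real decision: below it, the Modal cold start costs more than it saves.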

It's worth noting that Kristoffer landed on the same hybrid approach independently – his system combines BM25 keyword matching with vector similarity via LanceDB. When two people solving the same problem both end up with hybrid search, that's probably a signal.

HNSW indexing

pgvector supports two index types for approximate nearest-neighbour search: IVFFlat and HNSW. I use HNSW because it builds incrementally – unlike IVFFlat, it doesn't need rebuilding as documents are added.

In practice, I don't think about the index at all – and I don't know much more about these than some quick DuckDuckGo-ing gave me.

Connecting it via MCP

OKB exposes itself as an MCP server – either over stdio (for local tools like Claude Code) or over HTTP with token auth (for browser-based tools or remote setups).

The local configuration is minimal:

{
  "mcpServers": {
    "knowledge-base": {
      "command": "okb",
      "args": ["serve"]
    }
  }
}

That's it. Any MCP-capable tool now has access to search_knowledge, keyword_search, hybrid_search, save_knowledge, and a growing list of others. The tool descriptions tell the LLM what each one does, and the model decides when to use them.
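Under the hood, what the model sees for each tool is a name, a description, and a JSON Schema for the arguments. A sketch of what a search tool declaration might look like on the wire (the description text and parameters here are illustrative, not OKB's exact schema):

```json
{
  "name": "search_knowledge",
  "description": "Semantic search over the user's personal knowledge base. Use when the user refers to prior notes, decisions, or past conversations.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "Natural-language search query" },
      "limit": { "type": "integer", "default": 5 }
    },
    "required": ["query"]
  }
}
```

The description is doing real work: it's the only signal the model has for deciding when this tool is relevant.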

For browser-based or remote access, OKB runs as an HTTP server:

okb token create --db personal -d "browser"
# → okb_personal_rw_a1b2c3d4...

okb serve --http --host 0.0.0.0 --port 8080

This is how my wife and I share the farming KB – same server, different tokens, different tools connecting to it. It's also how the same knowledge base serves both my local coding sessions and my browser-based conversations. The MCP server doesn't care who's asking; it just answers queries.

The save_knowledge tool is the one that closes the loop. When an LLM learns something useful during a conversation – a decision, a summary, a synthesised insight – it can save it back to the knowledge base. Next session, that information is searchable. This is the "memory" part: not just retrieval, but accumulation.

I also added trigger_sync so the LLM can pull in fresh data from external sources – GitHub issues, Todoist tasks, Dropbox Paper documents – without me having to run a cron job or remember to re-ingest. All of those syncs are incremental, so it's reasonably cheap to just run a full resync once in a while, at least at my volumes.
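The generic shape of an incremental sync is small: ask the source only for items changed since a stored cursor, ingest them, advance the cursor. A sketch under that assumption – OKB's plugin interface differs in detail:

```python
def incremental_sync(fetch_changed, ingest, cursor=None):
    # fetch_changed returns (changed items since `cursor`, new cursor);
    # a full resync is just calling this with cursor=None
    items, new_cursor = fetch_changed(cursor)
    for item in items:
        ingest(item)
    return new_cursor

# Toy source: pretends two issues changed since the last sync
def fake_fetch(cursor):
    return (["issue-41", "issue-42"], "2025-01-01T00:00:00Z")

seen = []
cursor = incremental_sync(fake_fetch, seen.append)
```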

What it's actually like to use

After a few months, some honest observations.

It works. Genuinely. Having the LLM automatically search my notes when I ask about something we discussed previously is a qualitative shift in how useful the tool is. I no longer maintain a separate document of "things to explain at the start of each session." The MCP integration means it happens invisibly – I ask a question, the model searches, I get an answer grounded in my actual notes. With the amount of data I have – thousands of chunks, not millions – the semantic search is surprisingly reliable. I can't recall a case where it failed to surface a relevant document.

The multi-database setup turned out to matter more than I expected, but not for the reason I originally built it. Yes, keeping personal and work separate is good hygiene. But the killer feature is that my wife and I share a farming database – we're running a small operation and she has her own Claude.ai setup pointed at the same KB via the HTTP server. She adds notes about crop planning and tools, I add notes about our apple tree growing project, and both our Claudes can search across all of it. Shared memory for a shared project, without either of us having to forward emails or maintain a document.

The most valuable documents in my knowledge base aren't always my original notes. They're the syntheses the LLM created – summaries that connect dots across multiple source documents. OKB has a synthesis pipeline where the LLM proposes connections across your notes, and you approve or reject them before they enter the knowledge base. LLMs are quite good at noticing patterns you've never explicitly connected, and the save_knowledge tool means those insights persist. It's like having a research assistant who actually files their work.

One thing Kristoffer flagged that I've also noticed: removing all that friction is dangerous. When context-switching costs nothing, you stop taking breaks. He wrote a whole follow-up about deliberately re-introducing friction, and he's right. If you're the type – like us – who tends to just keep going, be warned.

Would I recommend building this?

If you're using an MCP-capable LLM regularly and you have a meaningful collection of notes, documents, or project knowledge – yes, probably. The core (pgvector + chunking + MCP server) is a weekend project if you already know Python and Postgres. The quality-of-life features (sync plugins, synthesis, multi-database) are what took weeks of using-building-testing-discarding-rebuilding.

OKB itself is open source and installable via pipx if you'd rather just use it than build your own. But honestly, the interesting part isn't my specific implementation – it's the pattern. A local database, a chunking strategy that preserves context, an embedding pipeline that doesn't require a permanent GPU, and an MCP server that lets the LLM drive the retrieval. Those pieces snap together in any language and with any vector store.

The bigger insight for me was that giving an AI persistent memory changes the relationship. It stops being a smart autocomplete that you have to re-brief every morning, and starts being something more like an over-eager PA with perfect memory. Even if it can't bring me coffee (Kristoffer's probably can soon), that's worth the Docker container.


The code is at github.com/haard/okb. Install with pipx install okb. If you build something similar or have questions, I'm @motmanniska@mastodon.nu.