Cleaning sensitive data from test (PG) databases
As someone who does a fair bit of data engineering work for clients with sensitive (and sometimes really sensitive) data, I find that creating life-like data sets for testing, reproducing bugs, and so on can be challenging: compliance often dictates that data cannot be accessed for any purpose other than the primary business case – so no loading into staging databases that testers can reach, and absolutely no loading into local test databases.
To help me I built datamask, a small Python tool that masks PII/PHI in a PostgreSQL database in place, keeping primary and foreign keys intact so the relational structure (and any application code that depends on it) still works.
The workflow is straightforward:
- Copy your database somewhere safe (probably on the production server).
- datadict introspects a schema and writes a CSV data dictionary listing every column. You edit it, mark the columns that contain PII, and pick a faker type for each.
- dataclean reads the edited CSV and rewrites the flagged columns row by row using the chosen fakers.
- Dump your now-clean database copy and use it in dev work – I usually put it on some S3-compatible storage.
When the schema changes, you re-run datadict with your previous CSV as a seed so you only
have to classify new or modified columns.
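To make the dictionary-and-seed idea concrete, here is a minimal sketch of what that step could look like. The CSV layout (table, column, data_type, pii, faker) and the function names are my assumptions for illustration, not datadict's actual format or API:

```python
# Illustrative sketch only: the CSV columns and helper names below are
# assumptions, not datadict's real format or API.
import csv

# The kind of introspection query such a tool would run against
# PostgreSQL's information_schema (shown here, not executed):
COLUMNS_SQL = """
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = %s
ORDER BY table_name, ordinal_position
"""

DICT_HEADER = ["table", "column", "data_type", "pii", "faker"]

def dictionary_rows(columns, seed=None):
    """Build data-dictionary rows from (table, column, data_type)
    tuples, carrying over pii/faker classifications from a previous
    run's CSV so only new or changed columns need manual review."""
    seed = seed or {}
    return [
        [table, column, data_type, *seed.get((table, column), ("", ""))]
        for table, column, data_type in columns
    ]

def write_dictionary(path, rows):
    """Write the (hypothetical) CSV data dictionary to disk."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(DICT_HEADER)
        writer.writerows(rows)
```

The seed dictionary is what makes re-runs cheap: already-classified columns keep their pii/faker values, and anything new comes out blank for you to fill in.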
It works entirely in Python – rows are pulled out, faked values are generated, and the results are written back. That keeps the faker library and the masking logic in one place and makes it easy to add new types, but it also means it is not fast on databases of any meaningful size. For the use case (one-off scrubs of a copy taken from production into a locked-down environment) that has been an acceptable trade-off.
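A rough sketch of that row-by-row approach, assuming a psycopg2-style cursor; the function names and the toy fakers here are mine for illustration (the real tool uses a proper faker library and its own API):

```python
# Illustrative sketch of row-by-row masking; not datamask's actual code.
import random

# Toy stand-ins for a real faker library's generators.
FAKERS = {
    "first_name": lambda: random.choice(["Alice", "Bob", "Carol", "Dan"]),
    "email": lambda: "user%06d@example.com" % random.randrange(10**6),
}

def mask_row(row, plan):
    """Return a copy of `row` (a column->value dict) with every column
    named in `plan` replaced by a value from its faker type."""
    masked = dict(row)
    for column, faker_type in plan.items():
        masked[column] = FAKERS[faker_type]()
    return masked

def mask_table(cursor, table, pk, plan):
    """SELECT every row, fake the flagged columns, and UPDATE by
    primary key so foreign keys stay intact. Slow but simple.
    Identifiers are interpolated directly, so they must come from the
    trusted data dictionary, never from user input."""
    cols = [pk] + list(plan)
    cursor.execute("SELECT %s FROM %s" % (", ".join(cols), table))
    for record in cursor.fetchall():
        row = dict(zip(cols, record))
        masked = mask_row(row, plan)
        sets = ", ".join("%s = %%s" % c for c in plan)
        cursor.execute(
            "UPDATE %s SET %s WHERE %s = %%s" % (table, sets, pk),
            [masked[c] for c in plan] + [row[pk]],
        )
```

Because only the flagged columns are rewritten and the update is keyed on the primary key, key columns and the relational structure pass through untouched.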
Status: stable-ish and working; PostgreSQL only; many hacks and TODOs left. Released under a do-what-you-want license. Never run it against a production database unless you feel like testing your backup restore procedure.