Searching a 1.2 GB MongoDB dump with Claude

A client handed me a 1.2 GB MongoDB database dump and asked me to find where a specific piece of text was stored. The text — a Dutch veterinary note reading “buikje niet meer zo hard” — was somewhere in there, but neither of us knew which collection it lived in. I’m no MongoDB expert, so my usual instinct of opening a shell and firing off a query wasn’t going to cut it.

The challenge

The dump was a single .out file produced by mongodump --archive. That format is a binary archive of BSON documents, not plain text, so a plain grep only reports that a binary file matches instead of showing the text. The database also contained dozens of collections: a mix of named ones like auditlog and bean@apennootje@Patient, and a large number of dynamically named octopus-* collections that store per-practice clinical records.
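To illustrate the problem on a small scale, here is a minimal sketch that builds a made-up binary sample (not the real archive) and compares plain grep with grep -a. It assumes GNU or BSD grep; the file contents and the octopus-demo name are invented for the demonstration.

```shell
# Build a tiny stand-in for a binary dump: NUL and control bytes around text.
# The contents are invented; this is not the mongodump archive layout.
sample="$(mktemp)"
printf 'octopus-demo\000\001\002buikje niet meer zo hard\000\003' > "$sample"

# Plain grep notices the NUL bytes and only reports that a binary file matches.
grep "buikje" "$sample"

# -a treats the file as text; -o prints only the matching part.
grep -a -o "buikje niet meer zo hard" "$sample"
```

The second command prints the phrase itself, which is the behaviour the later byte-offset search relies on.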

What we did

Claude suggested a two-step approach. First, use the strings utility to extract all human-readable text from the binary file, then pipe that into grep to find the exact phrase:

strings /tmp/daptexel/apennootje.out | grep -i "buikje niet meer zo hard"

That immediately returned a hit. Next we needed to know which collection the document belonged to. The archive format embeds collection names as strings in the binary, so we found the byte offset of our target text with grep -boa (-b prints byte offsets, -o prints only the match, -a treats the binary as text), then listed every octopus-* collection name with its offset and kept the last few that come before the target:

grep -boa "buikje niet meer zo hard" apennootje.out
# → 170936179:buikje niet meer zo hard

grep -boa "octopus-[a-z0-9.]*" apennootje.out \
  | awk -F: '$1 < 170936179 {print}' \
  | tail -3

The last collection reference before byte 170,936,179 was octopus-animana.import, an import collection populated during a data migration from the Animana veterinary practice management system. The octopus-* naming isn’t a MongoDB concept; it’s an application-level convention the software uses to keep each practice’s clinical records in its own collection.
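The two commands can be folded into one small helper. The sketch below is not what we ran at the time; nearest_collection is a hypothetical name, and it rests on two assumptions: that the collection name is written to the file before the documents it labels, and that every collection of interest matches the octopus-* pattern. A name that happens to appear inside document data would produce a false hit, so treat the result as a strong hint, not proof.

```shell
#!/bin/sh
# Hypothetical helper: print the last "octopus-*" name that appears before
# the first occurrence of a phrase in a binary file.
# Usage: nearest_collection <archive-file> <phrase>
nearest_collection() {
    file=$1
    phrase=$2

    # Byte offset of the first occurrence of the phrase.
    # -b: byte offset, -o: only the match, -a: treat binary as text,
    # -F: take the phrase literally instead of as a regex.
    offset=$(grep -boaF -- "$phrase" "$file" | head -n 1 | cut -d: -f1)
    [ -n "$offset" ] || return 1

    # List every collection-like name as "offset:name", remember the last
    # one whose offset is below the phrase offset, and print it at the end.
    grep -boa 'octopus-[a-z0-9.]*' "$file" |
        awk -F: -v limit="$offset" '
            ($1 + 0) < (limit + 0) { name = $2 }
            END { if (name != "") print name }'
}
```

Called as nearest_collection apennootje.out "buikje niet meer zo hard", it prints a single collection name, or returns a non-zero status if the phrase is not in the file.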

Why this was easy with Claude

Without Claude I would have needed to either restore the entire 1.2 GB dump into a running MongoDB instance and then figure out the right query syntax, or somehow know in advance about the strings + byte-offset trick. Claude knew the mongodump archive format, suggested the right Unix tools, and interpreted the surrounding strings to identify the collection — all without needing a live database. For someone who doesn’t live in MongoDB day-to-day, that kind of guided shell work is a significant time-saver.
