How to find, and obliterate, large files in the history of a Subversion repository

Sometimes, as I have, you'll find yourself working with colleagues who, through no fault of their own, either are not acquainted with the etiquette of Subversion repository use or simply have an accident. What you may then end up with is a repository that contains one or more giant blobs of useless data that really should never have been added in the first place. Even if the culprit removes these giant blobs in a subsequent revision, with the best of intentions, they remain in the repository's history, and you're still left with a huge chunk of nothing-much wasting space on your server's hard drive.

Though it has long been an item on Subversion's wishlist, there is no command that will simply obliterate files from the repository's history. Nevertheless, there is a way to achieve this. Here's how.

The first step of the process is to determine which files need to go. (Some snippets in the following are derived from Stack Overflow and the Christosoft blog.) First, find the size of each revision in the repository. Replacing REPO appropriately, run this on your server:

REPO=[/absolute/path/to/repo]
# List every revision number, then measure the size of each revision's diff
for r in $(svn log -q "file://$REPO" | grep '^r' | cut -d ' ' -f 1 | tr -d r)
do
    echo "revision $r is $(svn diff -c "$r" "file://$REPO" | wc -c) bytes"
done
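
On a repository with thousands of revisions, that output is a lot to read through by eye. Here is a minimal sketch of a filter that keeps only the biggest revisions. The function name largest_revisions is my own invention for illustration, and it assumes the "revision N is SIZE bytes" line format printed by the loop above:

```shell
# Hypothetical helper (not part of Subversion): reads lines of the form
# "revision N is SIZE bytes" on stdin and prints the COUNT largest
# revisions, biggest first. COUNT defaults to 10.
largest_revisions() {
    count="${1:-10}"
    # Put the size first so sort can order on it numerically, descending.
    awk '{ print $4, $2 }' |
        sort -k1,1nr |
        head -n "$count" |
        awk '{ printf "r%s\t%s bytes\n", $2, $1 }'
}
```

To use it, pipe the loop into it by changing the last line of the loop to something like: done | largest_revisions 5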

Then choose likely candidate revisions, and check to see how big the changed files were:

REV=[12345]
svn list -vR file://$REPO@$REV | grep "^ *$REV" | less

Make a list of each of the files you don't want. To remove those files permanently from the repo:

mv $REPO $REPO.bak
svnadmin dump $REPO.bak > $REPO.dump
svndumpfilter exclude [paths/to/files] [you/want/to/remove] [...] < $REPO.dump > $REPO.dump.filtered
svnadmin create $REPO; svnadmin load $REPO < $REPO.dump.filtered

Once you have confirmed that the new repository is fine, delete $REPO.bak along with the two dump files. Done! If all went well, the repository should appear as if those files had never been added. (You might call this revisionist revision control.)
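
What "confirming that it's fine" involves is up to you, but here is a rough sketch of a sanity check you could run before deleting anything. The function name check_filtered_repo is hypothetical. It relies on svnadmin verify and svnlook youngest, and assumes you ran svndumpfilter without --drop-empty-revs, so that revision numbers are preserved:

```shell
# Hypothetical helper: sanity-check a filtered repository against its backup.
# Usage: check_filtered_repo /path/to/repo /path/to/repo.bak
check_filtered_repo() {
    new="$1"
    old="$2"
    # svnadmin verify reads back every revision and fails on corruption.
    svnadmin verify -q "$new" || return 1
    # Assumption: svndumpfilter was run without --drop-empty-revs, so the
    # filtered repository should end at the same revision as the original.
    new_head="$(svnlook youngest "$new")"
    old_head="$(svnlook youngest "$old")"
    if [ "$new_head" != "$old_head" ]; then
        echo "revision mismatch: $new_head vs $old_head" >&2
        return 1
    fi
    echo "ok: $new verified at r$new_head"
}
```

This only shows that the new repository is readable and has the expected number of revisions. It's still worth checking out a fresh working copy and looking for the files you care about.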


Update 2021-05-04: Why not use pipes and avoid the (massive) intermediary disk usage? Observe:

svnadmin create $REPO.new
svnadmin dump $REPO | svndumpfilter exclude [paths/to/files] [etc...] | svnadmin load $REPO.new

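If that completes successfully, you can then mv $REPO $REPO.bak (or delete it, if you're feeling brave), and then mv $REPO.new $REPO to replace it. Bear in mind that the shell normally reports only the exit status of the last command in a pipeline, so check the output for errors from svnadmin dump and svndumpfilter as well.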
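
If you'd rather not type those two mv commands by hand at the critical moment, the swap can be wrapped up with a couple of safety checks. This is only a sketch: the function name swap_in_filtered_repo is made up, and it assumes the .new and .bak naming used above:

```shell
# Hypothetical helper: replace a repository with its filtered copy,
# keeping the original as a backup.
# Usage: swap_in_filtered_repo /absolute/path/to/repo
swap_in_filtered_repo() {
    repo="$1"
    # Both the live repository and the filtered copy must be present.
    if [ ! -d "$repo" ] || [ ! -d "$repo.new" ]; then
        echo "expected both $repo and $repo.new to exist" >&2
        return 1
    fi
    # Never clobber an earlier backup.
    if [ -e "$repo.bak" ]; then
        echo "$repo.bak already exists; refusing to overwrite it" >&2
        return 1
    fi
    # Only move the new repository into place if the backup move worked.
    mv "$repo" "$repo.bak" && mv "$repo.new" "$repo"
}
```

Call it as swap_in_filtered_repo "$REPO". It deliberately never deletes anything, so removing $REPO.bak remains a separate, manual decision.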


Comment to add? Send me a message: <brendon@quantumfurball.net>
