Let’s say you have a filesystem with a directory containing a huge number of files, something like half a million. What’s the fastest way to delete that many files?
Never use shell expansion
Before answering the question, let’s state the obvious: you should never use shell expansion when dealing with a huge number of files.
cd huge_directory
rm -f *
-bash: /bin/rm: Argument list too long
As you can see, rm didn’t even start: shell expansion produced a command line that exceeds the ARG_MAX limit (fixed at 128 KB before kernel 2.6.23, and a quarter of the stack size limit since then). So if you insist on using rm for the job (and you shouldn’t), at least do it the right way:
rm -rf huge_directory
rm will then list the files and directories to delete internally, without ever building a giant argument list.
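Incidentally, you can check the limit on your own system with getconf (on modern kernels the value depends on your stack size limit, so it varies from machine to machine):
getconf ARG_MAX    # e.g. 2097152 with the default 8 MB stack limit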
Using find
“If you can’t do it in one sitting, divide and make a loop.” That’s the strategy of the find command. First let’s use it with the -exec action:
time find ./ -type f -exec rm {} \;
real 14m51.735s
user 2m24.330s
sys 9m48.743s
Not so great. In fact, using find this way is very inefficient because it spawns an external rm process for each file! Luckily, find has its own built-in -delete action:
time find ./ -type f -delete
real 5m11.937s
user 0m1.259s
sys 0m28.441s
That’s much better, but we can do better.
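As an aside, if your find doesn’t support -delete, you can still avoid the one-process-per-file penalty by batching arguments. Both of the following pass many filenames to each rm invocation, which should land somewhere between the two timings above:
find ./ -type f -exec rm -f {} +
find ./ -type f -print0 | xargs -0 rm -f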
Using rsync
Using rsync for this task may seem a little strange at first, but it works really, really well:
time rsync -a --delete emptydir/ huge_directory/
real 2m52.502s
user 0m2.772s
sys 0m32.649s
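For the record, emptydir is just an empty scratch directory created beforehand. The whole sequence looks something like this (after the rsync, huge_directory itself still exists but is empty, so a plain rmdir finishes the job):
mkdir emptydir
rsync -a --delete emptydir/ huge_directory/
rmdir emptydir huge_directory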
Clearly, rsync is the winner. But why? It’s a little tricky.
When deleting a file you invoke the unlink system call. This call removes a ‘link’ (a directory entry) pointing to the file’s inode. Once no links remain, the system frees the inode and its associated space. Pretty simple stuff. But how does rm know which ‘links’ to unlink? By listing the directory’s contents with readdir (a C library function backed by the getdents system call).
Now here’s the thing: readdir doesn’t return entries in any meaningful order but seemingly at random (not truly at random, in fact: the order depends on inode numbers), and it fetches them in 32 KB batches. With a reasonable number of files that’s perfectly fine and quite efficient, but not in an overcrowded directory.
This behavior is the main reason why using ls or rm in a directory containing millions of files is such a pain in the ass: each operation makes hundreds of readdir calls.
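You can watch this happen for yourself: on Linux, glibc’s readdir() sits on top of the getdents64 system call, so strace can count the directory reads next to the unlinks (a quick diagnostic, assuming GNU coreutils rm, which deletes via unlinkat):
strace -c -e trace=getdents64,unlinkat rm -rf huge_directory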
On the other hand, rsync doesn’t hammer readdir this way: it reads the directory once, holds the whole file list in a single huge buffer, and then walks that list in reverse order. This way rsync can unlink file after file without ever glancing a second time at the directory structure. That’s a huge gain of time when dealing with millions of files.
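You can approximate the same trick with standard tools. The following is only a rough sketch of the idea, not rsync’s actual implementation; it assumes GNU coreutils and filenames without embedded newlines:
cd huge_directory
ls -UA > /tmp/filelist                     # one pass over the directory: -U keeps raw readdir order, -A skips . and ..
tac /tmp/filelist | xargs -d '\n' rm -f    # unlink in reverse order, batched by xargs, never re-reading the directory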