Let’s say you have a filesystem with a directory containing a huge number of files, something like half a million. What’s the fastest way to delete that many files?
Never use shell expansion
Before answering the question, let’s state the obvious: you should never use shell expansion when dealing with a huge number of files.
cd huge_directory
rm -f *
-bash: /bin/rm: Argument list too long
As you can see, rm didn’t even start: shell expansion produced a command line that exceeds the ARG_MAX limit (fixed at 128 KB before kernel 2.6.23, and a quarter of the stack size limit since then). So if you insist on using rm for the job (and you shouldn’t), at least do it the right way:
rm -rf huge_directory
rm will then list the files and directories to delete internally, without ever building a giant argument list.
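Incidentally, you can check the limit on your own system with getconf (on modern kernels the value depends on your stack size limit, so it varies from machine to machine):
getconf ARG_MAX    # e.g. 2097152 with the default 8 MB stack limit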
Using find
“If you can’t do it in one sitting, divide and make a loop.” That’s the strategy of the find command. First let’s use it with the -exec action:
time find ./ -type f -exec rm {} \;
real 14m51.735s
user 2m24.330s
sys 9m48.743s
Not so great. In fact, using find this way is very inefficient because it spawns an external rm process for each file! Luckily, find has its own built-in -delete action:
time find ./ -type f -delete
real 5m11.937s
user 0m1.259s
sys 0m28.441s
That’s much better, but we can do better.
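As an aside, if your find doesn’t support -delete, you can still avoid the one-process-per-file penalty by batching arguments. Both of the following pass many filenames to each rm invocation, which should land somewhere between the two timings above:
find ./ -type f -exec rm -f {} +
find ./ -type f -print0 | xargs -0 rm -f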
Using rsync
Using rsync for this task may seem a little strange at first, but it works really, really well:
time rsync -a --delete emptydir/ huge_directory/
real 2m52.502s
user 0m2.772s
sys 0m32.649s
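For the record, emptydir is just an empty scratch directory created beforehand. The whole sequence looks something like this (after the rsync, huge_directory itself still exists but is empty, so a plain rmdir finishes the job):
mkdir emptydir
rsync -a --delete emptydir/ huge_directory/
rmdir emptydir huge_directory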
Clearly, rsync is the winner. But why? It’s a little tricky.
When deleting a file you invoke the unlink system call. This call removes a ‘link’ (a directory entry) pointing to the file’s inode. Once no links remain, the system frees the inode and its associated space. Pretty simple stuff. But how does rm know which ‘links’ to unlink? By listing the directory’s contents with readdir (a C library function backed by the getdents system call).
Now here’s the thing: readdir doesn’t return entries in any meaningful order but seemingly at random (not truly at random, in fact: the order depends on inode numbers), and it fetches them in 32 KB batches. With a reasonable number of files that’s perfectly fine and quite efficient, but not in an overcrowded directory.
This behavior is the main reason why using ls or rm in a directory containing millions of files is such a pain in the ass: each operation makes hundreds of readdir calls.
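You can watch this happen for yourself: on Linux, glibc’s readdir() sits on top of the getdents64 system call, so strace can count the directory reads next to the unlinks (a quick diagnostic, assuming GNU coreutils rm, which deletes via unlinkat):
strace -c -e trace=getdents64,unlinkat rm -rf huge_directory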
On the other hand, rsync doesn’t hammer readdir this way: it reads the directory once, holds the whole file list in a single huge buffer, and then walks that list in reverse order. This way rsync can unlink file after file without ever glancing a second time at the directory structure. That’s a huge gain of time when dealing with millions of files.
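You can approximate the same trick with standard tools. The following is only a rough sketch of the idea, not rsync’s actual implementation; it assumes GNU coreutils and filenames without embedded newlines:
cd huge_directory
ls -UA > /tmp/filelist                     # one pass over the directory: -U keeps raw readdir order, -A skips . and ..
tac /tmp/filelist | xargs -d '\n' rm -f    # unlink in reverse order, batched by xargs, never re-reading the directory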