Duplicate Files
Tuesday, January 3, 2012
A few months ago, Jon Cooper wrote a duplicate file checker in Go and Ruby.
Below, I contribute a simple version in Factor that runs faster than both the Go and Ruby solutions. In the spirit of the original article, I have separated the logic into steps.
Argument Parsing
The command-line vocabulary gives us the arguments passed to the script. We check for the --verbose flag and take the first remaining argument as the root directory to traverse:
: arg? ( name args -- args' ? )
    2dup member? [ remove t ] [ nip f ] if ;

: parse-args ( -- verbose? root )
    "--verbose" command-line get arg? swap first ;
Filesystem Traversal
We can traverse the filesystem with the each-file word (choosing breadth-first instead of depth-first). In our case, we collect the files into a map from each filename to all of the full paths that share it:
: collect-files ( path -- assoc )
    t H{ } clone [
        '[ dup file-name _ push-at ] each-file
    ] keep ;
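For a hypothetical /tmp/docs tree containing two copies of readme.txt and a single notes.txt, the resulting assoc would look roughly like this, with push-at accumulating the full paths into vectors keyed by filename:

H{
    { "notes.txt" V{ "/tmp/docs/a/notes.txt" } }
    { "readme.txt" V{ "/tmp/docs/a/readme.txt" "/tmp/docs/b/readme.txt" } }
}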
Our duplicate files are those that share a common filename, i.e. any entry in the assoc that maps to more than one path:
: duplicate-files ( path -- dupes )
    collect-files [ nip length 1 > ] assoc-filter! ;
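Applied to the hypothetical assoc above, only the filename that maps to more than one path survives the filter:

H{ { "readme.txt" V{ "/tmp/docs/a/readme.txt" "/tmp/docs/b/readme.txt" } } }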
MD5 Hashing Files
Using the checksums.md5 vocabulary, hashing a file is quite simple:
: md5-file ( path -- string )
    md5 checksum-file hex-string ;
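For example, if the hypothetical readme.txt copies above happen to be empty files, each hashes to the well-known MD5 digest of empty input:

( scratchpad ) "/tmp/docs/a/readme.txt" md5-file .
"d41d8cd98f00b204e9800998ecf8427e"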
Printing Results
If the --verbose flag was given, we print each filename along with the MD5 checksum of each full path that shares it:
: print-md5 ( name paths -- )
    [ "%s:\n" printf ] [
        [ dup md5-file " %s\n %s\n" printf ] each
    ] bi* ;
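Continuing the hypothetical example, and still assuming both copies of readme.txt are empty files so their digests match, the verbose output for that entry would look like this:

readme.txt:
 /tmp/docs/a/readme.txt
 d41d8cd98f00b204e9800998ecf8427e
 /tmp/docs/b/readme.txt
 d41d8cd98f00b204e9800998ecf8427e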
We put this all together by calculating the possible duplicate files, optionally printing the verbose MD5 checksums, and then printing the total number of duplicates detected:
: run-dupe ( -- )
    parse-args duplicate-files swap
    [ dup [ print-md5 ] assoc-each ] when
    assoc-size "Total duped files found: %d\n" printf ;
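Assuming the definitions above are saved to a dupes.factor script that calls run-dupe at the end, a hypothetical run over the example tree would look something like this:

$ factor dupes.factor --verbose /tmp/docs
readme.txt:
 /tmp/docs/a/readme.txt
 d41d8cd98f00b204e9800998ecf8427e
 /tmp/docs/b/readme.txt
 d41d8cd98f00b204e9800998ecf8427e
Total duped files found: 1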
Performance
I tested performance using two directory trees, one with over 500 files and another with almost 36,000 files. While the original article focuses more on syntax than speed, it is nice to see that the Factor solution is faster than the Go and Ruby versions.
Duplicates | Factor (s) | Go (s) | Ruby (s)
-----------|------------|--------|---------
583        | 1.453      | 2.298  | 3.861
35,953     | 19.084     | 24.452 | 30.597
Note: times above are in seconds, measured on my laptop.
The code for this is on my GitHub.