Time complexity of finding duplicate files in bash

I had to write a Bash script to delete duplicate files today, using their md5 hashes. I stored those hashes as files in a temporary directory:
for i in * ; do
    hash=$(md5sum "$i" | cut -d " " -f1) ;
    if [ -f "/tmp/hashes/$hash" ] ;
    then
        echo "Deleted $i" ;
        mv "$i" /tmp/deleted ;
    else
        touch "/tmp/hashes/$hash" ;
    fi ;
done
It worked perfectly, but it led me to wonder: is this a time-efficient way of doing it? I initially thought of storing the MD5 hashes in a file, but then I thought "no, because checking whether a given MD5 is in that file requires re-reading it entirely every time". Now I wonder: is it the same with the "create files in a directory" method? Does the Bash [ -f ] check have linear or quasi-constant complexity when there are lots of files in the same directory?
If it depends on the filesystem, what's the complexity on tmpfs?

I'm a fan of using the right tool for the job. In this case, you only want to see duplicate files. I've tested this against several thousand files at my disposal, and re-reading the hash file did not seem to cause any problems. Plus, I noticed that I have hundreds of duplicate files. When I stored hashes in separate files and then processed this large quantity of files, my system slowed to a crawl after about 10,000 hash files had accumulated in one directory. Having all of the hashes in a single file greatly sped this up.
# This uses md5deep. An alternate is presented later.
md5deep -r some_folder > hashes.txt
# If you do not have md5deep
find . -type f -exec md5sum {} \; > hashes.txt
This gives you hashes of everything.
cut -b -32 hashes.txt | sort | uniq -d > dupe_hashes.txt
That will use cut to get the hash for each file, sort the hashes, then find any duplicated hashes. Those are written to dupe_hashes.txt without the filenames attached. Now we need to map hashes back to the files.
(for hash in $(cat dupe_hashes.txt); do
    grep "^$hash" hashes.txt | tail -n +2 | cut -b 35-
done) > dupe_files.txt
This does not appear to run slowly for me. The Linux kernel does a very good job of keeping files like this in memory instead of reading them from the disk frequently. If you prefer to force this to be in memory, you could just use /dev/shm/hashes.txt instead of hashes.txt. I've found that it was unnecessary in my tests.
That gives you every file that's a duplicate. So far, so good. You'll probably want to review this list. If you want to list the original one as well, remove the tail -n +2 | bit from the command.
When you are comfortable that you can delete every listed file you can pipe things to xargs. This will delete the files in groups of 50.
xargs -L 50 rm < dupe_files.txt

I'll try to qualitatively answer how fast file existence tests are on tmpfs, and then I can suggest how you can make your whole program run faster.
First, tmpfs directory lookups rely (in the kernel) on directory entry cache hash table lookups, which aren't that sensitive to the number of files in your directory. They are affected, but sub-linearly. It has to do with the fact that properly-done hash table lookups take some constant time, O(1), regardless of the number of items in the hash table.
To explain, we can look at the work that is done by test -f, or [ -f X ], from coreutils (gitweb):
case 'e':
  unary_advance ();
  return stat (argv[pos - 1], &stat_buf) == 0;
...
case 'f':                       /* File is a file? */
  unary_advance ();
  /* Under POSIX, -f is true if the given file exists
     and is a regular file. */
  return (stat (argv[pos - 1], &stat_buf) == 0
          && S_ISREG (stat_buf.st_mode));
So it uses stat() on the filename directly. No directory listing is done explicitly by test, but the runtime of stat may be affected by the number of files in the directory. The completion time for the stat call will depend on the underlying filesystem implementation.
For every filesystem, stat will split up the path into directory components and walk it down. For instance, for the path /tmp/hashes/the_md5: first /, gets its inode, then looks up tmp inside it, gets that inode (it's a new mountpoint), then gets the hashes inode, and finally the tested filename and its inode. You can expect the inodes all the way down to /tmp/hashes/ to be cached, because they are looked up repeatedly at each iteration, so those lookups are fast and likely don't require disk access. Each lookup will depend on the filesystem the parent directory is on. After the /tmp/ portion, lookups happen on tmpfs (which is all in memory, except if you ever run out of memory and need to use swap).
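If you want to watch this path walk yourself, one rough way (purely illustrative, not part of the original script) is to run the external test binary under strace and pick out the stat-family calls; the exact syscall shown (stat, newfstatat, statx, ...) depends on your libc and kernel:
strace /usr/bin/test -f /tmp/hashes/d41d8cd98f00b204e9800998ecf8427e 2>&1 | grep -i stat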
tmpfs in Linux relies on simple_lookup to obtain the inode of a file in a directory. tmpfs lives in the kernel tree under its old name, in mm/shmem.c. tmpfs, much like ramfs, doesn't implement data structures of its own to keep track of virtual data; it simply relies on the VFS directory entry caches (see the kernel documentation on Directory Entry Caches).
Therefore, I suspect the lookup for a file's inode, in a directory, is as simple as a hash table lookup. I'd say that as long as all your temporary files fit in your memory, and you use tmpfs/ramfs, it doesn't matter how many files are there -- it's O(1) lookup each time.
Other filesystems, like ext2 (or ext3 without directory indexing), will however incur a penalty linear in the number of files present in the directory.
storing them in memory
As others have suggested, you may also store MD5s in memory by storing them in bash variables, and avoid the filesystem (and associated syscall) penalties. Storing them on a filesystem has the advantage that you could resume from where you left off if you were to interrupt your loop (your md5 could be a symlink to the file whose digest matches, which you could rely on in subsequent runs), but it is slower.
MD5=d41d8cd98f00b204e9800998ecf8427e
let SEEN_${MD5}=1
...
digest=$(md5sum "$filename" | cut -d " " -f1)
let exists=SEEN_$digest
if [[ "$exists" == 1 ]]; then
    echo "already seen this file: $filename"
fi
faster tests
And you may use [[ -f my_file ]] instead of [ -f my_file ]. [[ is a bash keyword and [ is a bash builtin, so neither spawns a new process, but [[ is parsed more efficiently and is somewhat faster; both are far faster than explicitly invoking the external /usr/bin/[ (or /usr/bin/test) for each comparison.
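A quick, rough way to convince yourself of the relative costs (a machine-dependent micro-benchmark of my own, not from the original answer):
time for i in $(seq 10000); do [ -f /etc/passwd ]; done          # builtin [
time for i in $(seq 10000); do [[ -f /etc/passwd ]]; done        # bash keyword [[
time for i in $(seq 10000); do /usr/bin/[ -f /etc/passwd ]; done # external binary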
what is /usr/bin/[
/usr/bin/test and /usr/bin/[ are two different programs, but the source code for [ (lbracket.c) is the same as test.c (again in coreutils):
#define LBRACKET 1
#include "test.c"
so they're interchangeable.

The choice between reading the contents of a file containing hashes and finding a hash in a directory of filenames that are the hashes basically comes down to "is the kernel quicker at reading a directory or your program at reading a file". Both are going to involve a linear search for each hash, so you end up with much the same behaviour. You can probably argue that the kernel should be a little quicker, but the margin won't be great. Note that most often, the linear search will be exhaustive because the hash won't exist (unless you have lots of duplicate files). So, if you're processing a few thousand files, the searches will process a few million entries overall — it is quadratic behaviour.
If you have many hundreds or thousands of files, you'd probably do better with a two-level hierarchy — for example, a directory containing two-character sub-directories 00 .. FF, and then storing the rest of the name (or the full name) in the sub-directory. A minor variation of this technique is used in the terminfo directories, for example. The advantage is that the kernel only has to read relatively small directories to find whether the file is present or not.
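Applied to the original loop, a minimal sketch of that fan-out could look like the following (the /tmp paths and the two-character split are illustrative, not from the answer):
hash=$(md5sum "$i" | cut -d " " -f1)
prefix=${hash:0:2}                      # first two hex characters, 00 .. ff
mkdir -p "/tmp/hashes/$prefix"
if [ -f "/tmp/hashes/$prefix/$hash" ]; then
    echo "Deleted $i"
    mv "$i" /tmp/deleted
else
    touch "/tmp/hashes/$prefix/$hash"
fi
This caps the top-level directory at 256 entries and spreads the hash files roughly evenly across the sub-directories.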

I haven't "hashed" this out, but I'd try storing your md5sums in a bash hash.
See How to define hash tables in Bash?
Store the md5sum as the key and, if you want, the filename as the value. For each file, just see if the key already exists in the hash table. If so, you don't care about the value, but you could use it to print out the original duplicate file's name. Then delete the current file (the one with the duplicate key). Not being a bash expert, that's where I'd start looking.
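For example, with a bash 4+ associative array, a minimal sketch (the /tmp/deleted path mirrors the question; everything else is illustrative) might be:
declare -A seen                          # requires bash 4+
for i in * ; do
    [ -f "$i" ] || continue
    hash=$(md5sum "$i" | cut -d " " -f1)
    if [[ -n "${seen[$hash]}" ]]; then
        echo "Deleted $i (duplicate of ${seen[$hash]})"
        mv "$i" /tmp/deleted
    else
        seen[$hash]=$i
    fi
done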

Related

Cannot mount a split ISO after concatenation

I have a split ISO (7 files of 7 GB each) which I concatenate in the terminal as
cat /Volumes/Blah/*.iso > /Volumes/Blah/concatenated.iso
I can see that the concatenated.iso file has a size of 7 x 7 GB, but when I use any mounting software afterwards on Mac OS X (I tried Keka and Disk Utility), the mounted disk shows a size of only 7 GB and seems to contain only the first part. What am I doing wrong here?
Assuming the split parts do not have headers of their own, and can simply be concatenated:
You need to ensure they are concatenated in the correct order. The shell expands cat /Volumes/Blah/*.iso in its collation order, which may not match the intended part order (for example, part10 sorts before part2, and locale settings can shuffle things further). Run echo /Volumes/Blah/*.iso to see which order you actually get.
So list all the files manually in the correct order, like cat /Volumes/Blah/foo_1.iso /Volumes/Blah/foo_2.iso /Volumes/Blah/foo_3.iso > /Volumes/Blah/concatenated.iso. If you have a very large number of files you can use a for loop, as sketched below, but if this is a one-time job you're probably faster just copying, pasting, and modifying the path of each file manually.
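If the parts do follow a simple numeric naming scheme, that loop could look like this (the foo_N.iso names are an assumption; adjust them to your actual filenames):
for n in 1 2 3 4 5 6 7; do
    cat "/Volumes/Blah/foo_${n}.iso"
done > /Volumes/Blah/concatenated.iso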

Merging CSVs into one sees exponentially bigger size

I have 600 CSV files of ~1 MB each, for a total of roughly 600 MB. I want to put all of them into a sqlite3 db. So my first step would be to merge them into one big CSV (of ~600 MB, right?) before importing it into a SQL db.
However, when I run the following bash command (to merge all files keeping one header):
cat file-chunk0001.csv | head -n1 > file.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> file.csv; done
The resulting file.csv has a size of 38 GB, at which point the process stops because I have no space left on the device.
So my question is: why would the merged file be more than 50 times bigger than expected? And what can I do to put them in a sqlite3 db with a reasonable size?
I guess my first question is: if you know how to do a for loop, why do you need to merge all the files into a single CSV file? Can't you just load them one after the other?
But your real problem is a feedback loop: your wildcard (*.csv) includes file.csv, the very file you're writing to, so at some point the loop cats file.csv while appending to it and the file keeps growing until the disk fills up. You could put your output file in a different directory, or make sure your file glob does not include the output file (for f in file-*.csv, maybe), as in the sketch below.
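A sketch of a safer merge, assuming the chunks really are all named file-chunk*.csv, writing the output where the glob cannot pick it up:
out=/tmp/merged.csv
head -n 1 file-chunk0001.csv > "$out"        # keep a single header line
for f in file-chunk*.csv; do
    tail -n +2 "$f" >> "$out"                # skip each chunk's header
done
You could also skip the intermediate file entirely and load each chunk into sqlite3 one at a time with the command-line shell's .import command.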

How to piece together a split file when there is not enough disk space

I was sent 194 GB worth of data split into 1984 files.
There is only 37 GB left on my disk, and there are no other disks with that amount of free space. Obviously, this is not going to work:
cat file.tar.gz.part* > file.tar.gz
I'm looking for a way to incrementally piece this huge file together.
I might end up writing the script myself, but I'm posting here for the community.
We need to assume the large file was split using a naming convention
Original file=LargeFile.bin
Split files=(LargeFile.split.aaa, LargeFile.split.aab, ...)
The script to recover would then be:
outfile=LargeFile.recovered.bin
for i in LargeFile.split.* ; do
    cat "${i}" >> "${outfile}"
    rm -f "${i}"
done
Simple but handy when there is not enough space to do it in one move

List only files that are unencrypted

First off, I am not a Unix expert by any stretch, so please forgive a little naivety in my question.
I have a requirement to list the unencrypted files in a given directory that potentially contains both encrypted and unencrypted files.
I cannot reliably identify these files by file extension alone and was hoping someone in the SO community might be able to help me out.
I can run:
file * | egrep -w 'text|XML'
but that will only identify the files that are either text or XML. I could possibly use this if I can't do much better, as currently the only other files in the directory are text or XML files, but I really wanted to identify all unencrypted files, whatever type they may be.
Is this possible in a single line command?
EDIT: the encrypted files are encrypted via OpenSSL.
The command I use to decrypt the files is:
openssl enc -d -aes128 -in <encrypted_filename> -out <unencrypted_filename>
Your problem is not a trivial one. The Solaris file command uses "magic" - /etc/magic. This is a set of rules that attempt to determine what flavor a file is. It is not perfect.
If you read the /etc/magic file, note that the last column is the text that appears in the output of the file command when it recognizes something, i.e. some structure in a file.
Basically the file command looks at the first few bytes of a file, just like the exec() family of system calls does. So, #!/bin/sh in the very first line of a file, in the first characters of the line, identifies to exec() the "command interpreter" that exec() needs to invoke to "run" the file. file gets the same idea and says "command text", "awk text", etc.
Your issues are that you have to work out what types of files you are going to see as output from file. You need to spend time delving into the non-encrypted files to see what "answers" you can expect from file. Otherwise you can run file over the whole directory tree and sort out all of what you think are correct answers.
find /path/to/files -type f -exec file {} \; | nawk -F':' '!arr[$2]++' > outputfile
This gives you a list of distinct answers about what file thinks you have. Put the ones you like in a file, call it good.txt
find /path/to/files -type f -exec file {} \; > bigfile
nawk -F':' 'FILENAME=="good.txt" {arr[$1]++}
            FILENAME=="bigfile"  {if ($2 in arr) {print $1}}' good.txt bigfile > nonencryptedfiles.txt
THIS IS NOT 100% guaranteed. file can be fooled.
The way to identify encrypted files is by the amount of randomness, or entropy, they contain. Files that are encrypted (or at least files that are encrypted well) should look random in the statistical sense. Files that contain unencrypted information—whether text, graphics, binary data, or machine code—are not statistically random.
A standard way to calculate randomness is with an autocorrelation function. You'd probably need to autocorrelate only the first few hundred bytes of each file, so the process can be fairly quick.
It's a hack, but you might be able to take advantage of one of the properties of compression algorithms: they work by squeezing the redundancy out of data. Encrypted files cannot be compressed (or, again, at least not much), so you might try compressing some portion of each file and comparing the compression ratios.
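A crude sketch of that heuristic in shell (my own illustration, not from the answer): compress the first 64 KiB of each file and compare sizes, keeping in mind that already-compressed formats (JPEG, ZIP, gzip, ...) will also look random:
for f in *; do
    [ -f "$f" ] || continue
    raw=$(head -c 65536 "$f" | wc -c)                # bytes sampled
    packed=$(head -c 65536 "$f" | gzip -c | wc -c)   # bytes after compression
    echo "$f: $raw -> $packed"                       # packed close to raw suggests high entropy
done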
SO has several other questions about finding randomness or entropy, and many of them have good suggestions, like this one:
How can I determine the statistical randomness of a binary string?
Good luck!

How can I trim log files using Perl?

I have recently run into a situation where I need to trim some rather large log files once they grow beyond a certain size. Everything but the last 1000 lines in each file is disposed of; the job is run every half hour by cron. My solution was simply to run through the list of files, check the size, and trim if necessary.
for $file (@fileList) {
    if ( ((-s $file) / (1024 * 1024)) > $CSize ) {
        open FH, "$file" or die "Cannot open ${file}: $!\n";
        my @tLines;
        while (<FH>) {
            push @tLines, $_;
            shift @tLines if @tLines > $CLLimit;   # keep only the last $CLLimit lines
        }
        close FH;
        open FH, ">$file" or die "Cannot write to ${file}: $!\n";
        print FH @tLines;
        close FH;
    }
}
This works in the current form but there is a lot of overhead for large log files (especially the ones with 100_000+ lines) because of the need to read in each line and shift if necessary.
Is there any way I could read in just a portion of the file, e.g. in this instance I want to be able to access only the last "CLLimit" lines. Since the script is being deployed on a system that has seen better days (think Celeron 700MHz with 64MB RAM) I am looking for a quicker alternative using Perl.
I realize you're wanting to use Perl, but if this is a UNIX system, why not use the "tail" utility to do the trimming? You could do this in BASH with a very simple script:
if [ "$(stat -f "%z" "$file")" -gt "$MAX_FILE_SIZE" ]; then   # BSD stat; with GNU coreutils use: stat -c %s
    tail -n 1000 "$file" > "$file.tmp"
    # copy and then rm to avoid inode problems
    cp "$file.tmp" "$file"
    rm "$file.tmp"
fi
That being said, you would probably find this post very helpful if you're set on using Perl for this.
Estimate the average length of a line in the log - call it N bytes.
Seek backwards from the end of the file by 1000 * 1.10 * N (10% margin for error in the factor 1.10). Read forward from there, keeping just the most recent 1000 lines.
The question was asked - which function or module?
Built-in function seek looks to me like the tool to use?
Consider simply using the logrotate utility; it is included in most modern Linux distributions. A related tool for BSD systems is called newsyslog. These tools are designed more or less for your intended purpose: they atomically move a log file out of place, create a new file (with the same name as before) to hold new log entries, instruct the program generating messages to use the new file, and then (optionally) compress the old file. You can configure how many rotated logs to keep. Here's a potential tutorial:
http://www.debian-administration.org/articles/117
It is not precisely the interface you desire (keeping a certain number of lines) but the program will likely be more robust than what you will cook up on your own; for example, the answers here do not deal with atomically moving the file and notifying the log program to use a new file so there is the risk that some log messages are lost.
