Remove identical files in UNIX - shell

I'm dealing with a large number of files (30,000), each about 10 MB in size. Some of them (I estimate 2%) are actually duplicated, and I need to keep only one copy of every duplicated pair (or triplet).
Would you suggest an efficient way to do that? I'm working on Unix.

You can try this snippet to list all the duplicates before removing anything:
find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1] }(!($1 in seen)){seen[$1]=$2}'

I would write a script to create a hash of every file. You could store the hashes in a set, iterate over the files, and where a file hashes to a value already found in the set, delete the file. This would be trivial to do in Python, for example.
For 30,000 files, at 64 bytes per hash table entry, you're only looking at about 2 megabytes.
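A minimal sketch of that idea in bash rather than Python (an associative array plays the role of the set; /path and the choice of sha256sum are placeholders, and the rm is only echoed so you can review it before deleting anything):
declare -A seen
while IFS= read -r -d '' f; do
    h=$(sha256sum "$f" | cut -c1-64)    # content hash
    if [[ -n ${seen[$h]} ]]; then
        echo rm -- "$f"                 # duplicate of an earlier file
    else
        seen[$h]=$f
    fi
done < <(find /path -type f -print0)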

Find possible duplicate files:
find DIR -type f -exec sha1sum "{}" \; | sort | uniq -d -w40
Now you can use cmp to check that the files are really identical.
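For example, if the pipeline reports two files with the same checksum (the names below are placeholders), cmp confirms whether they match byte for byte:
cmp --silent fileA fileB && echo identical || echo different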

There is an existing tool for this: fdupes
Restoring a solution from an old deleted answer.
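Typical usage, assuming fdupes is installed (double-check the delete flags against your version's man page before trusting them):
fdupes -r /path      # list sets of duplicate files under /path
fdupes -rdN /path    # delete duplicates without prompting, keeping the first file of each set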

Write a script that first compares file sizes, then MD5 checksums (caching them, of course) and, if you're very anxious about losing data, bites the bullet and actually compares duplicate candidates byte for byte. If you have no additional knowledge about how the files came to be etc., it can't really be done much more efficiently.
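A rough sketch of that pipeline (GNU find and md5sum assumed; /path is a placeholder, nothing is deleted, and confirmed duplicates are only printed):
find /path -type f -printf '%s\n' | sort -n | uniq -d |
while read -r size; do
    # only files that share a size are worth checksumming
    find /path -type f -size "${size}c" -exec md5sum {} + | sort |
    awk 'prev == $1 { print prevfile "\t" substr($0, 35) }
         { prev = $1; prevfile = substr($0, 35) }'
done |
while IFS=$'\t' read -r a b; do
    # bite the bullet: confirm the candidates byte for byte
    cmp -s "$a" "$b" && printf 'duplicate: %s == %s\n' "$a" "$b"
done
It breaks on file names containing tabs or newlines, but it keeps the expensive work (checksums and byte comparisons) limited to files that already share a size.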

Save all the file names in an array, then traverse the array. In each iteration, compare the file's contents with the other files' contents using the md5sum command; if the MD5 is the same, remove the file.
For example, if file b is a duplicate of file a, the md5sum will be the same for both files.

Related

How do I filter down a subset of files based upon time?

Let's assume I have done lots of work whittling down the files in a directory to the 10 files that I am interested in. There were hundreds of files, and I have finally found the ones I need.
I can either pipe out the results of this (piping from ls), or I can say I have an array of those values (doing this inside a script). Doesn't matter either way.
Now, of that list, I want to find only the files that were created yesterday.
We can use tools like find -mtime 1 which are fine. But how would we do that with a subset of files in a directory? Can we pass a subset to find via xargs?
I can do this pretty easily with a for loop. But I was curious if you smart people knew of a one-liner.
If they're in an array:
files=(...)
find "${files[#]}" -mtime 1
If they're being piped in:
... | xargs -d'\n' -I{} find {} -mtime 1
Note that the second one will run a separate find command for each file which is a bit inefficient.
If any of the items are directories and you don't want to search inside of them, add -maxdepth 0 to disable find's recursion.
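If the one-find-per-file overhead matters, here is a sketch of a batched variant (GNU xargs assumed, and the names must not contain newlines):
printf '%s\n' "${files[@]}" |
  xargs -d '\n' sh -c 'find "$@" -maxdepth 0 -mtime 1' sh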
Another option that won't recurse, though I'd just use John's find solution if I were you.
$: stat -c "%n %w" "${files[@]}" | sed -n "
/ $(date +'%Y-%m-%d' --date=yesterday) /{ s/ .*//; p; }"
The stat will print the name and creation date of files in the array.
The sed "greps" for the date you want and strips the date info before printing the filename.

Recursively find hexadecimal bytes in binary files

I'm using grep within bash shell to find a series of hexadecimal bytes in files:
$ find . -type f -exec grep -ri "\x5B\x27\x21\x3D\xE9" {} \;
The search works fine, although I know there's a limitation when not using the -a option, where matches only return:
Binary file ./file_with_bytes matches
I would like to get the offset of the matching result, is this possible? I'm open to using another similar tool I'm just not sure what it would be.
There is actually a grep option for this:
-b, --byte-offset    Print the 0-based byte offset within the input file
A simple example using this option:
$ grep -obarUP "\x01\x02\x03" /bin
This prints out both the filename and the byte offset of each matched pattern under the directory:
/bin/bash:772067:
/bin/bash:772099:
/bin/bash:772133:
/bin/bash:772608:
/bin/date:56160:
Notice that find is actually not needed, since the -r option already takes care of the recursive file search.
Not at a computer, but use:
od -x yourFile
or
xxd yourFile
to get it dumped in hex with offsets on the left side.
Sometimes your search string may not be found because the bytes do not appear contiguously in the dump but are split across two lines. You can pass the file through twice, though, with the first 4 bytes chopped off the second time, to make sure your string is found intact on one pass or the other. Then add the offset back on, and sort and uniq the offsets.
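A sketch of a related workaround that sidesteps the line-splitting problem entirely: dump the file as one continuous hex string and search that. Each byte becomes two hex digits, so halve the reported offsets; the pattern shown is the one from the question, and yourFile is a placeholder:
xxd -p yourFile | tr -d '\n' |
  grep -ob '5b27213de9' |
  awk -F: '$1 % 2 == 0 { print "byte offset:", $1 / 2 }'    # odd offsets are nibble-misaligned false hits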

Bash script to filter out files based on size

I have a lot of log files which all have unique file names; however, judging by the sizes, many have exactly the same content (bot-generated attacks).
I need to filter out duplicate file sizes or include only unique file sizes.
95% are not unique, and I can see the file sizes, so I could manually choose sizes to filter out.
I have worked out
find . -size 48c | xargs ls -lSr -h
This will give me only the logs of exactly 48 bytes, and I could continue with this method to build a long list of included files.
uniq does not support file size, as far as I can tell
find does have a "not" option (!), so maybe that is where I should be looking?
How can I efficiently filter out the known duplicates?
Or is there a different method to filter and display logs based on unique size only?
One solution is:
find . -type f -ls | awk '!x[$7]++ {print $11}'
$7 is the filesize column; $11 is the pathname.
Since you are using find I assume there are subdirectories, which you don't want to list.
The awk part prints the path of the first file with a given size (only).
HTH
You nearly had it; does going with this provide a solution:
find . -size 48c | xargs
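Building on that, find's negation may be what you were after: once you know which sizes belong to the bot-generated duplicates, exclude them explicitly (48c and 1234c are example sizes):
find . -type f ! -size 48c ! -size 1234c -exec ls -lSrh {} +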

Find files in current directory, list differences from list within script

I am attempting to find the differences between a directory and a list of files stored inside the bash script itself, for portability.
For example, search a directory with phpBB installed. Compare recursive directory listing to list of core installation files (excluding themes, uploads, etc). Display additional and missing files.
Thus far, I have attempted using diff, comm, and tr, with "argument too long" errors. This is likely because the lists are lists of file names, so the tools end up trying to compare the actual files rather than the lists themselves.
The file list in the script looks something like this (But I am willing to format differently):
./file.php
./file2.php
./dir/file.php
./dir/.file2.php
I am attempting to use one of the following to print the list:
find ./ -type f -printf "%P\n"
or
find ./ -type f -print
Then use any command you can think of to compare the results to the list of files inside the script.
The following are difficult to use, as there are often thousands of files to check, each version can change the listings, and it is a pain to update the whole script every time there is a new release.
find . ! -wholename './file.php' ! -wholename './file2.php'
find . ! -name './file.php' ! -name './file2.php'
find . ! -path './file.php' ! -path './file2.php'
With the lists being in different orders to accommodate any additional files, it can't be a straight comparison.
I'm just stumped. I greatly appreciate any advice or if I could be pointed in the right direction. Ask away for clarification!
You can use the -r option of the diff command to recursively compare the contents of the two directories. This way you don't need all the file names on the command line; just the two top-level directory names.
It will give you missing files, newly added files, and the difference of changed files. Many things can be controlled by different options.
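For the two-directory case, a brief report of missing, extra, and changed files might look like this (the directory names are placeholders):
diff -rq reference-install/ current-install/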
If you mean you have a list of expected files somewhere, and only one directory to compare against it, then you can try the tree command. Create the list once using tree; then, at comparison time, run tree again on the directory and compare its output with the stored "expected output" using diff.
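A sketch of that tree-based workflow (the flags are from the Linux tree command and worth verifying on your system; expected.txt is a placeholder name):
tree -a -f -i --noreport path/to/your/directory > expected.txt    # stored once, e.g. at release time
tree -a -f -i --noreport path/to/your/directory | diff expected.txt -    # later: compare a fresh listing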
Do you have to use coreutils? If so:
Put your list in a file, say list.txt, with one file path per line.
comm -23 <(find path/to/your/directory -type f | sort) \
<(sort path/to/list.txt) \
> diff.txt
diff.txt will have one line per file in path/to/your/directory that is not in your list.
If you care about files in your list that are not in path/to/your/directory, do comm -13 with the same parameters.
Otherwise, you can also use sd (stream diff), which requires neither sorting nor process substitution and supports infinite streams, like so:
find path/to/your/directory -type f | sd 'cat path/to/list.txt' > diff.txt
And just invert the streams to get the second requirement, like so:
cat path/to/list.txt | sd 'find path/to/your/directory -type f' > diff.txt
Probably not that much of a benefit in this example other than succinctness, but still consider it; in some cases you won't be able to use comm, grep -F, or diff.
Here's a blogpost I wrote about diffing streams on the terminal, which introduces sd.

Need a way to gather total size of JAR files

I am new to more advanced bash commands. I need a way to count the size of external libraries in our codeline. There are a few main directories but I also have a spreadsheet with the actual locations of the libraries that need to be included.
I have fiddled with find and du but it is unclear to me how to specify multiple locations. Can I find the size of several hundred jars listed in the spreadsheet instead of approximating with the main directories?
Edit: I can now find the size of specific files. I had to export the Excel spreadsheet with the locations to a CSV. In PSPad I "joined lines" and copied and pasted that directly into the list_of_files slot (find list_of_files | xargs du -hc). I could not get find to use a file containing the locations separated by spaces/tabs/newlines.
Now I can't tell if replacing list_of_files with list_of_directories will work. It looks like it counts things twice e.g.
1.0M /folder/dummy/image1.jpg
1.0M /folder/dummy/image2.jpg
2.0M /folder/dummy
3.0M /folder/image3.jpg
7.0M /folder
14.0M total
This output is made up, but if it's counting like this then that is not what I want. The reason I suspect this is that the total I'm getting seems really high.
Do you mean...
find list_of_directories | xargs du -hc
Then, if you want to pipe to du exactly the files that are listed in the spreadsheet, you need a way to extract them. Is it a text file, or what format is it in?
find $(cat file) | xargs du -hc
might do it if they are in a text file as a list separated by spaces. You will probably have some issues regarding spaces in the filenames; you have to quote them.
for fn in $(find DIR1 DIR2 FILE1 -name '*.jar'); do du "$fn"; done | awk '{TOTAL += $1} END {print TOTAL}'
You can specify your files and directories in place of DIR1, DIR2, FILE1, etc. You can list their individual sizes by removing the piped awk command.
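For the spreadsheet-driven case, here is a sketch that totals exactly the files named in a list, one path per line (GNU du assumed; jars.txt is a placeholder for the exported column). Passing only files, never their parent directories, avoids counting anything twice:
tr '\n' '\0' < jars.txt | du -ch --files0-from=- | tail -n 1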

Resources