Comparing large numbers of files in Bash quickly

I downloaded many files (~10,000) from a website, most of which are useless html pages that all say the same thing. However, there are some files in this haystack that have useful information (and are thus fairly different files) and I need a quick way to separate those from the rest. I know I can go through all of the files one by one and use cmp to compare each to a template, see if they are the same, and then delete them. However, this is rather slow. Is there a faster way to do this? I don't mind if I only have a 99% recovery rate.
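One cheap pre-filter worth mentioning first (a sketch, assuming GNU stat and find; template.html is a placeholder name): a file whose size differs from the template's cannot be identical to it, so a size comparison weeds out most files before any byte-wise work:
# Sketch: size-based pre-filter; template.html is a hypothetical name
tsize=$(stat -c %s template.html)      # template size in bytes (GNU stat)
find . -type f ! -size "${tsize}c"     # these cannot be identical to the template
# Same-size files still need a content check, e.g. with cmp:
find . -type f -size "${tsize}c" -print0 | while IFS= read -r -d '' f; do
    cmp -s template.html "$f" || echo "differs: $f"
done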

This one lists the unique files in the tree passed as the argument:
#!/bin/bash
declare -A uniques
while IFS= read -r file; do
    [[ ! "${uniques[${file%% *}]}" ]] && uniques[${file%% *}]="${file##* }"
done < <(find "$1" -type f -exec md5sum -b "{}" \;)
for file in "${uniques[@]}"; do
    echo "$file"
done
Many thanks to triplee for the better approach using md5sum!
Previous version:
#!/bin/bash
declare -a files uniques
while IFS= read -r -d $'\0' file; do
    files[${#files[@]}]="$file"
done < <(find "$1" -type f -print0)
uniques=( "${files[@]}" )
for file in "${files[@]}"; do
    for unique in "${!uniques[@]}"; do
        [[ "$file" != "${uniques[$unique]}" ]] && cmp -s "$file" "${uniques[$unique]}" && unset -v "uniques[$unique]"
    done
done
for unique in "${uniques[@]}"; do
    echo "$unique"
done

Assuming all the files are in or below the current directory, and the template is in the parent directory, and the filenames have no spaces:
find . -type f -print | while read -r filename; do
    if ! cmp --quiet "$filename" ../template; then
        echo rm "$filename"
    fi
done
remove the "echo" if you're satisfied this works.

Related

How to write script for incremental backup in Ubuntu?

I want to implement incremental backup in Ubuntu, so I am thinking of computing the md5sum of every file in the source and the target: if two files have the same md5sum, keep the existing file in the destination; if they differ, copy the file from the source into the destination directory.
I am thinking of doing this in bash.
Can anyone help me with the commands to check the md5sums of two files in different directories?
Thanks in advance!!
#!/bin/bash
#
SOURCE="/home/pallavi/backup1"
DEST="/home/pallavi/BK"
count=1
TODAY=$(date +%F_%H%M%S)
cd "${DEST}" || exit 1
mkdir "${TODAY}"
while [ "$count" -le 1 ]; do
    count=$(( count + 1 ))
    cp -R "$SOURCE"/* "$DEST/$TODAY"
    mkdir "MD5"
    cd "${DEST}/${TODAY}"
    for f in *; do
        md5sum "${f}" > "${TODAY}${f}.md5"
        echo "${f}"
    done
    if [ $? -ne 0 ] && [[ $IGNORE_ERR -eq 0 ]]; then
        # error or eof
        echo "end of source or error"
        break
    fi
done
This is a reinventing-the-wheel sort of thing.
There are utilities written for exactly this kind of purpose; to name a few:
For synchronizing/copying trees
rsync
cp (GNU cp(1) has the -u flag)
For comparing files
cmp
diff
For finding duplicates
fdupes
rmlint
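For instance (a sketch under assumed paths source/ and destination/, matching the directories used below), rsync alone already gives incremental behavior:
# Sketch: incremental copy with rsync; paths are placeholders
rsync -a --checksum source/ destination/   # copy only files whose checksum differs
# or, with GNU cp, copy only files that are missing or newer in the source:
cp -Ru source/. destination/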
Here is what I've come up with, re-inventing the wheel:
#!/usr/bin/env bash
shopt -s extglob
declare -A source_array
while IFS= read -r -d '' files; do
    read -r source_hash source_files < <(sha512sum "$files")
    source_array["$source_hash"]="$source_files"
done < <(find source/ -type f -print0)
source=$( IFS='|'; printf '%s' "@(${!source_array[*]})" )
while IFS= read -r -d '' files0; do
    read -r destination_hash destination_files < <(sha512sum "$files0")
    if [[ $destination_hash == $source ]]; then
        echo "$destination_files" FOUND from source/ directory
    else
        echo "$destination_files" NOT-FOUND from source/ directory
    fi
done < <(find destination/ -type f -print0)
It should be safe enough with file names containing spaces, tabs and newlines, but since I don't have files with newlines in their names, I can't really say for sure.
Change the action in the if-else statement depending on what you want to do.
OK, maybe sha512sum is a bit of an overkill; change it to md5sum if you prefer.
Add set -x after the shebang to see what's actually being executed. Good luck.

How can I create an array that contains the names of all the files in a folder?

Given a folder (whose path my script gets as an argument), how can I create an array that will contain the names of all the files in this folder, and the files in any folder inside it, recursively?
I tried to do it like that :
#!/bin/bash
function get_all_the_files {
    for i in "${1}"/*; do
        if [ -d "$i" ]; then
            get_all_the_files ${1}
        else
            if [ -f "${i}" ]; then
                arrayNamesOfAllTheFiles=(${arrayNamesOfAllTheFiles[@]} "${i}")
            fi
        fi
    done
}
arrayNamesOfAllTheFiles=()
get_all_the_files folder
declare -p arrayNamesOfAllTheFiles
But it's not working. What is the problem and how can I fix it?
To stick with your design (looping on the files and inserting only the regular files, populating the array at each step) while having Bash perform the recursion via a glob, you can use the following:
# the globstar shell option enables the ** glob pattern for recursion
shopt -s globstar
# the nullglob shell option makes non-matching globs expand to nothing (recommended)
shopt -s nullglob
array=()
for file in /path/to/folder/**; do
    if [[ ! -h $file && -f $file ]]; then
        array+=( "$file" )
    fi
done
With the test [[ ! -h $file && -f $file ]] we check that the file is not a symbolic link and is a regular file (without the symlink test, you would also pick up symbolic links that resolve to a regular file).
You also learned about the array+=( "stuff" ) pattern to append to an array, instead of array=( "${array[#]}" "stuff" ).
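To make the difference concrete (a small illustration, not from the original answer):
array=( "${array[@]}" "stuff" )  # rebuilds the whole array on each append and renumbers indices
array+=( "stuff" )               # appends in place, leaving existing indices untouched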
Another possibility (with Bash ≥ 4.4 where the -d option of mapfile is implemented) and with GNU find (that supports the -print0 predicate):
mapfile -d '' array < <(find /path/to/folder -type f -print0)
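For example, to check the result (hypothetical path, continuing from the mapfile call above):
echo "${#array[@]} files collected"
printf '%s\n' "${array[@]}"   # one name per line (names containing newlines will span lines)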
You almost had it right. There is a small typo in the recursive call:
if [ -d "$i" ]; then
get_all_the_files ${1}
else
should be
if [ -d "$i" ]; then
get_all_the_files ${i}
else
I will add that use of arrays like this in bash is very unidiomatic. If you are trying to work with recursive trees of files, it's more usual to use tools like find and xargs.
find . -type f -print0 | xargs -0 command-or-script-to-run-on-each-file

Rename files into sequential order when some are missing

I have a bunch of jpg files in a folder named 1.jpg, 2.jpg, 4.jpg, 5.jpg, 8.jpg, 9.jpg and want to rename them to remove the gaps in the sequential order but keep them in the same order.
I've tried:
REORDER=1
for f in *.jpg
do
    printf "Moving "$f"\n"
    mv -n "$f" "$(date -r "$f" +"$REORDER").jpg"
    printf "Moved to "$REORDER"\n"
    ((REORDER++))
done
But that seems to misbehave and start doing odd things like looping around and renaming 1.jpg again!
Is there a better way to do this without losing the original order of the files?
You can sort all files numerically and then read them one by one and rename:
declare -i index=1
while IFS= read -r -d '' file; do
    mv "$file" "$index.jpg"
    index=index+1
done < <(find -type f -printf '%f\0' | sort -zn)
Note that the following likely fails if you have newlines in your filenames.
a=( *.jpg ) IFS=$'\n' a=( $(sort -n <<<"${a[*]}") )
for i in "${!a[@]}"; do mv -v "${a[$i]}" "$((i+1)).jpg"; done
This first builds and sorts an array of your files.
Then it walks through that array (whose first index is zero) and renames each file to include the index plus one.
It relies on the fact that bash non-associative arrays maintain index order.
If your filenames contain embedded spaces, don't use this answer. Otherwise it will work fine.
I'm not sure what the point of the call to date is in your script, but this script works for me:
#!/bin/bash
REORDER=1
find . -name '*.jpg' -printf "%f\n" | sort -n | while read f
do
    DEST="$REORDER.jpg"
    if [ "$DEST" != "$f" ]
    then
        mv "$f" "$DEST"
    fi
    ((REORDER++))
done
Note that you have to use find because you need to sort the output numerically. If you don't do this, 7.jpg will be processed after 79.jpg is.
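A quick illustration of why the numeric sort matters:
printf '%s\n' 7.jpg 79.jpg 8.jpg | sort     # lexicographic: 7.jpg 79.jpg 8.jpg
printf '%s\n' 7.jpg 79.jpg 8.jpg | sort -n  # numeric: 7.jpg 8.jpg 79.jpg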

find command with filename coming from bash printf builtin not working

I'm trying to write a script which lists the files in one directory and then searches for every file, one by one, in another directory. For dealing with spaces and special characters like "[" or "]" I'm using $(printf %q "$FILENAME") as input for the find command: find /directory/to/search -type f -name $(printf %q "$FILENAME").
It works like a charm for every filename except in one case: when there are multibyte characters (UTF-8). In that case the output of printf is an ANSI-C quoted string, i.e.: $'file name with blank spaces and quoted characters in the form of \NNN\NNN', and that string is not expanded the way literal $'' quoting would be, so find searches for a file with a name that includes the quoting itself: «$'filename'».
Is there an alternative solution in order to be able to pass any kind of filename to find?
My script is as follows (I know some lines can be deleted, like the "RESNAME="):
#!/bin/bash
if [ -d $1 ] && [ -d $2 ]; then
    IFSS=$IFS
    IFS=$'\n'
    FILES=$(find $1 -type f)
    for FILE in $FILES; do
        BASEFILE=$(printf '%q' "$(basename "$FILE")")
        RES=$(find $2 -type f -name "$BASEFILE" -print)
        if [ ${#RES} -gt 1 ]; then
            RESNAME=$(printf '%q' "$(basename "$RES")")
        else
            RESNAME=
        fi
        if [ "$RESNAME" != "$BASEFILE" ]; then
            echo "FILE NOT FOUND: $FILE"
        fi
    done
else
    echo "Directories do not exist"
fi
IFS=$IFSS
As an answer said, I've used associative arrays, but with no luck; maybe I'm not using the arrays correctly, but echoing them (${array[@]}) returns nothing. This is the script I've written:
#!/bin/bash
if [ -d "$1" ] && [ -d "$2" ]; then
    declare -A files
    find "$2" -type f -print0 | while read -r -d $'\0' FILE;
    do
        BN2="$(basename "$FILE")"
        files["$BN2"]="$BN2"
    done
    echo "${files[@]}"
    find "$1" -type f -print0 | while read -r -d $'\0' FILE;
    do
        BN1="$(basename "$FILE")"
        if [ "${files["$BN1"]}" != "$BN1" ]; then
            echo "File not found: $BN1"
        fi
    done
fi
Don't use for loops. First, it is slower: your find has to complete before the rest of your program can run. Second, it is possible to overload the command line: the entire for command must fit in the command-line buffer.
Most importantly of all, for is bad at handling funky file names. You end up going through conniptions trying to get around this. However:
find "$1" -type f -print0 | while read -r -d $'\0' FILE
will work much better. It handles file names safely -- even file names that contain \n characters. The -print0 tells find to separate file names with the NUL character, and while read -r -d $'\0' FILE reads each file name (separated by the NUL character) into $FILE.
If you put quotes around the file name in the find command, you don't have to worry about special characters in the file names.
Your script is running find once for each file found. If you have 100 files in your first directory, you're running find 100 times.
Do you know about associative (hash) arrays in BASH? You are probably better off using associative arrays. Run find on the first directory, and store those files names in an associative array.
Then, run find (again using the find | while read syntax) for your second directory. For each file you find in the second directory, see if you have a matching entry in your associative array. If you do, you know that file is in both directories.
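A minimal sketch of that idea (dir1 and dir2 are placeholders; fuller versions appear in the answers below):
declare -A seen
while IFS= read -r -d '' f; do
    seen["${f##*/}"]=1                 # key on the basename
done < <(find dir1 -type f -print0)
while IFS= read -r -d '' f; do
    [[ ${seen["${f##*/}"]} ]] && echo "in both: ${f##*/}"
done < <(find dir2 -type f -print0)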
Addendum
I've been looking at the find command. It appears there's no real way to prevent it from using pattern matching except through a lot of work (like you were doing with printf). I've tried using -regex matching and \Q and \E to remove the special meaning of pattern characters. I haven't been successful.
There comes a time that you need something a bit more powerful and flexible than shell to implement your script, and I believe this is the time.
Perl, Python, and Ruby are three fairly ubiquitous scripting languages found on almost all Unix systems and are available on other non-POSIX platforms (cough! ...Windows!... cough!).
Below is a Perl script that takes two directories, and searches them for matching files. It uses the find command once and uses associative arrays (called hashes in Perl). I key the hash to the name of my file. In the value portion of the hash, I store an array of the directories where I found this file.
I only need to run the find command once per directory. Once that is done, I can print out all the entries in the hash that contain more than one directory.
I know it's not shell, but this is one of the cases where you can spend a lot more time trying to figure out how to get shell to do what you want than it's worth.
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use File::Find;
use constant DIRECTORIES => qw( dir1 dir2 );
my %files;
#
# Perl version of the find command. You give it a list of
# directories and a subroutine for filtering what you find.
# I am basically rejecting all non-file entries, then pushing
# them into my %files hash as an array.
#
find(
    sub {
        return unless -f;
        $files{$_} = [] if not exists $files{$_};
        push @{ $files{$_} }, $File::Find::dir;
    }, DIRECTORIES
);
#
# All files are found and in %files hash. I can then go
# through all the entries in my hash, and look for ones
# with more than one directory in the array reference.
# If there is more than one, the file is located in multiple
# directories, and I print them.
#
for my $file ( sort keys %files ) {
    if ( @{ $files{$file} } > 1 ) {
        say "File: $file: " . join ", ", @{ $files{$file} };
    }
}
Try something like this:
find "$DIR1" -printf "%f\0" | xargs -0 -i find "$DIR2" -name \{\}
How about this one-liner?
find dir1 -type f -exec bash -c 'read < <(find dir2 -name "${1##*/}" -type f)' _ {} \; -printf "File %f is in dir2\n" -o -printf "File %f is not in dir2\n"
Absolutely 100% safe regarding files with funny symbols, newlines and spaces in their name.
How does it work?
find (the main one) will scan through directory dir1 and for each file (-type f) will execute
read < <(find dir2 -name "${1##*/}" -type f)
with, as its argument, the name of the current file given by the main find. This argument is at position $1. The ${1##*/} removes everything up to the last / so that if $1 is path/to/found/file the inner find statement is:
find dir2 -name "file" -type f
This outputs something if file is found, otherwise has no output. That's what is read by the read bash command. read's exit status is true if it was able to read something, and false if there wasn't anything read (i.e., in case nothing is found). This exit status becomes bash's exit status which becomes -exec's status. If true, the next -printf statement is executed, and if false, the -o -printf part will be executed.
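You can see read's exit-status behavior directly (a tiny illustration):
read -r line < <(echo something); echo $?   # 0: something was read
read -r line < <(printf '');     echo $?    # 1: EOF before anything was read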
If your dirs are given in variables $dir1 and $dir2 do this, so as to be safe regarding spaces and funny symbols that could occur in $dir2:
find "$dir1" -type f -exec bash -c 'read < <(find "$0" -name "${1##*/}" -type f)' "$dir2" {} \; -printf "File %f is in $dir2\n" -o -printf "File %f is not in $dir2\n"
Regarding efficiency: this is of course not an efficient method at all! the inner find will be executed as many times as there are found files in dir1. This is terrible, especially if the directory tree under dir2 is deep and has many branches (you can rely a little bit on caching, but there are limits!).
Regarding usability: you have fine-grained control on how both find's work and on the output, and it's very easy to add many more tests.
So, hey, tell me how to compare files from two directories? Well, if you agree to lose a little bit of control, this will be the shortest and most efficient answer:
diff dir1 dir2
Try it, you'll be amazed!
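If you only want a recursive summary of which files differ or are missing (standard diff options, shown as a hint rather than a full solution):
diff -rq dir1 dir2   # -r recurses into subdirectories, -q only reports which files differ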
Since you are only using find for its recursive directory following, it will be easier to simply use the globstar option in bash. (You're using associative arrays, so your bash is new enough).
#!/bin/bash
shopt -s globstar
declare -A files
if [[ -d $1 && -d $2 ]]; then
    for f in "$2"/**/*; do
        [[ -f "$f" ]] || continue
        BN2=$(basename "$f")
        files["$BN2"]=$BN2
    done
    echo "${files[@]}"
    for f in "$1"/**/*; do
        [[ -f "$f" ]] || continue
        BN1=$(basename "$f")
        if [[ ${files[$BN1]} != "$BN1" ]]; then
            echo "File not found: $BN1"
        fi
    done
fi
** will match zero or more directories, so $1/**/* will match all the files and directories in $1, all the files and directories in those directories, and so forth all the way down the tree.
If you want to use associative arrays, here's one possibility that will work well with files with all sorts of funny symbols in their names (this script does more than needed, just to show the possibilities, but it is usable as is – just remove the parts you don't want and adapt to your needs):
#!/bin/bash

die() {
    printf "%s\n" "$@"
    exit 1
}

[[ -n $1 ]] || die "Must give two arguments (none found)"
[[ -n $2 ]] || die "Must give two arguments (only one given)"

dir1=$1
dir2=$2

[[ -d $dir1 ]] || die "$dir1 is not a directory"
[[ -d $dir2 ]] || die "$dir2 is not a directory"

declare -A dir1files
declare -A dir2files

while IFS= read -r -d '' file; do
    dir1files[${file##*/}]=1
done < <(find "$dir1" -type f -print0)

while IFS= read -r -d '' file; do
    dir2files[${file##*/}]=1
done < <(find "$dir2" -type f -print0)

# Which files in dir1 are in dir2?
for i in "${!dir1files[@]}"; do
    if [[ -n ${dir2files[$i]} ]]; then
        printf "File %s is both in %s and in %s\n" "$i" "$dir1" "$dir2"
        # Remove it from the dir2 hash
        unset dir2files["$i"]
    else
        printf "File %s is in %s but not in %s\n" "$i" "$dir1" "$dir2"
    fi
done

# Which files in dir2 are not in dir1?
# Since I unset them from the dir2files hash table, the only keys remaining
# correspond to files in dir2 but not in dir1
for i in "${!dir2files[@]}"; do
    printf "File %s is in %s but not in %s\n" "$i" "$dir2" "$dir1"
done
Remark. The identification of files is based only on their filenames, not their contents.
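If you want content-based identification instead (a hedged variant of the same script, assuming md5sum is available), key the hashes on a checksum rather than the basename:
# Sketch: key on file content instead of file name
while IFS= read -r -d '' file; do
    read -r hash _ < <(md5sum "$file")
    dir1files[$hash]=$file
done < <(find "$dir1" -type f -print0)
# ...and likewise for dir2; the membership tests then compare contents, not names.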

How do I copy directory structure containing placeholders

I have a situation where a template directory - containing files and links (!) - needs to be copied recursively to a destination directory, preserving all attributes. The template directory contains any number of placeholders (__NOTATION__) that need to be renamed to certain values.
For example template looks like this:
./template/__PLACEHOLDER__/name/__PLACEHOLDER__/prog/prefix___FILENAME___blah.txt
Destination becomes like this:
./destination/project1/name/project1/prog/prefix_customer_blah.txt
What I tried so far is this:
# first create dest directory structure
while read line; do
    dest="$(echo "$line" | sed -e 's#__PLACEHOLDER__#project1#g' -e 's#__FILENAME__#customer#g' -e 's#template#destination#')"
    if ! [ -d "$dest" ]; then
        mkdir -p "$dest"
    fi
done < <(find ./template -type d)
# now copy files
while read line; do
    dest="$(echo "$line" | sed -e 's#__PLACEHOLDER__#project1#g' -e 's#__FILENAME__#customer#g' -e 's#template#destination#')"
    cp -a "$line" "$dest"
done < <(find ./template -type f)
However, I realized that if I want to take care about permissions and links, this is going to be endless and very complicated. Is there a better way to replace __PLACEHOLDER__ with "value", maybe using cp, find or rsync?
I suspect that your script will already do what you want, if only you replace
find ./template -type f
with
find ./template ! -type d
Otherwise, the obvious solution is to use cp -a to make an "archive" copy of the template, complete with all links, permissions, etc, and then rename the placeholders in the copy.
cp -a ./template ./destination
while read path; do
    dir=`dirname "$path"`
    file=`basename "$path"`
    mv -v "$path" "$dir/${file//__PLACEHOLDER__/project1}"
done < <(find ./destination -depth -name '*__PLACEHOLDER__*')
Note that you'll want to use -depth or else renaming files inside renamed directories will break.
If it's very important to you that the directory tree is created with the names already changed (i.e. you must never see placeholders in the destination), then I'd recommend simply using an intermediate location.
First copy with rsync, preserving all the properties and links etc.
Then change the placeholder strings in the destination filenames:
#!/bin/bash
TEMPL="$PWD/template" # somewhere else
DEST="$PWD/dest"      # wherever it is
mkdir "$DEST"
(cd "$TEMPL"; rsync -Hra . "$DEST")
MyRen=$(mktemp)
trap "rm -f $MyRen" 0 1 2 3 13 15
cat >$MyRen <<'EOF'
#!/bin/bash
fn="$1"
newfn="$(echo "$fn" | sed -e 's#__PLACEHOLDER__#project1#g' -e 's#__FILENAME__#customer#g' -e 's#template#destination#')"
test "$fn" != "$newfn" && mv "$fn" "$newfn"
EOF
chmod +x $MyRen
find "$DEST" -depth -execdir $MyRen {} \;
