How to return an MD5 and SHA1 value for multiple files in a directory using BASH - bash

I am creating a BASH script to take a directory as an argument and return to std out a list of all files in that directory with both the MD5 and SHA1 value of the files present in that directory. The only files I'm interested in are those between 100 and 500K. So far I gotten this far. (Section of Script)
cd $1 &&
find . -type f -size +100k -size -500k -printf '%f \t %s \t' -exec md5sum {} \; |
awk '{printf "NAME:" " " $1 "\t" "MD5:" " " $3 "\t" "BYTES:" "\t" $2 "\n"}'
I'm getting a little confused when adding the Sha1 and obviously leaving something out.
Can anybody suggest a way to achieve this.
Ideally I'd like the script to format in the following way
Name Md5 SHA1
(With the relevant fields underneath)

Your awk printf bit is overly complicated. Try this:
find . -type f -printf "%f\t%s\t" -exec md5sum {} \; | awk '{ printf "NAME: %s MD5: %s BYTES: %s\n", $1, $3, $2 }'

Just read line by line the list of files outputted by find:
find . -type f |
while IFS= read -r l; do
echo "$(basename "$l") $(md5sum <"$l" | cut -d" " -f1) $(sha1sum <"$l" | cut -d" " -f1)"
done
It's better to use a zero separated stream:
find . -type f -print0 |
while IFS= read -r -d '' l; do
echo "$(basename "$l") $(md5sum <"$l" | cut -d" " -f1) $(sha1sum <"$l" | cut -d" " -f1)"
done
You could speed up something with xargs and multiple processes with -P option to xargs:
find . -type f -print0 |
xargs -0 -n1 sh -c 'echo "$(basename "$1") $(md5sum <"$1" | cut -d" " -f1) $(sha1sum <"$1" | cut -d" " -f1)"' --
Consider adding -maxdepth 1 to find if you are not interested in files in subdirectories recursively.
It's easy from xargs to go to -exec:
find . -type f -exec sh -c 'echo "$1 $(md5sum <"$1" | cut -d" " -f1) $(sha1sum <"$1" | cut -d" " -f1)"' -- {} \;
Tested on repl.
Add those -size +100k -size -500k args to find to limit the sizes.
The | cut -d" " -f1 is used to remove the - that is outputted by both md5sum and sha1sum. If there are no spaces in filenames, you could run a single cut process for the whole stream, so it should be slightly faster:
find . -type f -print0 |
xargs -0 -n1 sh -c 'echo "$(basename "$1") $(md5sum <"$1") $(sha1sum <"$1")"' -- |
cut -d" " -f1,2,5
I also think that running a single md5sum and sha1sum process maybe would be faster rather then spawning multiple separate processes for each file, but such method needs storing all the filenames somewhere. Below a bash array is used:
IFS=$'\n' files=($(find . -type f))
paste -d' ' <(
printf "%s\n" "${files[#]}") <(
md5sum "${files[#]}" | cut -d' ' -f1) <(
sha1sum "${files[#]}" | cut -d' ' -f1)

Your find is fine, you want to join the results of two of those, one for each hash. The command for that is join, which expects sorted inputs.
doit() { find -type f -size +100k -size -500k -exec $1 {} + |sort -k2; }
join -j2 <(doit md5sum) <(doit sha1sum)
and that gets you the raw data in sane environments. If you want pretty data, you can use the column utility:
join -j2 <(doit md5sum) <(doit sha1sum) | column -t
and add nice headers:
(echo Name Md5 SHA1; join -j2 <(doit md5sum) <(doit sha1sum)) | column -t
and if you're in an unclean environment where people put spaces in file names, protect against that by subbing in tabs for the field markers:
doit() { find -type f -size +100k -size -500k -exec $1 {} + \
| sed 's, ,\t,'| sort -k2 -t$'\t' ; }
join -j2 -t$'\t' <(doit md5sum) <(doit sha1sum) | column -ts$'\t'

Related

Finding most recent file from a list of directories from find command

I use find . -type d -name "Financials" to find all the directories called "Financials" under the current directory. Since I am on Mac, I can use the following (which I found from another stackoverflow question) to find the latest modified file in my current directory: find . -type f -print0 | xargs -0 stat -f "%m %N" | sort -rn | head -1 | cut -f2- -d" ". What I would like to do is find a way to pipe the results of the first command into the second command--i.e. to find the most recently modified file in each "Financials" directory. Is there a way to do this?
I think you could:
find . -type d -name "Financials" -print0 |
xargs -0 -I{} find {} -type f -print0 |
xargs -0 stat -f "%m %N" | sort -rn | head -1 | cut -f2- -d" "
But if you want separately for each dir, then... why not just loop it:
find . -type d -name "Financials" |
while IFS= read -r dir; do
echo "newest file in $dir is $(
find "$dir" -type f -print0 |
xargs -0 stat -f "%m %N" | sort -rn | head -1 | cut -f2- -d" "
)"
done
Nest the 2nd file+xargs inside a first find+xargs:
find . -type d -name "Financials" -print0 \
| xargs -0 sh -c '
for d in "$#"; do
find "$d" -type f -print0 \
| xargs -0 stat -f "%m %N" \
| sort -rn \
| head -1 \
| cut -f2- -d" "
done
' sh
Note the trailing "sh" in sh -c '...' sh -- that word becomes "$0" inside the shell script so the for-loop can iterate over all the directories.
A robust way that will also avoid problems with funny filenames that contain special characters is:
find all files within this particular subdirectory, and extract the inode number and modifcation time
$ find . -type f -ipath '*/Financials/*' -printf "%T# %i\n"
extract youngest file's inode number
$ ... | awk '($1>t){t=$1;i=$2}END{print i}'
search file information by inode number
$ find . -type f -inum "$inode" '*/Financials/*'
So this gives you:
$ inode="$(find . -type f -ipath '*/Financials/*' -printf "%T# %i\n" | awk '($1>t){t=$1;i=$2}END{print i}')"
$ find . -type f -inum "$inode" '*/Financials/*'

Getting a list of substring based unique filenames in an array

I have a directory my_dir with files having names like:
a_v5.json
a_v5.mapping.json
a_v5.settings.json
f_v39.json
f_v39.mapping.json
f_v39.settings.json
f_v40.json
f_v40.mapping.json
f_v40.settings.json
c_v1.json
c_v1.mapping.json
c_v1.settings.json
I'm looking for a way to get an array [a_v5, f_v40, c_v1] in bash. Here, array of file names with the latest version number is what I need.
Tried this: ls *.json | find . -type f -exec basename "{}" \; | cut -d. -f1, but it returns the results with files which are not of the .json extension.
You can use the following command if your filenames don't contain whitespace and special symbols like * or ?:
array=($(
find . -type f -iname \*.json |
sed -E 's|(.*/)*(.*_v)([0-9]+)\..*|\2 \3|' |
sort -Vr | sort -uk1,1 | tr -d ' '
))
It's ugly and unsafe. The following solution is longer but can handle all file names, even those with linebreaks in them.
maxversions() {
find -type f -iname \*.json -print0 |
gawk 'BEGIN { RS = "\0"; ORS = "\0" }
match($0, /(.*\/)*(.*_v)([0-9]+)\..*/, group) {
prefix = group[2];
version = group[3];
if (version > maxversion[prefix])
maxversion[prefix] = version
}
END {
for (prefix in maxversion)
print prefix maxversion[prefix]
}'
}
mapfile -d '' array < <(maxversions)
In both cases you can check the contents of array with declare -p array.
Arrays and bash string parsing.
declare -A tmp=()
for f in $SOURCE_DIR/*.json
do f=${f##*/} # strip path
tmp[${f%%.*}]=1 # strip extraneous data after . in filename
done
declare -a c=( $( printf "%s\n" "${!tmp[#]}" | cut -c 1 | sort -u ) ) # get just the first chars
declare -a lst=( $( for f in "${c[#]}"
do printf "%s\n" "${!tmp[#]}" |
grep "^${f}_" |
sort -n |
tail -1; done ) )
echo "[ ${lst[#]} ]"
[ a_v5 c_v1 f_v40 ]
Or, if you'd rather,
declare -a arr=( $(
for f in $SOURCE_DIR/*.json
do d=${f%/*} # get dir path
f=${f##*/} # strip path
g=${f:0:2} # get leading str
( cd $d && printf "%s\n" ${g}*.json |
sort -n | sed -n '$ { s/[.].*//; p; }' )
done | sort -u ) )
echo "[ ${arr[#]} ]"
[ a_v5 c_v1 f_v40 ]
This is one possible way to accomplish this :
arr=( $( { for name in $( ls {f,n,m}*.txt ); do echo ${name:0:1} ; done; } | sort | uniq ) )
Output :
$ echo ${arr[0]}
f
$ echo ${arr[1]}
m
$ echo ${arr[2]}
n
Regards!
AWK SOLUTION
This is not an elegant solution... my knowledge of awk is limited.
You should find this functional.
I've updated this to remove the obsolete uniq as suggested by #socowi.
I've also included the printf version as #socowi suggested.
ls *.json | cut -d. -f1 | sort -rn | awk -v last="xx" '$1 !~ last{ print $1; last=substr($1,1,3) }'
OR
printf %s\\n *.json | cut -d. -f1 | sort -rn | awk -v last="xx" '$1 !~ last{ print $1; last=substr($1,1,3) }'
Old understanding below
Find files with name matching pattern.
Now take the second field since your results will likely be similar to ./
find . -type f -iname "*.json" | cut -d. -f2
To get the unique headings....
find . -type f -iname "*.json" | cut -d. -f2 | sort | uniq

Perform a CAT in FOR and SSH

I do not have much experience in shell script, therefore I need your help. I have the following query, I need to make a CAT to the files that I list, but I have not managed to know where to place the command. Thank you:
read date
echo -e "RECORDINGS"
for e in $Rec
do
sshpass -p password ssh user#server find $e "-type f -mtime -10 -exec ls -gGh --full-time {} \;" | cut -d ' ' -f 4,7 | grep $date | awk -F " " '{print $2}'
done
Ignoring that much of the code here is outright dangerous --
find_on_server() {
local e_q
printf -v e_q '%q ' "$1"
sshpass -p password ssh user#server "bash -s $e_q" <<'EOF'
e=$1
find "$e" -type f -mtime -10 -exec ls -gGh --full-time {} \;
EOF
# ^^^ the above line MUST NOT BE INDENTED
}
find_on_server "$e" | cut -d ' ' -f 4,7 | grep $date | awk -F " " '{print $2}'

Why does "... >> out | sort -n -o out" not actually run sort?

As an exercise, I should find all .c files starting from my home directory, count the lines of each file and store the sorted output in sorted_statistics.txt, using find, wc, cut ad sort.
I found this command to work
find /home/user/ -type f -name "*.c" 2> /dev/null -exec wc -l {} \; | cut -f 1 -d " " | sort -n -o sorted_statistics.txt
but I can't understand why
find /home/user/ -type f -name "*.c" 2> /dev/null -exec wc -l {} \; | cut -f 1 -d " " >> sorted_statistics.txt | sort -n sorted_statistics.txt
stops just before the sort command.
Just out of curiosity, why is that?
You were appending everything to sorted_statistics.txt ( consuming all the output ) and then trying to use that none existing output in a pipe for sort. I have corrected your code so it works now.
find /home/user/ -type f -name "*.c" 2> /dev/null -exec wc -l {} \; | cut -f 1 -d " " >> tmp.txt && sort -n tmp.txt > sorted_statistics.txt
Regards!
This part of the command makes no sense:
cut -f 1 -d " " >> sorted_statistics.txt | sort ...
because the output of cut is appended to the file sorted_statistics.txt and no output at all goes to the sort command. You will probably want to use tee:
cut -f 1 -d " " | tee -a sorted_statistics.txt | sort ...
The tee command sends its input to a file and also to the standard output. It is like a Tee junction in a pipeline.

Getting file size in bytes with bash (Ubuntu)

Hi, i'm looking for a way to output a filesize in bytes. Whatever i try i will get either 96 or 96k instead of 96000.
if [[ -d $1 ]]; then
largestN=$(find $1 -depth -type f | tr '\n' '\0' | du -s --files0-from=- | sort | tail -n 1 | awk '{print $2}')
largestS=$(find $1 -depth -type f | tr '\n' '\0' | du -h --files0-from=- | sort | tail -n 1 | awk '{print $1}')
echo "The largest file is $largestN which is $largestS bytes."
else
echo "$1 is not a directory..."
fi
This prints "The largest file [file] is 96k bytes"
there is -b option for this
$ du -b ...
Looks like you're trying to find the largest file in a given directory. It's more efficient (and shorter) to let find do the heavy lifting for you:
find $1 -type f -printf '%s %p\n' | sort -n | tail -n1
Here, %s expands to the size in bytes of the file, and %p expands to the name of the file.

Resources