Get second part of output separated by two spaces - bash

I have this script
#!/bin/bash
path=$1
find "$path" -type f -exec sha1sum {} \; | sort | uniq -D -w 32
It outputs this:
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1-1.txt
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1.txt
ffc752244b634abb4ed68d280dc74ec3152c4826 ./dups/subdups/dup2-2.txt
ffc752244b634abb4ed68d280dc74ec3152c4826 ./dups/subdups/dup2.txt
Now I only want to save the last part (the path) in an array.
When I add this after the sort
| awk -F " " '{ print $1 }'
I get this as output:
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
ffc752244b634abb4ed68d280dc74ec3152c4826
ffc752244b634abb4ed68d280dc74ec3152c4826
When I change the $1 to $2, I get nothing, but I want to get the path of the file.
How should I do this?
EDIT:
This script
#!/bin/bash
path=$1
find "$path" -type f -exec sha1sum {} \; | awk '{ print $1 }' | sort | uniq -D -w 32
Outputs this
parallels@mbp:~/bin$ duper ./dups
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
ffc752244b634abb4ed68d280dc74ec3152c4826
ffc752244b634abb4ed68d280dc74ec3152c4826
When I change it to $2 it outputs this
parallels@mbp:~/bin$ duper ./dups
parallels@mbp:~/bin$
Expected Output
./dups/dup1-1.txt
./dups/dup1.txt
./dups/subdups/dup2-2.txt
./dups/subdups/dup2.txt
There are also some files in the directory that are not duplicates of each other, such as nodup1.txt and nodup2.txt. That's why they don't show up in the output.

Change your find command to this:
find "$path" -type f -exec sha1sum {} \; | uniq -D -w 41 | awk '{print $2}' | sort
I moved the uniq as the first filter and it is taking into consideration just the first 41 characters, aiming to match just the sha1sum hash.
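If you then want those paths in a bash array, as the question asks, one way is to capture that pipeline with mapfile. A sketch, assuming bash 4+, GNU uniq, and filenames without spaces or newlines (the awk field split would truncate those):
# collect the duplicate paths into an array
mapfile -t dupes < <(
  find "$path" -type f -exec sha1sum {} \; | sort | uniq -D -w 41 | awk '{print $2}'
)
printf 'found %d duplicate files\n' "${#dupes[@]}"
printf '%s\n' "${dupes[@]}"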

You can achieve the same result by piping to tr and then cut:
echo '3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1-1.txt' |\
tr -s ' ' | cut -d ' ' -f 2
Outputs:
./dups/dup1-1.txt
-s ' ' tells tr to squeeze runs of spaces down to a single space
-d ' ' -f 2 tells cut to output the second field, using a space as the delimiter
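Plugged into the full pipeline from the question it would look something like this (a sketch; -f 2- instead of -f 2 so a filename that itself contains a space is not cut short):
find "$path" -type f -exec sha1sum {} \; | sort | uniq -D -w 32 |
tr -s ' ' | cut -d ' ' -f 2-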

I like to use cut for stuff like this. With this input:
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1-1.txt
I'd do cut -d ' ' -f 2 which should return:
./dups/dup1-1.txt
I haven't tested it though for your case.
EDIT: Gonzalo Matheu's answer is better, as it removes the extra spaces between the fields before doing the cut. sha1sum separates the hash from the filename with two spaces, so a plain cut -d ' ' -f 2 would actually give you the empty field between them.

Related

How to return an MD5 and SHA1 value for multiple files in a directory using BASH

I am creating a BASH script to take a directory as an argument and return to stdout a list of all files in that directory with both the MD5 and SHA1 values of the files present in that directory. The only files I'm interested in are those between 100K and 500K. This is as far as I've gotten. (Section of script)
cd $1 &&
find . -type f -size +100k -size -500k -printf '%f \t %s \t' -exec md5sum {} \; |
awk '{printf "NAME:" " " $1 "\t" "MD5:" " " $3 "\t" "BYTES:" "\t" $2 "\n"}'
I'm getting a little confused when adding the SHA1 and am obviously leaving something out.
Can anybody suggest a way to achieve this?
Ideally I'd like the script to format in the following way
Name Md5 SHA1
(With the relevant fields underneath)
Your awk printf bit is overly complicated. Try this:
find . -type f -printf "%f\t%s\t" -exec md5sum {} \; | awk '{ printf "NAME: %s MD5: %s BYTES: %s\n", $1, $3, $2 }'
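The question also asks for the SHA1. One way to bolt it onto the same -printf approach is to compute both hashes in a small sh -c wrapper; this is only a sketch (the field order and labels are my own, and md5sum/sha1sum read from stdin here so their output contains no filename to strip):
find . -type f -size +100k -size -500k -printf '%f\t%s\t' \
  -exec sh -c 'printf "%s\t%s\n" "$(md5sum < "$1" | cut -d" " -f1)" "$(sha1sum < "$1" | cut -d" " -f1)"' -- {} \; |
awk -F'\t' '{ printf "NAME: %s MD5: %s SHA1: %s BYTES: %s\n", $1, $3, $4, $2 }'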
Just read line by line the list of files outputted by find:
find . -type f |
while IFS= read -r l; do
echo "$(basename "$l") $(md5sum <"$l" | cut -d" " -f1) $(sha1sum <"$l" | cut -d" " -f1)"
done
It's better to use a zero separated stream:
find . -type f -print0 |
while IFS= read -r -d '' l; do
echo "$(basename "$l") $(md5sum <"$l" | cut -d" " -f1) $(sha1sum <"$l" | cut -d" " -f1)"
done
You could speed things up with xargs, and run multiple processes in parallel by adding the -P option to xargs:
find . -type f -print0 |
xargs -0 -n1 sh -c 'echo "$(basename "$1") $(md5sum <"$1" | cut -d" " -f1) $(sha1sum <"$1" | cut -d" " -f1)"' --
Consider adding -maxdepth 1 to find if you are not interested in files in subdirectories recursively.
It's easy from xargs to go to -exec:
find . -type f -exec sh -c 'echo "$1 $(md5sum <"$1" | cut -d" " -f1) $(sha1sum <"$1" | cut -d" " -f1)"' -- {} \;
Tested on repl.
Add those -size +100k -size -500k args to find to limit the sizes.
The | cut -d" " -f1 is used to remove the - that is outputted by both md5sum and sha1sum. If there are no spaces in filenames, you could run a single cut process for the whole stream, so it should be slightly faster:
find . -type f -print0 |
xargs -0 -n1 sh -c 'echo "$(basename "$1") $(md5sum <"$1") $(sha1sum <"$1")"' -- |
cut -d" " -f1,2,5
I also think that running a single md5sum and a single sha1sum process would probably be faster than spawning separate processes for each file, but that method needs the filenames stored somewhere. Below, a bash array is used:
IFS=$'\n' files=($(find . -type f))
paste -d' ' <(
printf "%s\n" "${files[#]}") <(
md5sum "${files[#]}" | cut -d' ' -f1) <(
sha1sum "${files[#]}" | cut -d' ' -f1)
Your find is fine, you want to join the results of two of those, one for each hash. The command for that is join, which expects sorted inputs.
doit() { find -type f -size +100k -size -500k -exec $1 {} + |sort -k2; }
join -j2 <(doit md5sum) <(doit sha1sum)
and that gets you the raw data in sane environments. If you want pretty data, you can use the column utility:
join -j2 <(doit md5sum) <(doit sha1sum) | column -t
and add nice headers:
(echo Name Md5 SHA1; join -j2 <(doit md5sum) <(doit sha1sum)) | column -t
and if you're in an unclean environment where people put spaces in file names, protect against that by subbing in tabs for the field markers:
doit() { find -type f -size +100k -size -500k -exec $1 {} + \
| sed 's, ,\t,'| sort -k2 -t$'\t' ; }
join -j2 -t$'\t' <(doit md5sum) <(doit sha1sum) | column -ts$'\t'

Getting a list of substring based unique filenames in an array

I have a directory my_dir with files having names like:
a_v5.json
a_v5.mapping.json
a_v5.settings.json
f_v39.json
f_v39.mapping.json
f_v39.settings.json
f_v40.json
f_v40.mapping.json
f_v40.settings.json
c_v1.json
c_v1.mapping.json
c_v1.settings.json
I'm looking for a way to get an array [a_v5, f_v40, c_v1] in bash. Here, an array of the file name prefixes with the latest version number for each is what I need.
I tried this: ls *.json | find . -type f -exec basename "{}" \; | cut -d. -f1, but it returns results for files that are not of the .json extension.
You can use the following command if your filenames don't contain whitespace and special symbols like * or ?:
array=($(
find . -type f -iname \*.json |
sed -E 's|(.*/)*(.*_v)([0-9]+)\..*|\2 \3|' |
sort -Vr | sort -uk1,1 | tr -d ' '
))
It's ugly and unsafe. The following solution is longer but can handle all file names, even those with linebreaks in them.
maxversions() {
find -type f -iname \*.json -print0 |
gawk 'BEGIN { RS = "\0"; ORS = "\0" }
match($0, /(.*\/)*(.*_v)([0-9]+)\..*/, group) {
prefix = group[2];
version = group[3];
if (version + 0 > maxversion[prefix] + 0)
maxversion[prefix] = version
}
END {
for (prefix in maxversion)
print prefix maxversion[prefix]
}'
}
mapfile -d '' array < <(maxversions)
In both cases you can check the contents of array with declare -p array.
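For the example files in the question the result should look roughly like this (element order may differ between the two approaches):
$ declare -p array
declare -a array=([0]="a_v5" [1]="c_v1" [2]="f_v40")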
Arrays and bash string parsing.
declare -A tmp=()
for f in $SOURCE_DIR/*.json
do f=${f##*/} # strip path
tmp[${f%%.*}]=1 # strip extraneous data after . in filename
done
declare -a c=( $( printf "%s\n" "${!tmp[@]}" | cut -c 1 | sort -u ) ) # get just the first chars
declare -a lst=( $( for f in "${c[#]}"
do printf "%s\n" "${!tmp[@]}" |
grep "^${f}_" |
sort -n |
tail -1; done ) )
echo "[ ${lst[#]} ]"
[ a_v5 c_v1 f_v40 ]
Or, if you'd rather,
declare -a arr=( $(
for f in $SOURCE_DIR/*.json
do d=${f%/*} # get dir path
f=${f##*/} # strip path
g=${f:0:2} # get leading str
( cd $d && printf "%s\n" ${g}*.json |
sort -n | sed -n '$ { s/[.].*//; p; }' )
done | sort -u ) )
echo "[ ${arr[#]} ]"
[ a_v5 c_v1 f_v40 ]
This is one possible way to accomplish this:
arr=( $( { for name in $( ls {f,n,m}*.txt ); do echo ${name:0:1} ; done; } | sort | uniq ) )
Output:
$ echo ${arr[0]}
f
$ echo ${arr[1]}
m
$ echo ${arr[2]}
n
Regards!
AWK SOLUTION
This is not an elegant solution... my knowledge of awk is limited.
You should find this functional.
I've updated this to remove the obsolete uniq as suggested by @socowi.
I've also included the printf version as @socowi suggested.
ls *.json | cut -d. -f1 | sort -rn | awk -v last="xx" '$1 !~ last{ print $1; last=substr($1,1,3) }'
OR
printf %s\\n *.json | cut -d. -f1 | sort -rn | awk -v last="xx" '$1 !~ last{ print $1; last=substr($1,1,3) }'
Old understanding below
Find files with name matching pattern.
Now take the second dot-delimited field, since your results will likely start with ./
find . -type f -iname "*.json" | cut -d. -f2
To get the unique headings....
find . -type f -iname "*.json" | cut -d. -f2 | sort | uniq

grep search with filename as parameter

I'm working on a shell script.
OUT=$1
Here, the OUT variable is my filename.
I'm using grep search as follows:
l=`grep "$pattern " -A 15 $OUT | grep -w $i | awk '{print $8}'|tail -1 | tr '\n' ','`
The issue is that the filename parameter I must pass is test.log. However, I have this folder structure:
test.log
test.log.001
test.log.002
I would ideally like to pass the filename as test.log and have it search all the log files. I know the usual way to do this is by using test.log.* on the command line, but I'm having difficulty replicating that in the shell script.
My efforts:
var=$'.*'
l=`grep "$pattern " -A 15 $OUT$var | grep -w $i | awk '{print $8}'|tail -1 | tr '\n' ','`
However, I did not get the desired result.
Hopefully this will get you closer:
#!/bin/bash
for f in "${1}*"; do
grep "$pattern" -A15 "$f"
done | grep -w $i | awk 'END{print $8}'
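Since grep accepts several files at once, another option is to let the shell expand the glob and hand all the matching logs to a single grep. A sketch reusing $pattern and $i from your original command, with -h added so the filename prefixes grep would otherwise print don't shift the awk fields:
l=$(grep -h "$pattern " -A 15 "$OUT"* | grep -w "$i" | awk '{print $8}' | tail -1 | tr '\n' ',')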

Count the number of whitespaces in a file

File test
musically us
challenged a goat that day
spartacus was his name
ba ba ba blacksheep
grep -oic "[\s]*" test
grep -oic "[ ]*" test
grep -oic "[\t]*" test
grep -oic "[\n]*" test
All give me 4, when I expect 11
grep --version -> grep (BSD grep) 2.5.1-FreeBSD
Running this on OSX Sierra 10.12
Repeating spaces should not be counted as one space.
If you are open to tricks and alternatives you might like this one:
$ awk '{print --NF}' <(tr -d '\n' <file)
11
The above solution counts "whitespace" between words. As a result, for a string like 'fifteen--> <--spaces', awk will measure 1, just like grep.
If you need to count actual single spaces you can use this:
$ awk -F"[ ]" '{print --NF}' <<<"fifteen--> <--spaces"
15
$ awk -F"[ ]" '{print --NF}' <<<" 2 4 6 8 10"
10
$ awk -F"[ ]" '{print --NF}' <(tr -d '\n' <file)
11
One step forward, to count single spaces and tabs:
$ awk -F"[ ]|\t" '{print --NF}' <(echo -e " 2 4 6 8 10\t12 14")
13
tr is generally better for this (in most cases):
tr -d -C ' ' <file | wc -c
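With the test file from the question this gives the expected count (BSD wc pads the number with leading spaces):
$ tr -d -C ' ' <test | wc -c
11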
The grep approach relies on the output of grep -o being newline-separated and counts one line per match, so it fails in cases like the following, where a run of several spaces is counted as a single match:
v='fifteen--> <--spaces'
echo "$v" | grep -o -E ' +' | wc -l
echo "$v" | tr -d -C ' ' | wc -c
grep only returns 1, when it should be 15.
EDIT: If you wanted to count multiple characters (e.g. TAB and SPACE) you could use:
tr -d -C $' \t' <<< $'one \t' | wc -c
Just use awk:
$ awk -v RS=' ' 'END{print NR-1}' file
11
or if you want to handle empty files gracefully:
$ awk -v RS=' ' 'END{print NR - (NR?1:0)}' /dev/null
0
The -c option counts the number of lines that match, not individual matches. Use grep -o and then pipe to wc -l, which will count the number of lines.
grep -o ' ' test | wc -l
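With the question's test file this also gives the expected result:
$ grep -o ' ' test | wc -l
11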

bash script, list all the files over a specific size

so I have some code like this:
result=`find . -type f -size -1000c -print0 | xargs -0 ls -Sh | head`
for i in $result; do
item=`wc -c $i`
echo $item1
done
This will print out all the files in the current folder that are smaller than 1000 bytes, in a format like:
size_of the file ./name_of_the_file
but I want to get rid of the "./" prefix, so I tried to use "cut".
I want to do something like:
for i in $result; do
item=`wc -c $i`
item1=`cut -f 1 $item` // this gives me the size
item2=`cut -c 7- $item` // this gives me all the character after ./
echo item1, item2 // now make it print
done
but I'm getting an error like:
cut: 639: No such file or directory
Can anyone please give me a hint on this? I appreciate it.
Don't use cut when you can use bash variable expansion operators.
for i in $result; do
i=$(echo $i | cut -c3-) # remove ./ prefix
size=$(wc -c < $i)
echo $size, $i
done
To use cut with a variable, you have to echo the variable to a pipe, because cut processes a file or stdin (like most Unix filters).
By redirecting the wc input instead of giving the filename as an argument, it just prints the size, not the size and the filename, so there's no need to cut its output.
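For example, with a hypothetical file whose size happens to be the 639 from your error message:
$ wc -c file.txt
639 file.txt
$ wc -c < file.txt
639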
This is a bit more concise:
find ./ -type f -size -1000c -ls | sed -e 's/\.\///' | awk -e '{ print $7, $11 }'
Edited:
for i in $result; do
item=`wc -c $i`
item1=`echo $item | cut -d" " -f1` # this gives me the size
item2=`echo $item | cut -d" " -f2-` # this gives me everything after the size
echo $item1, $item2 # now make it print
done
As Barmar mentioned, you were passing the entire string to cut as if it were a filename. The -d option specifies which character the string is split on, so assuming the size is separated from the rest by at least one space, the above should give you the $item1 and $item2 you wanted.
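A quick way to see the field splitting in action (the size and name here are hypothetical):
$ item='639 file.txt'
$ echo $item | cut -d" " -f1
639
$ echo $item | cut -d" " -f2-
file.txt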
