Getting a list of substring-based unique filenames in an array - bash

I have a directory my_dir with files having names like:
a_v5.json
a_v5.mapping.json
a_v5.settings.json
f_v39.json
f_v39.mapping.json
f_v39.settings.json
f_v40.json
f_v40.mapping.json
f_v40.settings.json
c_v1.json
c_v1.mapping.json
c_v1.settings.json
I'm looking for a way to get an array [a_v5, f_v40, c_v1] in bash; that is, an array of the file-name prefixes with the latest version number for each.
I tried ls *.json | find . -type f -exec basename "{}" \; | cut -d. -f1, but it returns results that include files without the .json extension.

You can use the following command if your filenames don't contain whitespace and special symbols like * or ?:
array=($(
find . -type f -iname \*.json |
sed -E 's|(.*/)*(.*_v)([0-9]+)\..*|\2 \3|' |
sort -Vr | sort -uk1,1 | tr -d ' '
))
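To see why this works, here is roughly what the stream looks like at each stage for the sample files above (a sketch; the duplicate prefix/version pairs coming from the .mapping.json and .settings.json variants collapse at the -u step):
# after sed: prefix and version, space-separated
a_v 5
c_v 1
f_v 39
f_v 40
# after sort -Vr: highest versions first
f_v 40
f_v 39
c_v 1
a_v 5
# after sort -uk1,1: one line per prefix, the first (highest) kept
a_v 5
c_v 1
f_v 40
# after tr -d ' ': names reassembled
a_v5
c_v1
f_v40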
It's ugly and unsafe. The following solution is longer but can handle all file names, even those with linebreaks in them.
maxversions() {
    find -type f -iname \*.json -print0 |
    gawk 'BEGIN { RS = "\0"; ORS = "\0" }
        match($0, /(.*\/)*(.*_v)([0-9]+)\..*/, group) {
            prefix = group[2];
            version = group[3];
            if (version > maxversion[prefix])
                maxversion[prefix] = version
        }
        END {
            for (prefix in maxversion)
                print prefix maxversion[prefix]
        }'
}
mapfile -d '' array < <(maxversions)
In both cases you can check the contents of array with declare -p array.
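For the sample files, both variants should end up with something like this (element order may differ between the two approaches):
$ declare -p array
declare -a array=([0]="a_v5" [1]="c_v1" [2]="f_v40")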

Arrays and bash string parsing.
declare -A tmp=()
for f in $SOURCE_DIR/*.json
do  f=${f##*/}        # strip path
    tmp[${f%%.*}]=1   # strip extraneous data after . in filename
done
declare -a c=( $( printf "%s\n" "${!tmp[@]}" | cut -c 1 | sort -u ) ) # get just the first chars
declare -a lst=( $( for f in "${c[@]}"
    do printf "%s\n" "${!tmp[@]}" |
       grep "^${f}_" |
       sort -n |
       tail -1
    done ) )
echo "[ ${lst[@]} ]"
[ a_v5 c_v1 f_v40 ]
Or, if you'd rather,
declare -a arr=( $(
    for f in $SOURCE_DIR/*.json
    do  d=${f%/*}     # get dir path
        f=${f##*/}    # strip path
        g=${f:0:2}    # get leading str
        ( cd $d && printf "%s\n" ${g}*.json |
          sort -n | sed -n '$ { s/[.].*//; p; }' )
    done | sort -u ) )
echo "[ ${arr[@]} ]"
[ a_v5 c_v1 f_v40 ]

This is one possible way to accomplish this:
arr=( $( { for name in $( ls {f,n,m}*.txt ); do echo ${name:0:1} ; done; } | sort | uniq ) )
Output:
$ echo ${arr[0]}
f
$ echo ${arr[1]}
m
$ echo ${arr[2]}
n
Regards!

AWK SOLUTION
This is not an elegant solution... my knowledge of awk is limited.
You should find this functional.
I've updated this to remove the obsolete uniq as suggested by @socowi.
I've also included the printf version as @socowi suggested.
ls *.json | cut -d. -f1 | sort -rn | awk -v last="xx" '$1 !~ last{ print $1; last=substr($1,1,3) }'
OR
printf %s\\n *.json | cut -d. -f1 | sort -rn | awk -v last="xx" '$1 !~ last{ print $1; last=substr($1,1,3) }'
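For the question's sample files, either pipeline prints:
f_v40
c_v1
a_v5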
Old understanding below
Find files with name matching pattern.
Now take the second field, since each result will be prefixed with ./:
find . -type f -iname "*.json" | cut -d. -f2
To get the unique headings....
find . -type f -iname "*.json" | cut -d. -f2 | sort | uniq
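For the sample files this yields (note the leading / left over from the ./ path prefix, and that every version is kept, not just the latest):
/a_v5
/c_v1
/f_v39
/f_v40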

Related

How to return an MD5 and SHA1 value for multiple files in a directory using BASH

I am creating a BASH script that takes a directory as an argument and returns to stdout a list of all files in that directory with both the MD5 and SHA1 values of each. The only files I'm interested in are those between 100 and 500K. So far I've gotten this far (section of script):
cd $1 &&
find . -type f -size +100k -size -500k -printf '%f \t %s \t' -exec md5sum {} \; |
awk '{printf "NAME:" " " $1 "\t" "MD5:" " " $3 "\t" "BYTES:" "\t" $2 "\n"}'
I'm getting a little confused when adding the SHA1 and am obviously leaving something out.
Can anybody suggest a way to achieve this?
Ideally I'd like the script to format the output in the following way:
Name Md5 SHA1
(with the relevant fields underneath)
Your awk printf bit is overly complicated. Try this:
find . -type f -printf "%f\t%s\t" -exec md5sum {} \; | awk '{ printf "NAME: %s MD5: %s BYTES: %s\n", $1, $3, $2 }'
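Each file then yields one line shaped like this (name, digest, and size here are made-up placeholders):
NAME: report.txt MD5: 0123456789abcdef0123456789abcdef BYTES: 204800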
Just read line by line the list of files outputted by find:
find . -type f |
while IFS= read -r l; do
echo "$(basename "$l") $(md5sum <"$l" | cut -d" " -f1) $(sha1sum <"$l" | cut -d" " -f1)"
done
It's better to use a zero-separated stream:
find . -type f -print0 |
while IFS= read -r -d '' l; do
echo "$(basename "$l") $(md5sum <"$l" | cut -d" " -f1) $(sha1sum <"$l" | cut -d" " -f1)"
done
You could speed things up with xargs and multiple processes using the -P option to xargs:
find . -type f -print0 |
xargs -0 -n1 sh -c 'echo "$(basename "$1") $(md5sum <"$1" | cut -d" " -f1) $(sha1sum <"$1" | cut -d" " -f1)"' --
Consider adding -maxdepth 1 to find if you are not interested in files in subdirectories recursively.
It's easy from xargs to go to -exec:
find . -type f -exec sh -c 'echo "$1 $(md5sum <"$1" | cut -d" " -f1) $(sha1sum <"$1" | cut -d" " -f1)"' -- {} \;
Tested on repl.
Add those -size +100k -size -500k args to find to limit the sizes.
The | cut -d" " -f1,2,5 is used to remove the - that both md5sum and sha1sum output (each prints its digest followed by two spaces and a -, so the name and the two digests land in fields 1, 2 and 5 of the combined line). If there are no spaces in filenames, you can run a single cut process for the whole stream, which should be slightly faster:
find . -type f -print0 |
xargs -0 -n1 sh -c 'echo "$(basename "$1") $(md5sum <"$1") $(sha1sum <"$1")"' -- |
cut -d" " -f1,2,5
I also think that running a single md5sum and a single sha1sum process would probably be faster than spawning separate processes for each file, but that method needs to store all the filenames somewhere. Below, a bash array is used:
IFS=$'\n' files=($(find . -type f))
paste -d' ' <(
    printf "%s\n" "${files[@]}") <(
    md5sum "${files[@]}" | cut -d' ' -f1) <(
    sha1sum "${files[@]}" | cut -d' ' -f1)
Your find is fine, you want to join the results of two of those, one for each hash. The command for that is join, which expects sorted inputs.
doit() { find -type f -size +100k -size -500k -exec $1 {} + |sort -k2; }
join -j2 <(doit md5sum) <(doit sha1sum)
and that gets you the raw data in sane environments. If you want pretty data, you can use the column utility:
join -j2 <(doit md5sum) <(doit sha1sum) | column -t
and add nice headers:
(echo Name Md5 SHA1; join -j2 <(doit md5sum) <(doit sha1sum)) | column -t
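which produces something like this (file name and digests here are placeholders):
Name          Md5                               SHA1
./report.txt  00112233445566778899aabbccddeeff  00112233445566778899aabbccddeeff00112233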
and if you're in an unclean environment where people put spaces in file names, protect against that by subbing in tabs for the field markers:
doit() { find -type f -size +100k -size -500k -exec $1 {} + \
| sed 's, ,\t,'| sort -k2 -t$'\t' ; }
join -j2 -t$'\t' <(doit md5sum) <(doit sha1sum) | column -ts$'\t'

integer expression expected while running the bash script

While running my script below from Jenkins's "execute shell" option, I'm getting [: 1 2 3 4 5 : integer expression expected. I tried using the > symbol too without any luck; I'm not sure exactly where I went wrong.
Any help will be really helpful.
#!/bin/bash
declare -a folders
declare -a folders_req
db_ver=<the value which I got from my DB with trimmed leading & trailing spaces, like below>
#db_ver=`echo $( get_value ) |sed -e 's/\-//g' | grep -oP '(?<=DESCRIPTION)(\s+)?([^ ]*)' | sed -e 's/^[[:space:]]//g' | sed -e's/[[:space:]]*$//' | tr '\n' ' '| cut -d '/' -f2`
scripts_db_dir=`ls -td -- */ | head -1 | cut -d '/' -f1| sed -e 's/^[[:space:]]//g'`
cd ${scripts_db_dir}
folders=`ls -d */ | sed 's/\///g' | sed -e 's/^[[:space:]]//g' | sed -e's/[[:space:]]*$//' | tr '\n' ' '`
for i in "${folders[@]}"; do
    if [ "${i}" -gt "${db_ver}" ]; then
        echo "inside loop: $i"
        folders_req+=("$i")
    fi
done
#echo "$i"
#echo ${folders_req[#]}
scripts_db_dir contains directories named like 1 2 3 4 5.
Your folders variable should be initialized as an array and not as a string, e.g.:
folders=($(ls -d */ | sed 's/\///g' | sed -e 's/^[[:space:]]//g' | sed -e's/[[:space:]]*$//' | tr '\n' ' '))
Given the various comments regarding "parsing ls is bad", consider using find instead:
find * -maxdepth 1 -type d -name '[0-9]*' -print
where:
-maxdepth 1 - searches only the current directory, no subdirectories
-type d - looks only for directories
-name '[0-9]*' (or '[[:digit:]]*') - matches only items consisting of all digits
-print - just print the results
Thus:
folders=($(find * -maxdepth 1 -type d -name '[0-9]*' -print))
or just:
for i in $(find * -maxdepth 1 -type d -name '[0-9]*' -print); do
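Putting it together, a minimal sketch of the corrected loop (assuming db_ver holds a plain integer such as 3):
db_ver=3   # assumed here; the real script gets this from the DB query
folders=($(find * -maxdepth 1 -type d -name '[0-9]*' -print))
folders_req=()
for i in "${folders[@]}"; do
    if [ "$i" -gt "$db_ver" ]; then   # each $i is now a single integer, so -gt works
        folders_req+=("$i")
    fi
done
declare -p folders_req   # e.g. declare -a folders_req=([0]="4" [1]="5")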

bash: programmatically assemble list

I'm trying to write a shell script which is assembling a list that will later be passed to sort -n. If I do:
find . -type f -printf "%s\n" | sort -n
the output is sorted just as I expect. What I can't figure out is how to assemble the list from inside the script itself. Here is the current script which tries to sum up how much space is used in a directory, sorted by file extension:
#!/bin/sh
echo -n "Enter directory/path to analyze: "
read path
extList=` find $path -type f -print | awk ' BEGIN {FS="."}{ print $NF }' | grep -v '/' | sort | uniq `
for ext in $extList; do
byteList=`find $path -type f -name \*.$ext -printf '%s\n' `
sum=0
for b in $byteList; do
sum=$(( $sum + $b ))
done
sum=$(( $sum/1024 ))
list+=`printf " $sum KB $ext\n"`
done
echo $list | sort -n
I've tried a lot of things for the list+= line, but I don't get a true list. I wind up with everything appearing as a single line, unsorted.
Here's a Minimal, Complete, and Verifiable example of what you're seeing:
echo "$(printf 'foo\n')$(printf 'bar\n')"
Expected:
foo
bar
Actual:
foobar
This is because trailing linefeeds are stripped in the contents of $(..) and `..` command substitution.
Instead, you can use $'\n' or a literal linefeed. Both of these will correctly append a linefeed:
list+="foo"$'\n'
list+="bar
"
Once you fix that, here's your next MCVE:
list="foo
bar"
echo $list
Expected:
foo
bar
Actual:
foo bar
This is due to the lack of quoting in echo $list. It should be echo "$list".
However, none of this is the bash way of doing things. Instead of accumulating into a variable and then using the variable, just pipe the data. This is what you're doing:
list=""
for word in foo bar baz
do
list+="$word"$'\n'
done
echo "$list" | sort -n
This is more canonical:
for word in foo bar baz
do
echo "$word"
done | sort -n
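For example, with numbers the difference is easy to see:
$ for n in 10 2 33; do echo "$n"; done | sort -n
2
10
33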
One problem is that `cmd` strips trailing newlines. Another is that echo $list doesn't quote "$list", so newlines are printed as spaces.
There's no need to build a list variable to then sort it later, though. Instead, try sorting all of the loop's output.
for ext in $extList; do
...
printf " %s KB %s\n" "$sum" "$ext"
done | sort -n
I'd suggest not storing the extension list in a string either. You could use a function:
extList() {
find "$path" -maxdepth 1 -type f -printf '%P\n' | awk -F. 'NF>1 {print $NF}' | sort -u
}
extList | while IFS= read -r ext; do
...
done | sort -n
Or store them in an array:
readarray -t extList < <(find "$path" -maxdepth 1 -type f -printf '%P\n' | awk -F. 'NF>1 {print $NF}' | sort -u)
for ext in "${extList[@]}"; do
...
done | sort -n
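Putting the pieces together, a minimal sketch of the whole script rewritten this way (the shebang becomes bash, since readarray and process substitution are bashisms; like the readarray example above it only looks at the top-level directory):
#!/bin/bash
read -rp "Enter directory/path to analyze: " path

readarray -t extList < <(find "$path" -maxdepth 1 -type f -printf '%P\n' |
                         awk -F. 'NF>1 {print $NF}' | sort -u)

for ext in "${extList[@]}"; do
    sum=0
    while IFS= read -r bytes; do        # sum the sizes of all *.$ext files
        sum=$(( sum + bytes ))
    done < <(find "$path" -maxdepth 1 -type f -name "*.$ext" -printf '%s\n')
    printf " %s KB %s\n" "$(( sum / 1024 ))" "$ext"
done | sort -n                          # sort the loop's output directly, no variable needed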

To get \n instead of n in echo -e command in shell script

I am trying to get the output for the echo -e command as shown below
Command used:
echo -e "cd \${2}\nfilesModifiedBetweenDates=\$(find . -type f -exec ls -l --time-style=full-iso {} \; | awk '{print \$6,\$NF}' | awk '{gsub(/-/,\"\",\$1);print}' | awk '\$1>= '$fromDate' && \$1<= '$toDate' {print \$2}' | tr \""\n"\" \""\;"\")\nIFS="\;" read -ra fileModifiedArray <<< "\$filesModifiedBetweenDates"\nfor fileModified in \${fileModifiedArray[#]}\ndo\n egrep -w "\$1" "\$fileModified" \ndone"
Expected output:
cd ${2}
filesModifiedBetweenDates=$(find . -type f -exec ls -l --time-style=full-iso {} \; | awk '{print $6,$NF}' | awk '{gsub(/-/,"",$1);print}' | awk '$1>= '20140806' && $1<= '20140915' {print $2}' | tr "\n" ";")
IFS=; read -ra fileModifiedArray <<< $filesModifiedBetweenDates
for fileModified in ${fileModifiedArray[#]}
do
egrep -w $1 $fileModified
done
Original Output:
cd ${2}
filesModifiedBetweenDates=$(find . -type f -exec ls -l --time-style=full-iso {} \; | awk '{print $6,$NF}' | awk '{gsub(/-/,"",$1);print}' | awk '$1>= '20140806' && $1<= '20140915' {print $2}' | tr "n" ";")
IFS=; read -ra fileModifiedArray <<< $filesModifiedBetweenDates
for fileModified in ${fileModifiedArray[#]}
do
egrep -w $1 $fileModified
done
How can I handle \ in this?
For long blocks of text, it's much simpler to use a quoted here document than to try to embed a multi-line string in a single argument to echo or printf.
cat <<"EOF"
cd ${2}
filesModifiedBetweenDates=$(find . -type f -exec ls -l --time-style=full-iso {} \; | awk '{print $6,$NF}' | awk '{gsub(/-/,"",$1);print}' | awk '$1>= '20140806' && $1<= '20140915' {print $2}' | tr "\n" ";")
IFS=; read -ra fileModifiedArray <<< $filesModifiedBetweenDates
for fileModified in ${fileModifiedArray[#]}
do
egrep -w $1 $fileModified
done
EOF
You'd better use printf to have better control:
$ printf "tr %s %s\n" '"\n"' '";"'
tr "\n" ";"
As you see, we indicate the parameters within double quotes: printf "text %s %s", and then we define what content should be stored in these parameters.
In case you really have to use echo, then escape the \:
$ echo -e 'tr "\\n" ";"'
tr "\n" ";"
Interesting read: Why is printf better than echo?

Bash: Find file with max lines count

This is my attempt at doing it:
Find all *.java files:
find . -name '*.java'
Count lines:
wc -l
Delete the last line (wc's total):
sed '$d'
Use awk to find the max line count in the wc output:
awk 'max=="" || data=="" || $1 > max {max=$1 ; data=$2} END{ print max " " data}'
Then merge it all into a single line:
find . -name '*.java' | xargs wc -l | sed '$d' | awk 'max=="" || data=="" || $1 > max {max=$1 ; data=$2} END{ print max " " data}'
Can I somehow implement counting just non-blank lines?
find . -type f -name "*.java" -exec grep -H -c '[^[:space:]]' {} \; | \
sort -nr -t":" -k2 | awk -F: '{print $1; exit;}'
Replace the awk command with head -n1 if you also want to see the number of non-blank lines.
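That variant would be:
find . -type f -name "*.java" -exec grep -H -c '[^[:space:]]' {} \; | sort -nr -t":" -k2 | head -n1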
Breakdown of the original command:

find . -type f -name "*.java" -exec grep -H -c '[^[:space:]]' {} \;
'---------------------------'  '----------------------------------'
            |                                   |
  for each *.java file           use grep to count non-empty lines;
                                 -H includes filenames in the output
                                 (output = ./full/path/to/file.java:count)

| sort -nr -t":" -k2 | awk -F: '{print $1; exit;}'
  '----------------'   '-------------------------'
          |                        |
  sort the output in       print filename of the first
  reverse order using      entry (largest count),
  the second column        then exit immediately
  (count)
find . -name "*.java" -type f | xargs wc -l | sort -rn | grep -v ' total$' | head -1
To get the line count of every file using awk is just:
$ find . -name '*.java' -print0 | xargs -0 awk '
BEGIN { for (i=1;i<ARGC;i++) size[ARGV[i]]=0 }
{ size[FILENAME]++ }
END { for (file in size) print size[file], file }
'
To get the count of the non-empty lines, simply make the line where you increment the size[] conditional:
$ find . -name '*.java' -print0 | xargs -0 awk '
BEGIN { for (i=1;i<ARGC;i++) size[ARGV[i]]=0 }
NF { size[FILENAME]++ }
END { for (file in size) print size[file], file }
'
(NF already treats lines containing only blanks as empty; if you want such lines counted as non-empty instead, replace NF with /^./.)
To get only the file with the most non-empty lines just tweak again:
$ find . -name '*.java' -print0 | xargs -0 awk '
BEGIN { for (i=1;i<ARGC;i++) size[ARGV[i]]=0 }
NF { size[FILENAME]++ }
END {
for (file in size) {
if (size[file] >= maxSize) {
maxSize = size[file]
maxFile = file
}
}
print maxSize, maxFile
}
'
Something like this might work:
find . -name '*.java' | while read filename; do
    nlines=`grep -v -E '^[[:space:]]*$' "$filename" | wc -l`
    echo $nlines $filename
done | sort -nr | head -1
(edited as per Ed Morton's comment. I must have had too much coffee :-) )
