Creating an associative array from grep results in bash

I am using the following line in a bash script to return unique grep result strings and their counts:
hitStrings="$(eval 'find "$DIR" -type f -print0 | xargs -0 grep -roh "\w*$searchWord\w*"' | sort | uniq -c)"
For example, if I have a $searchWord of "you", I could get the following results:
      5 Kougyou
      2 Layout
     10 layouts
   2330 you
    859 your
     17 yourself
My questions are:
How do I create an associative array containing the returned strings as keys and their counts as values?
How do I omit the initial searchWord and its count from that associative array (so no entry for "you" itself when I search for "you")?
Thanks

You have too many unnecessary layers; you can achieve the same with
$ grep -roh "\w*$key\w*" | sort | uniq -c > counts
and
$ declare -A counts; while read -r v k; do counts[$k]=$v; done < counts
$ echo ${counts["you"]}
Note that, depending on the usage, you may get away with using the file itself. For example, looking up the count for "you" from the file:
$ awk -v key="you" '$2==key{print $1}' counts
If using the same name for both the file and the array confuses you, change one of them, or remove the temp file altogether with process substitution.
$ declare -A counts
$ while read -r v k; do counts[$k]=$v; done < <(grep -roh "\w*$key\w*" | sort | uniq -c)
or with evil eval you can do
$ eval declare -A counts=( $(grep -roh "\w*$key\w*" | sort | uniq -c | awk '{print "["$2"]="$1}') )
but why? The while loop is a perfectly fine solution.
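As for the second question (leaving the search word itself out of the array), a minimal sketch, assuming the search word is still in $key as above, is to skip that key inside the same while loop:
declare -A counts
while read -r v k; do
    # skip the bare search word itself; compounds like "your" and "yourself" are kept
    [[ $k == "$key" ]] && continue
    counts[$k]=$v
done < <(grep -roh "\w*$key\w*" | sort | uniq -c)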

Related

append output of each iteration of a loop to the same file in bash

I have 44 files (2 for each chromosome) divided in two types: .vcf and .filtered.vcf.
I would like to make a wc -l for each of them in a loop and append the output always to the same file. However, I would like to have 3 columns in this file: chr[1-22], wc -l of .vcf and wc -l of .filtered.vcf.
I've been trying to do independent wc -l for each file and paste together columnwise the 2 outputs for each of the chromosomes, but this is obviously not very efficient, because I'm generating a lot of unnecessary files. I'm trying this code for the 22 pairs of files:
wc -l file1.vcf | cut -f 1 > out1.vcf
wc -l file1.filtered.vcf | cut -f 1 > out1.filtered.vcf
paste -d "\t" out1.vcf out1.filtered.vcf
I would like to have just one output file containing three columns:
Chromosome VCFCount FilteredVCFCount
chr1 out1 out1.filtered
chr2 out2 out2.filtered
Any help will be appreciated, thank you very much in advance :)
printf "%s\n" *.filtered.vcf |
cut -d. -f1 |
sort |
xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' --
Output a newline-separated list of the files in the directory
Remove the extension with cut (probably something along the lines of xargs -i basename {} .filtered.vcf would be safer)
Sort it (for nicely sorted output!) (something along the lines of sort -tr -k2 -n would sort numerically and would be even better)
xargs -n1 - for each file, execute the small script via sh -c
printf "%s\t%s\t%s\n" - output with custom format string ...
"$1" - the filename and...
"(wc -l <"${1}.vcf")" - the count the lines in .vcf file and...
"$(wc -l <"${1}.filtered.vcf")" - the count of the lines in the .filtered.vcf
Example:
> touch chr{1..3}{,.filtered}.vcf
> echo > chr1.filtered.vcf ; echo > chr2.vcf ;
> printf "%s\n" *.filtered.vcf |
> cut -d. -f1 |
> sort |
> xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' --
chr1 0 1
chr2 1 0
chr3 0 0
To have nice looking table with headers, use column:
> .... | column -N Chromosome,VCFCount,FilteredVCFCount -t -o ' '
Chromosome VCFCount FilteredVCFCount
chr1 0 1
chr2 1 0
chr3 0 0
Maybe try this.
# Remove this line to not have this pesky header line
printf 'Chromosome\tVCFCount\tFilteredVCFCount\n' >counts.txt
for chr in chr*.vcf; do
    # the glob chr*.vcf also matches the .filtered.vcf files; skip those here
    case $chr in *.filtered.vcf) continue ;; esac
    base=${chr%.vcf}
    awk -v base="$base" 'BEGIN { OFS="\t" }
    FNR==1 && n { p=n }
    { n=FNR }
    END { print base, p, n }' "$chr" "$base.filtered.vcf"
done >>counts.txt
The very simple Awk script just collects the highest line number for each file (so we basically reimplement wc -l) and prints the collected numbers in the desired format. FNR is the line number in the current input file; we simply save this, and copy the value to p to keep the saved value from the previous file in a separate variable when we switch to a new file (starting over at line number 1).
The shell parameter substitution ${variable%pattern} retrieves the value of variable with any suffix match on pattern removed. (There is also ${variable#pattern} to remove a prefix, and Bash has ## and %% to trim the longest pattern match instead of the shortest.)
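For instance, a quick illustration of these expansions on a hypothetical filename:
f=chr7.filtered.vcf
echo "${f%.vcf}"    # shortest suffix match removed: chr7.filtered
echo "${f%%.*}"     # longest match with %%: chr7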
If efficiency is important, you could probably refactor all of the script into a single Awk script, but this way, all the pieces are simple and hopefully understandable.
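If you do want to go that route, here is a rough sketch of a single-Awk version (an assumption-laden sketch, not a drop-in replacement: it keys each count on everything before the first dot of the filename, and chromosome pairs whose files are both empty never produce a record, so they won't appear in the output):
awk -v OFS='\t' '
    { split(FILENAME, f, "."); chr = f[1]; seen[chr] = 1 }
    FILENAME ~ /\.filtered\.vcf$/ { filtered[chr]++; next }
    { plain[chr]++ }
    END {
        print "Chromosome", "VCFCount", "FilteredVCFCount"
        for (chr in seen) print chr, plain[chr] + 0, filtered[chr] + 0
    }
' chr*.vcf >counts.txt
The for (chr in seen) loop makes no ordering promise, so pipe the result through sort if you want the chromosomes in a fixed order.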

Count of matching word, pattern or value from unix korn shell scripting is returning just 1 as count

I'm trying to get the count of a matching pattern from a variable to check the count of it, but it's only returning 1 as the result. Here is what I'm trying to do:
x="HELLO|THIS|IS|TEST"
echo $x | grep -c "|"
Expected result: 3
Actual Result: 1
Do you know why it is returning 1 instead of 3?
Thanks.
grep -c counts lines, not matches within a line.
You can use awk to get a count:
x="HELLO|THIS|IS|TEST"
echo "$x" | awk -F '|' '{print NF-1}'
3
Alternatively you can use tr and wc:
echo "$x" | tr -dc '|' | wc -c
3
$ echo "$x" | grep -o '|' | grep -c .
3
grep -c does not count the number of matches. It counts the number of lines that match. By using grep -o, we put the matches on separate lines.
This approach works just as well with multiple lines:
$ cat file
hello|this|is
a|test
$ grep -o '|' file | grep -c .
3
The grep manual says:
grep, egrep, fgrep - print lines matching a pattern
and for the -c flag:
instead print a count of matching lines for each input file
and there is just one line that matches
You don't need grep for this.
pipe_only=${x//[^|]} # remove everything except | from the value of x
echo "${#pipe_only}" # output the length of pipe_only
Try this:
$ x="HELLO|THIS|IS|TEST"; echo -n "$x" | sed 's/[^|]//g' | wc -c
3
With only one pipe with perl:
echo "$x" |
perl -lne 'print scalar(() = /\|/g)'

Infinite loop in bash

I have written the following command to loop over a set of strings in the second column of my file and then do sorting for each string on column 11, then take the second and eleventh column and count the number of unique occurrences. Very simple but it seems that it enters an infinite loop and I can't see why. I would appreciate your help very much.
for item in $(cat file.txt | cut -f2 -d " "| uniq)
do
sort -k11,11 file.txt | cut -f2,11 -d " " | uniq -c | sort -k2,2 > output
done
There's no infinite loop here, but it is a very silly loop (that takes a long time to run, while not accomplishing the script's stated purpose). Let's look at how one might accomplish that purpose more sanely:
Using a temporary file for counts.txt to avoid needing to rerun the sort, cut and uniq steps on each iteration:
sort -k11,11 file.txt | cut -f2,11 -d " " | uniq -c >counts.txt
while read -r item; do
fgrep -e " ${item}" counts.txt
done < <(cut -f2 -d' ' <file.txt | uniq)
Even better, using bash 4 associative arrays and no temporary file:
# reads counts into an array
declare -A counts=( )
while read -r count item; do
counts[$item]=$count
done < <(sort -k11,11 file.txt | cut -f2,11 -d " " | sort | uniq -c)
# reads counts back out
while read -r item; do
echo "$item ${counts[$item]}"
done < <(cat file.txt | cut -f2 -d " "| sort | uniq)
...that said, that's only if you want to use sort for ordering on pulling data back out. If you don't need to do that, the latter part could be replaced as such:
# read counts back out
for item in "${!counts[@]}"; do
echo "$item ${counts[$item]}"
done

How to sort and get unique values from an array in bash?

I'm new to bash scripting... I'm trying to sort and store unique values from an array into another array.
eg:
list=('a','b','b','b','c','c');
I need,
unique_sorted_list=('b','c','a')
I tried a couple of things, but they didn't help me...
sorted_ids=($(for v in "${ids[@]}"; do echo "$v";done| sort| uniq| xargs))
or
sorted_ids=$(echo "${ids[@]}" | tr ' ' '\n' | sort -u | tr '\n' ' ')
Can you guys please help me with this?
Try:
$ list=(a b b b c c)
$ unique_sorted_list=($(printf "%s\n" "${list[@]}" | sort -u))
$ echo "${unique_sorted_list[@]}"
a b c
Update based on comments:
$ uniq=($(printf "%s\n" "${list[@]}" | sort | uniq -c | sort -rnk1 | awk '{ print $2 }'))
The accepted answer doesn't work if array elements contain spaces.
Try this instead:
readarray -t unique_sorted_list < <( printf "%s\n" "${list[@]}" | sort -u )
In Bash, readarray is an alias to the built-in mapfile command. See help mapfile for details.
The -t option is to remove the trailing newline (used in printf) from each line read.
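For example, a quick check that elements containing spaces survive the round trip:
list=('foo bar' 'foo bar' 'baz qux')
readarray -t unique_sorted_list < <(printf "%s\n" "${list[@]}" | sort -u)
printf '<%s>\n' "${unique_sorted_list[@]}"
# prints <baz qux> and <foo bar>, each on its own line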

Elegant way to check for equal values within an array or any given textfile

Hello, I'm fairly new to scripting and struggling with trying to test/check if 4 lines in a text file are equal to each other; I cannot figure this one out since comparison examples are all with two variables. I've come up with this:
#!/bin/sh
#check if mxf videofiles are older than 10 minutes and parse them into tclist.txt
find . -amin +10 |sed "s/^..//" >tclist.txt
#grep timecode and cut : from the output of mxfprobe and place that into variable TC
for z in $(cat tclist.txt); do TC=$(mxfprobe -i "$z" 2>&1 |grep timecode|sed "s/[^0-9]*//"|sed "s/://"|sed "s/://"|sed "s/://")
echo $TC >>offsetcheck.txt
done;
The output of offsetcheck.txt then looks like this:
10194013
10194013
10194014
10194014
How can I test if those 4 values are equal to each other? (In this example two files have drifted one frame.)
I've tried to place those values into an array and check them for uniqueness...
exec 10<&0
exec < offsetcheck.txt
let count=0
while read LINE; do
ARRAY[$count]=$LINE
((count++))
done
echo ${ARRAY[@]}
exec 0<&10 10<&-
if ($ARRAY !== array_unique($ARRAY))
{
echo There were duplicate values
}
... struggling with trying to test/check if 4 lines in a text file are
equal to each other
You could use sort and wc to determine the number of distinct values in the file. The following tells you whether all values in the file are identical:
(( $(sort -u offsetcheck.txt | wc -l) == 1 )) && echo "All values are identical" || echo "File contains differing values"
If you wanted to do the same for an array, you could say:
for i in "${ARRAY[@]}"; do echo "$i" ; done | sort -u | wc -l
to get the number of unique values in the array.
If the values in the array are guaranteed not to have any space, then saying:
echo "${ARRAY[#]}" | tr ' ' '\n' | sort -u | wc -l
would suffice. (But note the if above.)
Looks to me like the whole process can be reduced to
n=$(
find . -amin +10 |
sed "s/^..//" |
xargs -I FILE mxfprobe -i "FILE" 2>&1 |
grep -h timecode |
sed 's/[^0-9]//g' |
sort -u |
wc -l
)
Then check if n == 1
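For instance, the final check could be as simple as (plain POSIX sh, to match the #!/bin/sh script above):
if [ "$n" -eq 1 ]; then
    echo "All timecodes are identical"
else
    echo "Timecodes differ across files"
fi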
