Bash shell: count occurrences of patterns (loaded into an array from one file) in a second file

Hi, I have loaded the patterns from pattern.txt into an array, and now I would like to grep the count of each array element in a second file (named count.csv).
pattern.txt
abc
def
ghi
count.csv
1234,abc,joseph
5678,ramson,abc
2231,sam,def
1123,abc,richard
2521,ghi,albert
7371,jackson,def
The bash shell script is given below:
declare -a myArray
myArray=( $(awk '{print $1}' ./pattern.txt))
for ((i=0; i < ${#myArray[*]}; i++))
do
var1=$(grep -c "${myArray[i]}" count.csv)
echo $var1
done
But when I run the script, instead of giving the output below
3
2
1
it gives this output:
0
0
1
i.e. it only gives the correct count for the last array element.
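A quick hedged check of what actually lands in the array (not confirmed by the question, but an invisible character such as a Windows carriage return at the end of each pattern would produce exactly this kind of symptom):
printf '[%s]\n' "${myArray[@]}"   # brackets expose stray whitespace or carriage returns
cat -A pattern.txt                # GNU cat: a trailing ^M marks a carriage return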

grep + sort + uniq pipeline solution:
grep -o -w -f pattern.txt count.csv | sort | uniq -c
The output:
3 abc
2 def
1 ghi
grep options:
-f - obtain pattern(s) from file
-o - print only the matched parts of matching lines
-w - select only those lines containing matches that form whole words
The alternative awk approach:
awk 'NR==FNR{p[$0]; next}{ for(i=1;i<=NF;i++){ if($i in p) {p[$i]++; break} }}
END {for(i in p) print p[i],i}' pattern.txt FS="," count.csv
The output:
2 def
3 abc
1 ghi
p[$0] - accumulating patterns from the 1st input file (pattern.txt)
for(i=1;i<=NF;i++) - iterating through the fields of each line of the 2nd file (count.csv)
if($i in p) {p[$i]++; break} - incrementing counter for each matched pattern
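The break means each line of count.csv contributes to at most one pattern, which fits the sample data. A hedged variant for the case where a single line could contain several patterns, each of which should be counted, simply drops the break:
awk 'NR==FNR{p[$0]; next}{ for(i=1;i<=NF;i++) if($i in p) p[$i]++ }
END {for(i in p) print p[i],i}' pattern.txt FS="," count.csv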

It is better to use awk for processing text files line by line:
awk -F, 'NR==FNR {wrd[$1]; next} $2 in wrd{wrd[$2]++} $3 in wrd{wrd[$3]++}
END{for (w in wrd) print w, wrd[w]}' pattern.txt count.csv
def 2
abc 3
ghi 1
Reference: Effective AWK Programming

You could also skip the array and just loop over the patterns:
while read -r pattern; do
[[ -n $pattern ]] && grep -c "$pattern" count.csv
done < pattern.txt
grep -c outputs just the counts of the matches
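A hedged refinement: grep -c counts any line where the pattern matches as a regex, including substring hits (e.g. abc inside abcdef); adding -F (fixed string) and -w (whole word), as the pipeline answer above does with -w, avoids that:
while read -r pattern; do
    [[ -n $pattern ]] && grep -c -w -F "$pattern" count.csv
done < pattern.txt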

Try using this command instead:
mapfile -t myArray < pattern.txt
for pattern in "${myArray[@]}"; do
    grep -o "$pattern" count.csv | wc -l
done
Output:
3
2
1
mapfile will store every pattern in pattern.txt into myArray
The for loop will iterate through each pattern in myArray and print the number of occurrences of that pattern in count.csv

Related

how to change words with the same words but with a number at the back (bash)

I have a file for example with the name file.csv and content
adult,REZ
man,BRB
women,SYO
animal,HIJ
and a line that is neither a directory nor a file:
file.csv BRB1 REZ3 SYO2
What I want to do is match the capitalized words on that line against the second column of the file, and for each match take the n-th letter of the word in the first column, where n is the number at the end of the capitalized word,
and the output should then be
umo
I know that I can loop over the arguments with
for i in "${@:2}"
do
words+=$(echo "$i ")
done
and then the output is
REZ3 BRB1 SYO2
Using awk:
Pass the string of values as an awk variable and then split them into an array a. For each record in file.csv, iterate this array and if the second field of current record matches the first three characters of the current array value, then strip the target character from the first field of the current record and append it to a variable. Print the value of the aggregated variable.
awk -v arr="BRB1 REZ3 SYO2" -F, 'BEGIN{split(arr,a," ")} {for (v in a) { if ($2 == substr(a[v],0,3)) {n=substr(a[v],length(a[v]),1); w=w""substr($1,n,1) }}} END{print w}' file.csv
umo
You can also put this into a script:
#!/bin/bash
words="${2}"
src_file="${1}"
awk -v arr="$words" -F, 'BEGIN{split(arr,a," ")} \
{for (v in a) { \
if ($2 == substr(a[v],0,3)) { \
n=substr(a[v],length(a[v]),1); \
w=w""substr($1,n,1);
}
}
} END{print w}' "$src_file"
Script execution:
./script file.csv "BRB1 REZ3 SYO2"
umo
This is a way using sed.
Create a pattern string from command arguments and convert lines with sed.
#!/bin/bash
file="$1"
pat='s/^/ /;Te;'
for i in ${@:2}; do
pat+=$(echo $i | sed 's#^\([^0-9]*\)\([0-9]*\)$#s/.\\{\2\\}\\(.\\).*,\1$/\\1/;#')
done
pat+='Te;H;:e;${x;s/\n//g;p}'
eval "sed -n '$pat' $file"
Try this code:
#!/bin/bash
declare -A idx_dic
filename="$1"
pattern_string=""
for i in "${#:2}";
do
pattern_words=$(echo "$i" | grep -oE '[A-Z]+')
index=$(echo "$i" | grep -oE '[0-9]+')
pattern_string+=$(echo "$pattern_words|")
idx_dic["$pattern_words"]="$index"
done
pattern_string=${pattern_string%|*}
while IFS= read -r line
do
line_pattern=$(echo $line | grep -oE $pattern_string)
[[ -n $line_pattern ]] && line_index="${idx_dic[$line_pattern]}" && echo $line | awk -v i="$line_index" '{split($0, chars, ""); printf("%s", chars[i]);}'
done < $filename
First, extract the capitalized word from each argument and record the corresponding index.
Then build the whole pattern string by joining the words with |.
Finally, iterate over every line, match it against the pattern string, and pick the letter at the recorded index.
Execute this script.sh like:
bash script.sh file.csv BRB1 REZ3 SYO2

append output of each iteration of a loop to the same file in bash

I have 44 files (2 for each chromosome) divided into two types: .vcf and .filtered.vcf.
I would like to make a wc -l for each of them in a loop and append the output always to the same file. However, I would like to have 3 columns in this file: chr[1-22], wc -l of .vcf and wc -l of .filtered.vcf.
I've been trying to do an independent wc -l for each file and paste the two outputs together column-wise for each chromosome, but this is obviously not very efficient, because I'm generating a lot of unnecessary files. I'm trying this code for each of the 22 pairs of files:
wc -l file1.vcf | cut -f 1 > out1.vcf
wc -l file1.filtered.vcf | cut -f 1 > out1.filtered.vcf
paste -d "\t" out1.vcf out1.filtered.vcf
I would like to have just one output file containing three columns:
Chromosome VCFCount FilteredVCFCount
chr1 out1 out1.filtered
chr2 out2 out2.filtered
Any help will be appreciated, thank you very much in advance :)
printf "%s\n" *.filtered.vcf |
cut -d. -f1 |
sort |
xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' --
Output a newline-separated list of the files in the directory
Remove the extension with cut (probably something along the lines of xargs -i basename {} .filtered.vcf would be safer)
Sort it (for nice sorted output!) (probably something along the lines of sort -tr -k2 -n would sort numerically and would be even better)
xargs -n1 - for each file, execute the sh -c script
printf "%s\t%s\t%s\n" - output with a custom format string ...
"$1" - the filename and...
"$(wc -l <"${1}.vcf")" - the count of the lines in the .vcf file and...
"$(wc -l <"${1}.filtered.vcf")" - the count of the lines in the .filtered.vcf
Example:
> touch chr{1..3}{,.filtered}.vcf
> echo > chr1.filtered.vcf ; echo > chr2.vcf ;
> printf "%s\n" *.filtered.vcf |
> cut -d. -f1 |
> sort |
> xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' --
chr1 0 1
chr2 1 0
chr3 0 0
To have nice looking table with headers, use column:
> .... | column -N Chromosome,VCFCount,FilteredVCFCount -t -o ' '
Chromosome VCFCount FilteredVCFCount
chr1 0 1
chr2 1 0
chr3 0 0
Maybe try this.
{
# Remove this printf if you do not want the header line
printf 'Chromosome\tVCFCount\tFilteredVCFCount\n'
for chr in chr*[0-9].vcf; do    # *[0-9].vcf keeps the glob from also matching the .filtered.vcf files
  base=${chr%.vcf}
  awk -v base="$base" 'BEGIN { OFS="\t" }
  FNR==1 && n { p=n }
  { n=FNR }
  END { print base, p, n }' "$chr" "$base.filtered.vcf"
done
} >counts.txt
The very simple Awk script just collects the highest line number for each file (so we basically reimplement wc -l) and prints the collected numbers in the desired format. FNR is the line number in the current input file; we simply save this, and copy the value to p to keep the saved value from the previous file in a separate variable when we switch to a new file (starting over at line number 1).
The shell parameter substitution ${variable%pattern} retrieves the value of variable with any suffix match on pattern removed. (There is also ${variable#pattern} to remove a prefix, and Bash has ## and %% to trim the longest pattern match instead of the shortest.)
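A minimal illustration of that substitution (hypothetical file name):
chr=chr7.vcf
echo "${chr%.vcf}"    # prints: chr7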
If efficiency is important, you could probably refactor all of the script into a single Awk script, but this way, all the pieces are simple and hopefully understandable.
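A minimal sketch of that single-Awk idea (an assumption, not part of the original answer): let one awk invocation read every chromosome file, key the line counts by the base name, and print the table at the end:
awk 'BEGIN { OFS = "\t"; print "Chromosome", "VCFCount", "FilteredVCFCount" }
{
    base = FILENAME
    filtered = (FILENAME ~ /\.filtered\.vcf$/)   # which of the two kinds this line came from
    sub(/(\.filtered)?\.vcf$/, "", base)         # strip the extension to get chrN
    if (filtered) filt[base]++; else raw[base]++
    seen[base]                                   # remember the base name
}
END { for (b in seen) print b, raw[b]+0, filt[b]+0 }' chr*.vcf > counts.txt
Caveats: a file with zero lines never triggers the per-line rule, so a chromosome whose two files are both empty is skipped, and the for (b in seen) output order is unspecified (pipe through sort if the order matters).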

Read each line of a column of a file and execute grep

I have file.txt, for example:
This line contains ABC
This line contains DEF
This line contains GHI
and here the following list.txt:
contains ABC<TAB>ABC
contains DEF<TAB>DEF
Now I am writing a script that executes the following commands for each line of this external file list.txt:
take the string from column 1 of list.txt and search for it in a third file, file.txt
if that search succeeds, return the string from column 2 of list.txt
So my output.txt is:
ABC
DEF
This is my code for grep/echo with putting the query/return strings manually:
if grep -i -q 'contains abc' file.txt
then
echo ABC >output.txt
else
echo -n
fi
if grep -i -q 'contains def' file.txt
then
echo DEF >>output.txt
else
echo -n
fi
I have about 100 search terms, which makes the task laborious if done manually. So how do I combine while read line; do [commands]; done < list.txt with the commands that handle column 1 and column 2 inside that script?
I would like to use simple grep/echo/awk commands if possible.
Something like this?
$ awk -F'\t' 'FNR==NR { a[$1] = $2; next } {for (x in a) if (index($0, x)) {print a[x]}} ' list.txt file.txt
ABC
DEF
For the lines of the first file (FNR==NR), read the key-value pairs into array a. Then for the lines of the second file, loop through the array, check if the key is found on the line, and if so, print the stored value. index($0, x) tries to find the contents of x in (the current line) $0. $0 ~ x would instead take x as a regex to match against.
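Since the manual greps in the question used -i, a hedged case-insensitive variant lowercases both the keys and the current line before comparing:
awk -F'\t' 'FNR==NR { a[tolower($1)] = $2; next }
{ s = tolower($0); for (x in a) if (index(s, x)) print a[x] }' list.txt file.txt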
If you want to do it in the shell, starting a separate grep for each and every line of list.txt, something like this:
while IFS=$'\t' read k v ; do
grep -qFe "$k" file.txt && echo "$v"
done < list.txt
read k v reads a line of input and splits it (based on IFS) into k and v.
grep -F takes the pattern as a fixed string, not a regex, and -q prevents it from outputting the matching line. grep returns true if any matching lines are found, so $v is printed if $k is found in file.txt.
Using awk and grep:
for text in $(awk '{print $4}' file.txt)
do
grep "contains $text" list.txt | awk -F $'\t' '{print $2}'
done

Bash script to add numbers from all files (each containing an integer) in a directory

I have many .txt files in a directory. Each file has only an integer.
How to write a bash script to add these integers and save the output to a file?
Just loop through the files, extracting their integers, and then sum them:
grep -ho '[0-9]*' files* | awk '{sum+=$1} END {print sum}'
Explanation
grep -ho '[0-9]*' files* extract numbers from the files whose name matches files*. We use -h to prevent getting the file name of the match and -o to just get the match, not the whole line.
awk '{sum+=$1} END {print sum}' loop through the values coming from grep and sum them. Finally, print the result.
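Since the question asks for the result saved to a file, the same pipeline can simply be redirected (sum.txt is a hypothetical name):
grep -ho '[0-9]*' files* | awk '{sum+=$1} END {print sum}' > sum.txt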
Test
$ tail a*
==> a1 <==
hello 23 asd
asdfasfd
==> a2 <==
asdfasfd
is 15
==> a3 <==
$ grep -ho '[0-9]*' a* | awk '{sum+=$1} END {print sum}'
38
You can cat your files and then sum up using awk:
cat *.txt | awk '{x+=$0}END{print x}' > test.txt
test.txt should contain the sum.
Create some test files:
$ for f in {a,b,c,d}.txt; do
> echo $RANDOM > "$f"
> done
$ cat *.txt
18419
25511
31919
28810
Sum it using Bash:
$ i=0;
$ for f in *.txt; do
> ((i+=$(<"$f")));
> done
$ echo $i
104659

unix command to get lines between the first and last occurrence of a word and write them to a file

I want a unix command to find the lines between the first & last occurrence of a word.
For example:
let's imagine we have 1000 lines. The tenth line contains the word "stackoverflow", and the thirty-fifth line also contains the word "stackoverflow".
I want to print lines 10 through 35 and write them to a new file.
You can do it in two steps. The basic idea is to:
1) get the line numbers of the first and last match.
2) print the range of lines between those two line numbers.
$ read first last <<< $(grep -n stackoverflow your_file | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file
Explanation
read first last reads two values and stores them in $first and $last.
grep -n stackoverflow your_file greps and shows the output like this: number_of_line:output
awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}' prints the line numbers of the first and last match of stackoverflow in the file.
And
awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file prints all lines from line number $first through line number $last.
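A hedged alternative that keeps it to a single command by reading the file twice: the first pass records the first and last matching line numbers, the second pass prints only that range (written to a new file, here newfile.txt):
awk 'NR==FNR { if (/stackoverflow/) { if (!f) f = FNR; l = FNR } next }
FNR >= f && FNR <= l' your_file your_file > newfile.txt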
Test
$ cat a
here we
have some text
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
to make more fun
blablabla
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' a
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
By steps:
$ grep -n stackoverflow a
3:stackoverflow
9:stackoverflow
11:stackoverflow
$ grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}'
3 11
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ echo "first=$first, last=$last"
first=3, last=11
If you know an upper bound of how many lines there can be (say, a million), then you can use this simple abusive script:
(grep -A 1000000 stackoverflow | grep -B 1000000 stackoverflow) < file
You can append | tail -n +2 | head -n -1 to strip the border lines as well:
(grep -A 1000000 stackoverflow | grep -B 1000000 stackoverflow
| tail -n +2 | head -n -1) < file
I'm not 100% sure from the question whether the output should be inclusive of the first and last matching lines, so I'm assuming it is. But this can be easily changed if we want exclusive instead.
This pure-bash solution does it all in one step - i.e. the file (or pipe) is only read once:
#!/bin/bash
function midgrep {
    while IFS= read -r ln; do
        [ "$saveline" ] && linea[$((i++))]=$ln
        if [[ $ln =~ $1 ]]; then
            if [ "$saveline" ]; then
                for ((j=0; j<i; j++)); do echo "${linea[$j]}"; done
                i=0
            else
                saveline=1
                linea[$((i++))]=$ln
            fi
        fi
    done
}
midgrep "$1"
Save this as a script (e.g. midgrep.sh) and pipe whatever output you like to it as follows:
$ cat input.txt | ./midgrep.sh stackoverflow
This works as follows:
find the first matching line and buffer it in the first element of an array
continue reading lines until the next match, buffering to the array as we go
on each subsequent match, flush the buffer array to the output
continue reading the file to the end; if there are no more matches, the last buffer is simply discarded.
The advantage of this approach is that we read through the input only once. The disadvantage is that we buffer everything between each match - if there are many lines between each match, then these are all buffered to memory, until we hit the next match.
Also this uses the bash =~ regular expression operator to keep this pure bash. But you could replace this with a grep instead, if you are more comfortable with that.
Using perl :
perl -00 -lne '
chomp(my @arr = split /stackoverflow/);
print join "\nstackoverflow", @arr[1 .. $#arr - 1]
' file.txt | tee newfile.txt
The idea behind this is to split the whole input into an array of chunks, using the string "stackoverflow" as the separator. Then we print elements 2 through the next-to-last, joined back together with "\nstackoverflow".
