shell script (with loop) to grep a list of strings one by one - bash

I have a big data text file (more than 100,000 rows) in this format:
0.000197239;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc
0.00118343;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.00276134;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;
0.0607495;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.00670611;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=XDH;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000197239;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=XDH;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000394477;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.0108481;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000394477;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.0108481;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
Each row contains a gene name; for example, the first 4 rows contain the CLCNKA gene. I am using the grep command to count the frequency of each gene name in this data file, like this:
grep -w "CLCNKA" my_data_file | wc -l
There are about 300 genes in a separate file which are to be searched for in the above data file. Could someone please write a simple shell script with a loop that takes each gene name from the list one by one and stores its frequency in a separate file? The output file would look like this:
CLCNKA 4
XDH 2
GRK4 4

You've confused us. I and some others think all you want is a count of each gene in the file, since that's what your input/output and some of your descriptive text state ("count the frequency of each gene name in this data file"), which would just be this:
$ awk -F'[=;]' '{cnt[$11]++} END{for (gene in cnt) print gene, cnt[gene]}' file
GRK4 4
CLCNKA 4
XDH 2
while everyone else thinks you want a count of only the specific genes listed in a different file, since that's what your Subject line, proposed algorithm and the rest of your text state.
If everyone else is right then you'd need this tweak to read the "genes" file first and only count the genes in "file" that were listed in "genes":
awk -F'[=;]' 'NR==FNR{genes[$0]; next} $11 in genes{cnt[$11]++} END{for (gene in cnt) print gene, cnt[gene]}' genes file
GRK4 4
CLCNKA 4
XDH 2
Your example doesn't help, since it would produce the same output with either interpretation of your requirements, so edit your question to clarify what it is you want. In particular, if there are genes that you do NOT want counted, then include lines containing those genes in the sample input.
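For instance, a purely hypothetical illustration: if the data file also contained a single line for a gene, say NPPA, that is not listed in the genes file, the two commands would diverge, something like this:
$ awk -F'[=;]' '{cnt[$11]++} END{for (g in cnt) print g, cnt[g]}' file              # counts every gene found
NPPA 1
GRK4 4
CLCNKA 4
XDH 2
$ awk -F'[=;]' 'NR==FNR{genes[$0]; next} $11 in genes{cnt[$11]++} END{for (g in cnt) print g, cnt[g]}' genes file   # counts only listed genes
GRK4 4
CLCNKA 4
XDH 2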

awk is your friend
awk '{sub(/^.*Gene\.refGene=/,"");sub(/;.*$/,"");
genelist[$0]++}END{for(i in genelist){print i,genelist[i]}}' file
Output
GRK4 4
CLCNKA 4
XDH 2
Sidenote: This may not give you the gene frequencies in the order in which the genes appear in the file. I guess that is not a requirement after all.
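If first-appearance order does matter, a variant of the same idea (a sketch, not tested on the full data) records the order in a second array:
awk '{sub(/^.*Gene\.refGene=/,"");sub(/;.*$/,"")
      if (!($0 in genelist)) order[++n]=$0       # remember each gene the first time it is seen
      genelist[$0]++}
     END{for(i=1;i<=n;i++) print order[i], genelist[order[i]]}' file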

This can also be done in pure bash, by using the associative array feature to count the frequencies:
#!/bin/bash
# declare assoc array
declare -A freq
# split stdin input csv
for gene in $(cut -d ';' -f 6|cut -d = -f 2);do
let freq[$gene]++
done
# loop over array keys
for key in "${!freq[@]}"; do
echo ${key} ${freq[$key]}
done
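The script reads the data on standard input, so assuming it is saved as count_genes.sh (a file name chosen here just for illustration), it would be run as:
bash count_genes.sh < my_data_file > gene_counts.txt
Note that declare -A needs bash 4 or newer.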

A simpler solution relying on the uniq command:
#!/bin/bash
cut -d ';' -f 6|cut -d = -f 2|sort|uniq -c|while read -a kv;do
echo ${kv[1]} ${kv[0]}
done

Here is a one-liner:
sed "s/.*Gene.refGene=//;s/\;.*//" test | sort | uniq -c | awk '{print $2,$1}'
sed removes everything from the line except the gene name
sort sorts the names so identical ones are adjacent
uniq -c counts how many times each gene is repeated
awk swaps the uniq output columns (by default uniq -c prints "count name")
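To make the pipeline concrete, here is the same command annotated stage by stage (a comment after each pipe is legal shell):
sed "s/.*Gene.refGene=//;s/\;.*//" test |   # keep only the gene name on each line
  sort |                                    # group identical names (uniq -c needs sorted input)
  uniq -c |                                 # prefix each distinct name with its count
  awk '{print $2,$1}'                       # swap columns to "NAME COUNT"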

To preserve order, provided identical genes are grouped together in the input file as in the sample:
$ perl -lne '
($g) = /Gene\.refGene=([^;]+)/;
if($g ne $p && $. > 1)
{
print "$p\t$c";
$c = 0;
}
$c++; $p = $g;
END { print "$p\t$c" }' ip.txt
CLCNKA 4
XDH 2
GRK4 4
If not, use a hash to count occurrences with the gene name as key, and an array to remember the order in which keys were first seen:
$ perl -lne '
($k) = /Gene\.refGene=([^;]+)/;
push(@o, $k) if !$h{$k}++;
END { print "$_\t$h{$_}" foreach (@o) }' ip.txt
CLCNKA 4
XDH 2
GRK4 4

If you only search for a list of genes, an inefficient but straightforward way is:
while read g; do echo -n "$g "; grep -c "$g" file; done < genes
assuming your genes are listed one per line in the genes file.
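Gene names can be substrings of other gene names, so it is probably safer to match whole words, as the grep -w in the question already does; a slightly more defensive sketch of the same loop (the output file name is arbitrary):
while read -r g; do
    printf '%s %s\n' "$g" "$(grep -wc "$g" file)"    # -w whole-word match, -c count matching lines
done < genes > gene_counts.txt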
If your file structure is fixed, a more efficient version will be
awk 'NR==FNR{genes[$1];next}
{sub(/Gene.refGene=/,"",$6)}
$6 in genes{count[$6]++}
END{for(g in count) print g,count[g]}' genes FS=';' file
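If you also want genes with zero hits reported, and the output in the same order as the genes file, a variant along the same lines (an untested sketch) could be:
awk 'NR==FNR{order[++n]=$1; next}
     {sub(/Gene\.refGene=/,"",$6); count[$6]++}
     END{for(i=1;i<=n;i++) print order[i], count[order[i]]+0}' genes FS=';' file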

Related

awk to do group by sum of column

I have this CSV file and I am trying to write a shell script to calculate the sum of a column after doing a group by on it. The column number is 11 (STATUS).
My script is
awk -F, 'NR>1{arr[$11]++}END{for (a in arr) print a, arr[a]}' $f > $parentdir/outputfile.csv;
File output expected is
COMMITTED 2
but the actual output is just 2.
It prints only the count, not the group key. If I delete any other columns and run the same query it works fine, but not with the sample data below.
FILE NAME;SEQUENCE NR;TRANSACTION ID;RUN NUMBER;START EDITCREATION;END EDITCREATION;END COMMIT;EDIT DURATION;COMMIT DURATION;HAS DEPENDENCY;STATUS;DETAILS
Buldhana_Refinesource_FG_IW_ETS_000001.xml;1;4a032127-b20d-4fa8-9f4d-7f2999c0c08f;1;20180831130210345;20180831130429638;20180831130722406;140;173;false;COMMITTED;
Buldhana_Refinesource_FG_IW_ETS_000001.xml;2;e4043fc0-3b0a-46ec-b409-748f98ce98ad;1;20180831130722724;20180831130947144;20180831131216693;145;150;false;COMMITTED;
Change the FS to ; in your script:
awk -F';' 'NR>1{arr[$11]++}END{for (a in arr) print a, arr[a]}' file
COMMITTED 2
You're using the wrong field separator. Use
awk -F\;
The ; must be escaped (or quoted) so the shell doesn't treat it as a command separator. Apart from that, your approach seems OK.
Besides awk, you may also use
tail -n +2 $f | cut -f11 -d\; | sort | uniq -c
or
datamash --header-in -t \; -g 11 count 11 < $f
to do the same thing.
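Since the title mentions a sum, note that the commands above count rows per STATUS; if you actually need to sum a numeric column per status, for example EDIT DURATION (field 8 in the sample header), an awk sketch would be:
awk -F';' 'NR>1{sum[$11]+=$8} END{for (s in sum) print s, sum[s]}' "$f"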

Extract lines from file2 that exist in file1 using a loop

I am very new at shell scripting and I am having some trouble with the following task:
I want to extract lines from file2 that are found also in file1 and extract those lines to a new file3. I am only allowed to use loops for this (I know it works with the basic grep command, but I need to find a way with a loop)
File1
John 5 red books
Ashley 4 yellow music
Susan 8 green films
File2
John
Susan
Desired output for file3 would be:
John 5 red books
Susan 8 green films
The desired output has to be produced using a bash script and a loop. I have tried the following loop, but I am missing some lines in the results:
while read line
do
grep "${line}" $file1
done < $file2 >> file3.txt
If anyone has any thoughts on how to improve my script or any new ideas (again using loops) it would be greatly appreciated. Thank you!
Looping here is a good educational exercise but it isn't ideal for this in the real world.
Technically, this AWK solution works and uses a loop, but I'm guessing it's not what your instructor is looking for:
awk 'NR == FNR { find[$1]=1; next } find[$1]' File2 File1 >File3
I've swapped the order of the files so the file with the data (File1) is loaded after the file listing what we want (File2).
This starts with a condition that ensures we're on the first file AWK reads (NR is the "number of records" (lines) seen so far across all inputs and FNR is the current file's number of records, so since this clause requires them to be the same value, it can only fire on the first input file). It sets a hash (a data structure with key/value pairs, a.k.a. an associative array or dictionary) whose key is the value of the first column ($1) on the line so we can extract it later, then next skips the later stanza for that input line.
When the code loops through the next file (File1), the first clause does not fire and instead the first column of input is looked up in the find hash. If it is present, its value is 1 and that evaluates to true, so we print the value. (A clause with no action implies { print })
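Spelled out with comments, the one-liner above is equivalent to this (same logic, just spread over several lines):
awk '
  NR == FNR {      # true only while reading the first file listed, File2
    find[$1] = 1   # remember each wanted name as a hash key
    next           # skip the clause below for File2 lines
  }
  find[$1]         # File1: print the line when column 1 was remembered
' File2 File1 >File3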
See Toby Speight's answer for a native bash answer with only builtins. It uses loops and hashes. You'll likely find that solution is slower on larger data sets.
Since you're using Bash, you could create an associative array from File2, and use that to check membership. Something like (untested):
mapfile -t names <File2
declare -A n
for i in "${names[@]}"
do n["$i"]="$i"
done
while read -r name rest
do [ "${n[$name]}" ] && echo "$name $rest"
done <File1 >file3
Awk solution:
awk 'NR==FNR{ arr[$0]="";next } { for (i in arr) { if (i == $1 ) { print $0 } } }' file2 file1
First we create an array with the data in file2. We then use this to check the first space-delimited field of file1 and print the line if there is a match.
With awk:
$ awk 'NR==FNR{ a[$1];next } $1 in a' file2 file1
With grep:
$ grep -F -f file2 file1
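One caveat with the grep version: -F -f matches the names anywhere on the line, so a short name could also match inside a longer word; adding -w restricts the match to whole words:
$ grep -wF -f file2 file1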

How to use grep -c to count ocurrences of various strings in a file?

I have a bunch of files with data from a company and I need to count, let's say, how many people from certain cities there are. Initially I was doing it manually with
grep -c 'Chicago' file.csv
But now I have to look for a lot of cities and it would be time-consuming to do this manually every time. So I did some research and found this:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
grep -c '$p' 'file.csv'
done
But it doesn't work. It keeps giving me 0s as output and I'm not sure what is wrong. Anyway, basically what I need is an output with every result (just the values) given by grep in a column so I can copy it directly to a spreadsheet. Ex.:
132
407
523
Thanks in advance.
You should use sort + uniq for that:
$ awk -F, '{print $<N>}' file.csv | sort | uniq -c
where N is the column number of the cities (I assume the file is structured, since it's a CSV file).
For example, counting which shell is used how often on my system:
$ awk -F: '{print $7}' /etc/passwd | sort | uniq -c
1 /bin/bash
1 /bin/sync
1 /bin/zsh
1 /sbin/halt
41 /sbin/nologin
1 /sbin/shutdown
$
From the title, it sounds like you want to count the number of occurrences of the string rather than the number of lines on which the string appears, but since you accept the grep -c answer I'll assume you actually only care about the latter. Do not use grep and read the file multiple times. Count everything in one pass:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' input-file
Note that this will print a blank line instead of "0" for any string that does not appear, so you might want to initialize the counters. There are several ways to do that. I like:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' c=0 w=0 n=0 input-file
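For completeness: the reason the loop in the question printed only zeros is almost certainly the single quotes around $p, which prevent the shell from expanding the variable, so grep searched for the literal string $p. Keeping the loop but fixing the quoting would look like this:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
    grep -c "$p" file.csv    # double quotes let the shell expand $p
done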

bash - how do I use 2 numbers on a line to create a sequence

I have this file content:
2450TO3450
3800
4500TO4560
And I would like to obtain something of this sort:
2450
2454
2458
...
3450
3800
4500
4504
4508
..
4560
Basically I would need a one-liner in sed/awk that reads the values on both sides of the TO separator and feeds them to a seq command, or does the loop on its own, and dumps the result into the same file as one value per line, with an arbitrary increment, say 4 in the example above.
I know I could use a temp file, the read command and some sorting, but I would like to do it in a one-liner starting with cat filename | etc., as it is already part of a bigger script.
Correctness of the input is guaranteed, so the left side of TO is always smaller than the right side.
Thanks
Like this:
awk -F'TO' -v inc=4 'NF==1{print $1;next}{for(i=$1;i<=$2;i+=inc)print i}' file
or, if you like starting with cat:
cat file | awk -F'TO' -v inc=4 'NF==1{print $1;next}{for(i=$1;i<=$2;i+=inc)print i}'
Something like this might work:
awk -F TO '{system("seq " $1 " 4 " ($2 ? $2 : $1))}'
This would tell awk to system (execute) the command seq 10 4 10 for lines just containing 10 (which outputs 10), and something like seq 10 4 40 for lines like 10TO40. The output seems to match your example.
Given:
txt="2450TO3450
3800
4500TO4560"
You can do:
echo "$txt" | awk -F TO '{$2<$1 ? t=$1 : t=$2; for(i=$1; i<=t; i++) print i}'
If you want an increment greater than 1:
echo "$txt" | awk -F TO -v p=4 '{$2<$1 ? t=$1 : t=$2; for(i=$1; i<=t; i+=p) print i}'
Give this a try:
sed 's/TO/ /' file.txt | while read first second; do if [ ! -z "$second" ] ; then seq $first 4 $second; else printf "%s\n" $first; fi; done
sed is used to replace TO with a space character.
read is used to read the line; if there are two numbers, seq is used to generate the sequence. Otherwise, the single number is printed as-is.
This might work for you (GNU sed):
sed -r 's/(.*)TO(.*)/seq \1 4 \2/e' file
The e flag executes the result of the substitution as a shell command, so lines containing TO are replaced by the output of the corresponding seq command; other lines are printed unchanged.

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If the data of file 1 is present in file 2, it should return 1, or else 0, in a tab-separated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
The above code is not giving me the output I am looking for.
Kindly have a look and suggest a correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that the line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on whether the line was seen in file2.
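Note that the sample data mixes case (alex in file 1, Alex in file 2); if those should be treated as the same name, a case-insensitive variant of the same idea would be:
awk 'NR==FNR{ seen[tolower($0)]=1 } NR!=FNR{ print $0 " " seen[tolower($0)] + 0 }' file2 file1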
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file2 file1 | sed $'s/$/\t1/'
grep -vxF -f file2 file1 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read; do
if [[ $REPLY = $'\t'* ]] ; then
printf "%s\t1\n" "${REPLY#?}"
else
printf "%s\t0\n" "${REPLY}"
fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have Python installed. If you're familiar with Python, the following does the job; you may only need to adjust the output formatting.
#!/usr/bin/env python3
f1 = open('file1').read().splitlines()
f2 = set(open('file2').read().splitlines())
# 1 if the line from file1 also appears in file2, else 0
f1_in_f2 = [int(x in f2) for x in f1]
for n, c in zip(f1, f1_in_f2):
    print(n, c)
