analyze two fields at a time with awk - for-loop

I have a file with approximately 1 000 000 fields (tab delimited), but I need to incrementally look at fields in pairs to see if they are identical or different.
Here is 1 line of the file (abbreviated to 6 fields):
C G G G T A
I essentially need to print 1 if the pairs are identical and 2 if the pairs are different, so the output should be:
2 1 2
Is this possible with an awk for loop? Writing out something like awk '{ if ($1==$2) print "1"; else print "2" }' by hand for every pair is simply not viable for the number of fields I have.
Thank you!

You can try:
echo "C G G G T A" |
awk '{
    for (i = 1; i <= NF; i += 2) {
        printf (i < NF-1 ? "%s " : "%s\n"), ($i == $(i+1) ? 1 : 2)
    }
}'
and you get:
2 1 2

I would do it with sed instead, probably much quicker (no splitting):
sed -r 's/(^\S|\s\S)\s/\1/g; s/(\S)\1/1/g; s/\S\S/2/g'
The first s/ groups pairs by removing the space between them.
The second s/ finds the matches.
The third s/ converts the leftovers (mismatches).
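For example, applied to the sample line (assuming GNU sed, for -r and the \s/\S classes):
$ echo "C G G G T A" | sed -r 's/(^\S|\s\S)\s/\1/g; s/(\S)\1/1/g; s/\S\S/2/g'
2 1 2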
Or the equivalent, if your sed does not have -r:
sed 's/^\(\S\)\s/\1/; s/\(\s\S\)\s/\1/g; s/\(\S\)\1/1/g; s/\S\S/2/g'

Related

Keeping the last two fields in an input line in linux

I have the following problem:
I need to process lines structured as follows:
<e_1> <e_2> ... <e_n-1> <e_n>
where each <e_i> (except for <e_n>) is separated from the next by a single space character. The actual number of <e_i> elements in each line is always at least two, but otherwise unpredictable: one line might consist of five such elements, while the next might have twelve.
For each such line I must remove all the elements, except for the last two - e.g. if the input line is
a b c d e
after processing I should end up with the line
d e
What tool accessible from a bash script would allow me to pull this off?
Just use awk to filter the last two columns:
awk '{print $(NF-1), $NF}'
eg:
$ printf 'a b c d e\nf g\na b c\n' | awk '{print $(NF-1), $NF}'
d e
f g
b c
Actually, immediately after posting this I noticed that a combination of rev and cut will do the trick.
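Something along these lines should work (a sketch, assuming the fields are separated by single spaces; file is a placeholder for the input):
rev file | cut -d' ' -f1,2 | rev
rev reverses each line character by character, cut keeps what used to be the last two fields, and the second rev restores their original order.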
A sed one-liner:
sed 's/.* \(.* .*\)$/\1/'
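For instance:
$ echo 'a b c d e' | sed 's/.* \(.* .*\)$/\1/'
d e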

How to sort a file by line length and then alphabetically for the second key?

Say I have a file:
ab
aa
c
aaaa
I would like it to be sorted like this
c
aa
ab
aaaa
That is to sort by line length and then alphabetically. Is that possible in bash?
You can prepend the length of each line, sort numerically, and finally cut the length prefix back off:
< your_file awk '{ print length($0), $0; }' | sort -n | cut -d' ' -f2-
Note that I've accomplished the sorting via sort -n, without any multi-key sorting. Honestly, I was lucky that this worked:
I hadn't considered that lines could begin with numbers, and I expected sort -n to work because alphabetic and numeric sorting give the same result when all the strings have the same length, which is exactly the case here because we are sorting on the line length that I'm prepending via awk.
It turns out everything works even if your input has lines starting with digits, the reason being that sort -n sorts numerically on the leading numeric part of each line and, in case of ties, uses strcmp to compare the whole lines.
Here's some demo:
$ echo -e '3 11\n3 2' | sort -n
3 11
3 2
# the `3 ` on both lines makes them equal for numerical sorting
# but `3 11` comes before `3 2` by `strcmp` because `1` comes before `2`
$ echo -e '3 11\n03 2' | sort -n
03 2
3 11
# the `03 ` vs `3 ` is a numerical tie,
# but `03 2` comes before `3 11` by `strcmp` because `0` comes before `3`
So the lucky part is that the `,` I included in the awk command inserts a space (the OFS, actually), i.e. a non-digit, thus "breaking" the numeric key and letting the strcmp comparison kick in (on the whole lines, which compare equal numerically in this case).
Whether this behavior is POSIX or not, I don't know, but I'm using GNU coreutils 8.32's sort. Refer to this question of mine and this answer on Unix for details.
awk could do it all itself, but I think using sort to sort is more idiomatic (as in: use sort to sort) and more efficient, as explained in a comment (after all, why would you not expect sort to be the best-performing tool in the shell for sorting?).
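Putting it all together on the question's sample input:
$ printf 'ab\naa\nc\naaaa\n' | awk '{ print length($0), $0; }' | sort -n | cut -d' ' -f2-
c
aa
ab
aaaa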
Insert a length for the line using gawk (zero-filled to four places so it will sort correctly), sort by two keys (first the length, then the first word on the line), then remove the length:
gawk '{printf "%04d %s\n", length($0), $0}' | sort -k1 -k2 | cut -d' ' -f2-
If it must be bash:
while read -r line; do printf "%04d %s\n" ${#line} "${line}"; done | sort -k1 -k2 | (while read -r len remainder; do echo "${remainder}"; done)
For GNU awk (4.0 or later, for PROCINFO["sorted_in"] and arrays of arrays):
$ gawk '{
    a[length()][$0]++                          # hash to 2d array
}
END {
    PROCINFO["sorted_in"] = "#ind_num_asc"     # first sort on length dim
    for (i in a) {
        PROCINFO["sorted_in"] = "#ind_str_asc" # and then on data dim
        for (j in a[i])
            for (k = 1; k <= a[i][j]; k++)     # in case there are duplicates
                print j
        # PROCINFO["sorted_in"]="#ind_num_asc" # I don't think this is needed?
    }
}' file
Output (from a test file that also contains some longer, duplicated lines):
c
aa
ab
aaaa
aaaaaaaaaa
aaaaaaaaaa

shell script (with loop) to grep a list of strings one by one

I have a big data text file (more than 100,000 rows) in this format:
0.000197239;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc
0.00118343;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.00276134;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;
0.0607495;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.00670611;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=XDH;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000197239;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=XDH;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000394477;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.0108481;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000394477;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.0108481;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
Now, each row contains a gene name; for example, the first four rows contain the gene CLCNKA. I am using grep to count the frequency of each gene name in this data file:
grep -w "CLCNKA" my_data_file | wc -l
There are about 300 genes in a separate file which are to be searched for in the above data file. Could some expert please write a simple shell script with a loop that takes each gene name from the list one by one and stores its frequency in a separate file? The output file would look like this:
CLCNKA 4
XDH 2
GRK4 4
You've confused us. I and some others think all you want is a count of each gene in the file, since that's what your input/output and some of your descriptive text state ("count the frequency of each gene name in this data file"), which would just be this:
$ awk -F'[=;]' '{cnt[$11]++} END{for (gene in cnt) print gene, cnt[gene]}' file
GRK4 4
CLCNKA 4
XDH 2
while everyone else thinks you want a count only of specific genes listed in a different file, since that's what your subject line, proposed algorithm and the rest of your text state.
If everyone else is right then you'd need this tweak to read the "genes" file first and only count the genes in "file" that were listed in "genes":
awk -F'[=;]' 'NR==FNR{genes[$0]; next} $11 in genes{cnt[$11]++} END{for (gene in cnt) print gene, cnt[gene]}' genes file
GRK4 4
CLCNKA 4
XDH 2
Your example doesn't help, since it would produce the same output under either interpretation of your requirements, so edit your question to clarify what you want. In particular, if there are genes that you do NOT want counted, include lines containing them in the sample input.
awk is your friend
awk '{sub(/^.*Gene\.refGene=/,"");sub(/;.*$/,"");
genelist[$0]++}END{for(i in genelist){print i,genelist[i]}}' file
Output
GRK4 4
CLCNKA 4
XDH 2
Side note: this may not give you the gene frequencies in the order in which the genes appear in the file. I guess that is not a requirement after all.
This can also be done in pure bash, by using the associative array feature to count the frequencies:
#!/bin/bash
# declare assoc array
declare -A freq
# split stdin input csv
for gene in $(cut -d ';' -f 6 | cut -d = -f 2); do
    let freq[$gene]++
done
# loop over array keys
for key in "${!freq[@]}"; do
    echo "${key} ${freq[$key]}"
done
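A hypothetical invocation (assuming the script above is saved as gene_freq.sh and made executable); since bash associative arrays are unordered, the lines may come out in any order:
$ chmod +x gene_freq.sh
$ ./gene_freq.sh < my_data_file
CLCNKA 4
XDH 2
GRK4 4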
A simpler solution relying on the uniq command:
#!/bin/bash
cut -d ';' -f 6 | cut -d = -f 2 | sort | uniq -c | while read -a kv; do
    echo "${kv[1]} ${kv[0]}"
done
Here is a one-liner:
sed "s/.*Gene.refGene=//;s/\;.*//" test | sort | uniq -c | awk '{print $2,$1}'
sed removes everything from the line except the gene name
sort sorts the gene names
uniq -c counts the number of repeats of each gene
awk swaps the columns of the uniq output (by default it prints: count pattern)
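To see what the final awk swap is for, the intermediate output of the pipeline looks roughly like this (uniq -c puts the count first):
$ sed "s/.*Gene.refGene=//;s/\;.*//" test | sort | uniq -c
      4 CLCNKA
      4 GRK4
      2 XDH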
To preserve the order of first appearance, provided the input file is grouped by gene as in the sample:
$ perl -lne '
    ($g) = /Gene\.refGene=([^;]+)/;
    if($g ne $p && $. > 1)
    {
        print "$p\t$c";
        $c = 0;
    }
    $c++; $p = $g;
    END { print "$p\t$c" }' ip.txt
CLCNKA 4
XDH 2
GRK4 4
If not, use a hash keyed by gene name to count occurrences, plus an array to remember the order in which the keys first appear:
$ perl -lne '
    ($k) = /Gene\.refGene=([^;]+)/;
    push(@o, $k) if !$h{$k}++;
    END { print "$_\t$h{$_}" foreach (@o) }' ip.txt
CLCNKA 4
XDH 2
GRK4 4
If you only need to search for a list of genes, an inefficient but straightforward way is:
while read g; do echo -n "$g "; grep -c "$g" file; done < genes
assuming your genes are listed one per line in the genes file.
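To store the counts in a separate file, as the question asks, the same loop can simply redirect its output (gene_counts.txt is just an example name):
while read -r g; do printf '%s %s\n' "$g" "$(grep -cw "$g" my_data_file)"; done < genes > gene_counts.txt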
If your file structure is fixed, a more efficient version is (note that FS=';' placed between the two file names applies only while reading file, so genes is still split on whitespace):
awk 'NR==FNR{genes[$1];next}
     {sub(/Gene.refGene=/,"",$6)}
     $6 in genes{count[$6]++}
     END{for(g in count) print g,count[g]}' genes FS=';' file

Grab nth occurrence in between two patterns using awk or sed

I want to parse through the output in a file and grab the nth occurrence of text between two patterns, preferably using awk or sed. The file looks like this:
category
1
s
t
done
category
2
n
d
done
category
3
r
d
done
category
4
t
h
done
Let's just say for this example I want to grab the third occurrence of text between category and done; essentially the output would be:
category
3
r
d
done
This might work for you (GNU sed):
sed -n '/category/{:a;N;/done/!ba;x;s/^/x/;/^x\{3\}$/{x;p;q};x}' file
Turn off automatic printing by using the -n option. Gather up lines between category and done. Store a counter in the hold space and when it reaches 3 print the collection in the pattern space and quit.
Or if you prefer awk:
awk '/^category/,/^done/{if(++m==1)n++;if(n==3)print;if(/^done/)m=0}' file
Try doing this:
awk -v n=3 '/^category/{l++} (l==n){print}' file.txt
Or more cryptic:
awk -v n=3 '/^category/{l++} l==n' file.txt
If your file is big:
awk -v n=3 '/^category/{l++} l>n{exit} l==n' file.txt
If your file doesn't contain any null characters, here's one way using GNU sed. This will find the third occurrence of a pattern range, but you can easily modify it to get any occurrence you'd like.
sed -n '/^category/ { x; s/^/\x0/; /^\x0\{3\}$/ { x; :a; p; /done/q; n; ba }; x }' file.txt
Results:
category
3
r
d
done
Explanation:
Turn off default printing with the -n switch. Match the word 'category' at the start of a line. Swap the pattern space with the hold space and prepend a null character to the counter kept there. If the counter then contains three null characters (the third occurrence), swap back and enter a loop that prints the pattern space until 'done' is matched, at which point sed quits. Until 'done' is found, sed keeps reading the next line of input and continues the loop.
awk -v tgt=3 '
/^category$/ { fnd=1; rec="" }
fnd {
rec = rec $0 ORS
if (/^done$/) {
if (++cnt == tgt) {
printf "%s",rec
exit
}
fnd = 0
}
}
' file
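For instance, to grab a different block, just change tgt; with -v tgt=2 the same script prints:
category
2
n
d
done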
With GNU awk you can set the record separator to a regular expression:
<file awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
RT is the matched record separator. Note that the record relative to n will be off by one as the first record refers to what precedes the first RS.
Edit
As per Ed's comment, this will not work when the records have other data in between them, e.g.:
category
1
s
t
done
category
2
n
d
done
foo
category
3
r
d
done
bar
category
4
t
h
done
One way to get around this is to clean up the input with a second (or first) awk:
<file awk '/^category$/,/^done$/' |
awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
Edit 2
As Ed has noted in the comments, the above methods do not search for the ending pattern. One way to do this, which hasn't been covered by the other answers, is with getline (note that there are some caveats with awk getline):
<file awk '
/^category$/ {
v = $0
while(!/^done$/) {
if(!getline)
exit
v = v ORS $0
}
if(++nr == n)
print v
}' n=3
On one line:
<file awk '/^category$/ { v = $0; while(!/^done$/) { if(!getline) exit; v = v ORS $0 } if(++nr == n) print v }' n=3

Comparing values in two files

I am comparing two files, each having one column and an arbitrary number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If a line of file 1 is present in file 2 it should output 1, otherwise 0, in a tab-separated file.
Something like this:
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
The above code is not giving me the output I am looking for.
Kindly have a look and suggest a correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that the line has been seen. For file1, the second action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on whether the line was seen in file2. Note that the match is exact, so with the sample data the lowercase "alex" will not match "Alex"; see the tolower() approach below for a case-insensitive match.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
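A quick check against the sample files:
$ awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
vincy 0
alex 1
robin 1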
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
    grep -xF -f file2 file1 | sed $'s/$/\t1/'
    grep -vxF -f file2 file1 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then swap the numeric column with something like awk (with duplicate-free inputs, a count of 2 means the line appears in both files and 1 means it appears in only one):
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read; do
    if [[ $REPLY = $'\t'* ]] ; then
        printf "%s\t1\n" "${REPLY#?}"
    else
        printf "%s\t0\n" "${REPLY}"
    fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
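With the sample files this produces (note that the lines come out lowercased and in sorted order, not in file1's original order):
alex	1
robin	1
vincy	0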
See also BashFAQ #36, which is directly on-point.
Another solution, if you have Python installed.
If you're familiar with Python, only a small amount of formatting work is needed.
#!/usr/bin/python
f1 = [line.strip() for line in open('file1')]
f2 = [line.strip() for line in open('file2')]
f1_in_f2 = [int(x in f2) for x in f1]
for n, c in zip(f1, f1_in_f2):
    print n, c
