awk with nested if else - bash

I have a tab delimited two column data. I want to get the third based on the condition applied on second column.
if second column is not equal to zero it should print col 1 and 3 and ratio of col1/col2
if col two is zero and col one is more than 15 than it should print col 1 and col2 and the value in col1 (in col 3) else (when col1<=15 & col2 is 0) it should print col1 col2 and 0.
for example, for a file like this
1 2
4 5
6 7
14 0
18 0
the output should be
1 2 0.5
4 5 0.8
6 7 0.85
14 0 0
18 0 18
What I have tried:
awk '{if ($2!=0) print $1 "\t" $2 "\t" $1/$2; elseif($2>15) print $1 "\t" $2 "\t" $1 ; else print $1 "\t" $2 "\t" $2}'<tags| head
Obviously I am doing something wrong, please help me in getting the above code right.
Thank you

Slightly different way:
awk '{if($2!=0) $3=$1/$2; else if($1>15) $3=$1; else $3=0}1' OFS='\t' file
Determined by the order of the if clause:
awk '{$3=0} $1>15{$3=$1} $2{$3=$1/$2}1' OFS='\t' file
or the cryptic version:
awk '{$3=$2?$1/$2:$1>15?$1:0}1' OFS='\t' file

a funny but unreadable(maybe) :) one-liner:
awk '{$0=$2?$1FS$2FS$1/$2:$1>15?$1FS$2FS$1:$1FS$2FS"0"}1' file
short explaination:
a=boolean? first : second
this means assign var a, if boolean true, using value first, otherwise use value second.
I set `$0 = $2? FOO : BAR`
FOO part: $1 FS $2 FS $1/$2
BAR part: $1>15? FOO2 : BAR2
FOO2 part: $1 FS $2 FS $1
BAR2 part: $1 FS $2 FS "0"
finally, print $0
Problem in your code
chang elseif -> else if also check $1 with 15, not $2 then your oneliner works too.

Here's another alternative:
awk '!$2 { $3 = $1>15 ? $1 : 0 } $2 { $3 = $1/$2 } 1' OFS='\t' CONVFMT='%.2g'
Output:
1 2 0.5
4 5 0.8
6 7 0.86
14 0 0
18 0 18

awk '{$3=$1>=15 && $2==0?$1:$1<15 && $2==0?0:$1/$2}1' your_file

Related

Awk if else with conditions

I am trying to make a script (and a loop) to extract matching lines to print them into a new file. There are 2 conditions: 1st is that I need to print the value of the 2nd and 4th columns of the map file if the 2nd column of the map file matches with the 4th column of the test file. The 2nd condition is that when there is no match, I want to print the value in the 2nd column of the test file and a zero in the second column.
My test file is made this way:
8 8:190568 0 190568
8 8:194947 0 194947
8 8:197042 0 197042
8 8:212894 0 212894
My map file is made this way:
8 190568 0.431475 0.009489
8 194947 0.434984 0.009707
8 19056880 0.395066 112.871160
8 101908687 0.643861 112.872348
1st attempt:
for chr in {21..22};
do
awk 'NR==FNR{a[$2]; next} {if ($4 in a) print $2, $4 in a; else print $2, $4 == "0"}' map_chr$chr.txt test_chr$chr.bim > position.$chr;
done
Result:
8:190568 1
8:194947 1
8:197042 0
8:212894 0
My second script is:
for chr in {21..22}; do
awk 'NR == FNR { ++a[$4]; next }
$4 in a { print a[$2], $4; ++found[$2] }
END { for(k in a) if (!found[k]) print a[k], 0 }' \
"test_chr$chr.bim" "map_chr$chr.txt" >> "position.$chr"
done
And the result is:
1 0
1 0
1 0
1 0
The result I need is:
8:190568 0.009489
8:194947 0.009707
8:197042 0
8:212894 0
This awk should work for you:
awk 'FNR==NR {map[$2]=$4; next} {print $4, map[$4]+0}' mapfile testfile
190568 0.009489
194947 0.009707
197042 0
212894 0
This awk command processes mapfile first and stores $2 as key with $4 as a value in an associative array named as map.
Later when it processes testfile in 2nd block we print $4 from 2nd file with the stored value in map using key as $4. We add 0 in stored value to make sure that we get 0 when $4 is not present in map.

Skip lines starting with a character and delete lines matching second column lesser than a value

I have file with following format :
Qil
Lop
A D E
a 1 10
b 2 21
c 3 22
d 4 5
3 5 9
I need to skip lines that start with pattern 'Qil' or 'Lop' or 'A D E' and ones where the third column has a value greater than 10 and save the entire thing in 2 different files with formats as shown below.
Example output files :
Output file 1
Qil
Lop
A D E
a 1 10
d 4 5
3 5 9
Output file 2
a
d
3
My code :
while read -r line; if [[ $line == "A" ]] ||[[ $line == "Q" ]]||[[ $line == "L" ]] ; then
awk '$2 < "11" { print $0 }' test.txt
awk '$2 < "11" { print $1 }' test1.txt
done < input.file
Could you please try following.
awk '
/^Qil$|^Lop$|^A D E$/{
val=(val?val ORS:"")$0
next
}
$3<=10{
if(!flag){
print val > "file1"
flag=1
}
print > "file1"
if(!a[$1]++){
print $1> "file2"
}
}' Input_file
This will create 2 output files named file1 and file2 as per OP's requirements.
This can be done in a single awk:
awk '$1 !~ /^[QLA]/ && $2 <= 10' file
1 10
4 5
5 9
If you want to print only first column then use:
awk '$1 !~ /^[QLA]/ && $2 <= 10 { print $1 }' file
1
4
5

How to calculate the mean of row from csv file from nth column?

This may look like a duplicate but I could not solve the issue I'm having.
I'm trying to find the average of each column from a CSV/TSV file the data looks like below:
input.tsv
ID source random text val1 val2 val3 val4 val330
1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89
2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29
output.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
or at least
ID Avg
1 0.606
2 0.404
I tried a suggestion from here
awk 'NR==1{next}
{printf("%s\t", $1
printf("%.2f\n", ($5 + $6 + $7)/3}' input.tsv
which threw error.
and
awk '{ s = 4; for (i = 5; i <= NF; i++) s += $i; print $1, (NF > 1) ? s / (NF - 1) : 0; }' input.tsv
the below code also threw a syntax error
for i in `cat input.tsv` do; VALUES=`echo $i | tr '\t' '\t'`;COUNT=0;SUM=0;typeset -i j;IFS=' ';for j in $VALUES; do;SUM=`expr $SUM + $j`;COUNT=`expr $COUNT + 1`;done;AVG=`expr $SUM / $COUNT`;echo $AVG;done
help me resolve the issue to calculate the average of the row
From you code reference:
awk 'NR==1{next}
{
# missing the last ). This print the 1st column
#printf("%s\t", $1
printf("%s\t", $1 )
# missing the last ) and average of 3 colum only
#printf("%.2f\n", ($5 + $6 + $7)/3
printf("%.2f\n", ($5 + $6 + $7 + $8 + $9) / 5 )
}' input.tsv
Your second code is not easy work with , lot of subshell (backtic) and shell loop but most of all, i think it is made for working with integer value and for full line of value (not 5- > 9). Forget it unless you don't want awk in this case.
for fun
awk 'NR==1{
# Header
print $0 OFS "Avg"
Count = NF - 5
next
}
{
# print each element of the line + sum after col 4
for( i=Avg=0;i<=NF;i++) {
if( i >=5 ) Avg+= $i
printf( "%s ", $i)
}
# print average
printf( "%.2f\n", Avg/Count )
}
' input.tsv
Assuming here that it is always counting on the full stack of value, we can change the Count by (NF - 4) if less value are on the line and empty are not counting
You could use this awk script:
awk 'NR>1{
for(i=5;i<=NF;i++)
sum+=$i
}
{
print $1,$2,$3,$4,(NF>4&&sum!=""?sum/(NF-4):(NR==1?"Avg":""))
sum=0
}' file | column -t
The first block gets the sum of all ids starting from the 5th element.
The second block, prints the header line and the average value.
column -t displays the result in column.
This would be working as expected:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ s=0; for(i=5;i<=NF;++i) s+=$i }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
or just for the fun of it, if you want to make the for-loop obfuscated:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ for(s=!(i=5);i<=NF;s+=$(i++)) {} }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
$ cat tst.awk
NR == 1 { avg = "Avg" }
NR > 1 {
sum = cnt = 0
for (i=5; i<=NF; i++) {
sum += $i
cnt++
}
avg = (cnt ? sum / cnt : 0)
}
{ print $1, $2, $3, $4, avg }
$ awk -f tst.awk file
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
Using Perl one-liner
> perl -lane '{ $s=0;foreach(#F[4..8]){$s+=$_} $F[4]=$s==0?"Avg":$s/5;print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]" } ' input.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
>

Bash: extract columns with cut and filter one column further

I have a tab-separated file and want to extract a few columns with cut.
Two example line
(...)
0 0 1 0 AB=1,2,3;CD=4,5,6;EF=7,8,9 0 0
1 1 0 0 AB=2,1,3;CD=1,1,2;EF=5,3,4 0 1
(...)
What I want to achieve is to select columns 2,3,5 and 7, however from column 5 only CD=4,5,6.
So my expected result is
0 1 CD=4,5,6; 0
1 0 CD=1,1,2; 1
How can I use cut for this problem and run grep on one of the extracted columns? Any other one-liner is of course also fine.
here is another awk
$ awk -F'\t|;' -v OFS='\t' '{print $2,$3,$6,$NF}' file
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
or with cut/paste
$ paste <(cut -f2,3 file) <(cut -d';' -f2 file) <(cut -f7 file)
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
Easier done with awk. Split the 5th field using ; as the separator, and then print the second subfield.
awk 'BEGIN {FS="\t"; OFS="\t"}
{split($5, a, ";"); print $2, $3, a[2]";", $7 }' inputfile > outputfile
If you want to print whichever subfield begins with CD=, use a loop:
awk 'BEGIN {FS="\t"; OFS="\t"}
{n = split($5, a, ";");
for (i = 1; i <= n; i++) {
if (a[i] ~ /^CD=/) subfield = a[i];
}
print $2, $3, subfield";", $7}' < inputfile > outputfile
I think awk is the best tool for this kind of task and the other two answers give you good short solutions.
I want to point out that you can use awk's built-in splitting facility to gain more flexibility when parsing input. Here is an example script that uses implicit splitting:
parse.awk
# Remember second, third and seventh columns
{
a = $2
b = $3
d = $7
}
# Split the fifth column on ";". After this the positional variables
# (e.g. $1, # $2, ..., $NF) contain the fields from the previous
# fifth column
{
oldFS = FS
FS = ";"
$0 = $5
}
# For example to test if the second elemnt starts with "CD", do
# something like this
$2 ~ /^CD/ {
c = $2
}
# Print the selected elements
{
print a, b, c, d
}
# Restore FS
{
FS = oldFS
}
Run it like this:
awk -f parse.awk FS='\t' OFS='\t' infile
Output:
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1

Grouping elements by two fields on a space delimited file

I have this ordered data by column 2 then 3 and then 1 in a space delimited file (i used linux sort to do that):
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
I want to create a new file (leaving the old file as is)
0 2 0,1,2
1 4 1,2
Basically put the fields 2 and 3 first and group the elements of field 1 (as a comma separated list) by them. Is there a way to do that by an awk, sed, bash one liner, so to avoid writing a Java, C++ app for that?
Since the file is already ordered, you can print the line as they change:
awk '
seen==$2 FS $3 { line=line "," $1; next }
{ if(seen) print seen, line; seen=$2 FS $3; line=$1 }
END { print seen, line }
' file
0 2 0,1,2
1 4 1,2
This will preserve the order of output.
with your input and output this line may help:
awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}
{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' file
test:
kent$ cat f
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
kent$ awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' f
0 2 0,1,2
1 4 1,2
awk 'a[$2, $3]++ { p = p "," $1; next } p { print p } { p = $2 FS $3 FS $1 } END { if (p) print p }' file
Output:
0 2 0,1,2
1 4 1,2
The solution assumes data on second and third column is sorted.
Using awk:
awk '{k=$2 OFS $3} !(k in a){a[k]=$1; b[++n]=k; next} {a[k]=a[k] "," $1}
END{for (i=1; i<=n; i++) print b[i],a[b[i]]}' file
0 2 0,1,2
1 4 1,2
Yet another take:
awk -v SUBSEP=" " '
{group[$2,$3] = group[$2,$3] $1 ","}
END {
for (g in group) {
sub(/,$/,"",group[g])
print g, group[g]
}
}
' file > newfile
The SUBSEP variable is the character that joins strings in a single-dimensional awk array.
http://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html#Multidimensional
This might work for you (GNU sed):
sed -r ':a;$!N;/(. (. .).*)\n(.) \2.*/s//\1,\3/;ta;s/(.) (.) (.)/\2 \3 \1/;P;D' file
This appends the first column of the subsequent record to the first record until the second and third keys change. Then the fields in the first record are re-arranged and printed out.
This uses the data presented but can be adapted for more complex data.

Resources