BASH - Summarising information from several fields into a unique field using loop and if statements

I have the following tab-separated file:
A1 A1 0 0 2 1 1 1 1 1 1 1 2 1 1 1
A2 A2 0 0 2 1 1 1 1 1 1 1 1 1 1 1
A3 A3 0 0 2 2 1 1 2 2 1 1 1 1 1 1
A5 A5 0 0 2 2 1 1 1 1 1 1 1 2 1 1
The idea is to summarise the information from column 7 (inclusive) to the end into a new column added at the end of the file.
To do so, these are the rules:
If the total number of "2"s in the row (between column 7 and the end) is 0: add "1 1" to the new last column
If the total number of "2"s in the row (between column 7 and the end) is 1: add "1 2" to the new last column
If the total number of "2"s in the row (between column 7 and the end) is 2 or more: add "2 2" to the new last column
I started by extracting the columns I want to work on, using the command:
awk '{for (i = 7; i <= NF; i++) printf $i " "; print ""}' myfile.ped > tmp_myfile.txt
Then I count the number of occurrences in each row using:
sed 's/[^2]//g' tmp_myfile.txt | awk '{print NR, length}' > tmp_occurences.txt
Which outputs:
1 1
2 0
3 2
4 1
Then my idea was to write a for loop that loops through the lines to add the new summary column.
I was thinking of this kind of structure, based on what I found here: http://www.thegeekstuff.com/2010/06/bash-if-statement-examples:
while read line ;
do
set $line
If ["$2"==0]
then
$3=="1 1"
elif ["$2"==1 ]
then
$3=="1 2”
elif ["$2">=2 ]
then
$3=="2 2"
else
print ["error"]
fi
done < tmp_occurences.txt
But I am stuck here. Do I have to create the new column before starting the loop? Am I going in the right direction?
Ideally, the final output (after merging the first 6 columns from the initial file and the summary column) would be:
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2
Thank you for your help!
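You're on the right track conceptually, but the shell loop has several problems: if must be lowercase, [ needs spaces around its arguments, == inside [ ] is not portable, and positional parameters such as $3 cannot be assigned to. A corrected sketch of that two-step approach (the name tmp_summary.txt is only illustrative) could look like:
while read -r lineno count; do          # tmp_occurences.txt holds "<row> <count>" pairs
    if [ "$count" -eq 0 ]; then
        echo "1 1"
    elif [ "$count" -eq 1 ]; then
        echo "1 2"
    else
        echo "2 2"
    fi
done < tmp_occurences.txt > tmp_summary.txt
# glue the first 6 (tab-separated) columns of the original file to the summary column
paste <(cut -f1-6 myfile.ped) tmp_summary.txt
There is no need to create the new column in advance, though; as the answers below show, awk can do the counting and the appending in a single pass.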

Using gnu-awk you can do:
awk -v OFS='\t' '{
  c = 0
  for (i = 7; i <= NF; i++)
    if ($i == 2)
      c++
  if (c == 0)
    s = "1 1"
  else if (c == 1)
    s = "1 2"
  else
    s = "2 2"
  NF = 6
  print $0, s
}' file
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2
PS: Truncating the record by assigning to NF is not guaranteed to work in every awk, so if not using gnu-awk you can blank the extra fields instead:
awk -v OFS='\t' '{c=0; for (i=7; i<=NF; i++) {if ($i==2) c++; $i=""} if (c==0) s="1 1"; else if (c==1) s="1 2"; else s="2 2"; NF=6; print $0, s}' file

With GNU awk for the 3rd arg to match():
$ awk '{match($0,/((\S+\s+){6})(.*)/,a); c=gsub(2,2,a[3]); print a[1] (c>1?2:1), (c>0?2:1)}' file
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2
With other awks you'd replace \S/\s with [^[:space:]]/[[:space:]] and use substr() instead of a[].
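For example, a POSIX-flavoured sketch of the same approach, using a bracket-expression regex and substr() with RSTART/RLENGTH:
awk '{
  match($0, /^([^[:space:]]+[[:space:]]+){6}/)   # locate the first 6 fields
  pre  = substr($0, 1, RLENGTH)                  # fields 1-6 plus trailing separator
  rest = substr($0, RLENGTH + 1)                 # fields 7..NF
  c = gsub(/2/, "2", rest)                       # gsub returns the number of 2s
  print pre (c > 1 ? 2 : 1), (c > 0 ? 2 : 1)
}' file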

We can keep the format by using gensub() and capturing groups: we capture the 6 first fields and replace with them + the calculated values:
awk '{for (i=7; i<=NF; i++) {
        if ($i==2)
          twos+=1              # count the number of 2s from the 7th field to the last
      }
      f7=1; f8=1               # defaults for the new 7th and 8th fields ("1 1": no 2s)
      if (twos)
        f8=2                   # set 8th = 2 if the count is > 0
      if (twos>1)
        f7=2                   # set 7th = 2 if the count is > 1
      $0=gensub(/^((\S+\s*){6}).*/, "\\1 " f7 FS f8, 1)   # perform the replacement
      twos=0                   # reset the counter for the next record
}1' file
As a one-liner:
$ awk '{for (i=7; i<=NF; i++) {if ($i==2) twos+=1} f7=1; f8=1; if (twos) f8=2; if (twos>1) f7=2; $0=gensub(/^((\S+\s*){6}).*/,"\\1 " f7 FS f8,1); twos=0}1' file
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2

$ cat > test.awk
{
for(i=1;i<=NF;i++) { # for every field
if(i<7)
printf "%s%s", $i,OFS # only output the first 6
else a[$i]++ # count the values of the remaining fields
}
print (a[2]>1?"2 2":(a[2]==1?"1 2":"1 1")) # output logic
delete a # reset a for next record
}
$ awk -f test.awk test
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2
Borrowing some ideas from @anubhava's solution above:
$ cat > another.awk
{
for(i=7;i<=NF;i++)
a[$i]++ # count every value; only a[2] (the number of 2s) is used below
NF=6 # truncate $0
print $0 OFS (a[2]<2?"1 "(a[2]?"2":"1"):"2 2") # append "2 2", "1 2" or "1 1" depending on the count of 2s
delete a # reset a for next record
}
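Running it the same way gives the expected result:
$ awk -f another.awk test
A1 A1 0 0 2 1 1 2
A2 A2 0 0 2 1 1 1
A3 A3 0 0 2 2 2 2
A5 A5 0 0 2 2 1 2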

Related

Count and percentage

Using columns 4 and 2, I create a report like the output shown below. My code works fine, but I believe it can be made shorter :).
I have a doubt about the split part:
CNTLM = split ("20,30,40,60", LMT)
It works, but it would be better to use exactly the values "10,20,30,40" that appear in column 4.
4052538693,2910,04-May-2018-22,10
4052538705,2910,04-May-2018-22,10
4052538717,2910,04-May-2018-22,10
4052538729,2911,04-May-2018-22,20
4052538741,2911,04-May-2018-22,20
4052538753,2912,04-May-2018-22,20
4052538765,2912,04-May-2018-22,20
4052538777,2914,04-May-2018-22,10
4052538789,2914,04-May-2018-22,10
4052538801,2914,04-May-2018-22,30
4052539029,2914,04-May-2018-22,20
4052539041,2914,04-May-2018-22,20
4052539509,2915,04-May-2018-22,30
4052539521,2915,04-May-2018-22,30
4052539665,2915,04-May-2018-22,30
4052539677,2915,04-May-2018-22,10
4052539689,2915,04-May-2018-22,10
4052539701,2916,04-May-2018-22,40
4052539713,2916,04-May-2018-22,40
4052539725,2916,04-May-2018-22,40
4052539737,2916,04-May-2018-22,40
4052539749,2916,04-May-2018-22,40
4052539761,2917,04-May-2018-22,10
4052539773,2917,04-May-2018-22,10
Here is the code I use to get the desired output:
printf " Code 10 20 30 40 Total\n" > header
dd=`cat header | wc -L`
awk -F"," '
BEGIN {CNTLM = split ("20,30,40,60", LMT)
cmdsort = "sort -nr"
DASHES = sprintf ("%0*d", '$dd', _)
gsub (/0/, "-", DASHES)
}
{for (IX=1; IX<=CNTLM; IX++) if ($4 <= LMT[IX]) break
CNT[$2,IX]++
COLTOT[IX]++
LNC[$2]++
TOT++
}
END {
print DASHES
for (l in LNC)
{printf "%5d", l | cmdsort
for (IX=1; IX<=CNTLM; IX++) {printf "%9d", CNT[l,IX]+0 | cmdsort
}
printf " = %6d" RS, LNC[l] | cmdsort
}
close (cmdsort)
print DASHES
printf "Total"
for (IX=1; IX<=CNTLM; IX++) printf "%9d", COLTOT[IX]+0
printf " = %6d" RS, TOT
print DASHES
printf "PCT "
for (IX=1; IX<=CNTLM; IX++) printf "%9.1f", COLTOT[IX]/TOT*100
printf RS
print DASHES
}
' file
The output I get:
Code 10 20 30 40 Total
----------------------------------------------------
2917 2 0 0 0 = 2
2916 0 0 0 5 = 5
2915 2 0 3 0 = 5
2914 2 2 1 0 = 5
2912 0 2 0 0 = 2
2911 0 2 0 0 = 2
2910 3 0 0 0 = 3
----------------------------------------------------
Total 9 6 4 5 = 24
----------------------------------------------------
PCT 37.5 25.0 16.7 20.8
----------------------------------------------------
I'd appreciate it if the code can be improved.
without the header and cosmetics...
$ awk -F, '{a[$2,$4]++; k1[$2]; k2[$4]}
END{for(r in k1)
{printf "%5s", r;
for(c in k2) {k1[r]+=a[r,c]; k2[c]+=a[r,c]; printf "%10d", OFS a[r,c]+0}
printf " =%7d\n", k1[r]};
printf "%5s", "Total";
for(c in k2) {sum+=k2[c]; printf "%10d", k2[c]}
printf " =%7d", sum}' file | sort -nr
2917 2 0 0 0 = 2
2916 0 0 0 5 = 5
2915 2 0 3 0 = 5
2914 2 2 1 0 = 5
2912 0 2 0 0 = 2
2911 0 2 0 0 = 2
2910 3 0 0 0 = 3
Total 9 6 4 5 = 24
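Because the counts are keyed on the actual $4 values (k2[$4]), the buckets are exactly the values occurring in column 4, which addresses the doubt about split("20,30,40,60", LMT) above. The PCT row can be reproduced with the same idea; a minimal sketch:
$ awk -F, '{cnt[$4]++; tot++} END {for (v in cnt) printf "%s %.1f\n", v, cnt[v]/tot*100}' file | sort -n
10 37.5
20 25.0
30 16.7
40 20.8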

Calculating the sum of every third column from many files

I have many files with three columns in a form of:
file1 | file2
1 0 1 | 1 0 2
2 3 3 | 2 3 7
3 6 2 | 3 6 0
4 1 0 | 4 1 3
5 2 4 | 5 2 1
The first two columns are the same in each file. I want to calculate the sum of the 3rd column across all files, to receive something like this:
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
For two files, awk 'FNR==NR { _a[FNR]=$3;} NR!=FNR { $3 += _a[FNR]; print; }' file*
works perfectly (I found this solution via Google). How do I extend it to many files?
All you need is:
awk '{sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*
The above uses GNU awk for ARGIND. With other awks just add FNR==1{ARGIND++} at the start.
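That is, the portable (non-GNU) version would be:
awk 'FNR==1{ARGIND++} {sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*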
Since the first two columns are same in each file:
awk 'NR==FNR{b[FNR]=$1 FS $2;}{a[FNR]+=$3}END{for(i=1;i<=length(a);i++){print b[i] FS a[i];}}' file*
Array a holds the cumulative sum of the 3rd column across all files.
Array b stores the 1st and 2nd column values.
At the end, we print the contents of arrays b and a. (Note that length() on an array is a GNU awk extension.)
file1
$ cat f1
1 0 1
2 3 3
3 6 2
4 1 0
5 2 4
file2
$ cat f2
1 0 2
2 3 7
3 6 0
4 1 3
5 2 1
Output
$ awk -v start=3 'NF{for(i=1; i<=NF; i++)a[FNR, i] = i>=start ? a[FNR, i]+$i : $i }END{ for(j=1; j<=FNR; j++){ s = ""; for(i=1; i<=NF; i++){ s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "") } print s } }' f1 f2
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
More readable version:
The variable start decides from which column summing starts: if you set it to 2, it will sum column 2, column 3, and so on, across all files. Since every file has the same number of fields and rows, this works well.
awk -v start=3 '
NF{
for(i=1; i<=NF; i++)
a[FNR, i] = i>=start ? a[FNR, i]+$i : $i
}
END{
for(j=1; j<=FNR; j++)
{
s = "";
for(i=1; i<=NF; i++)
{
s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "")
}
print s
}
}
' f1 f2
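For example, with start=2 the 2nd and 3rd columns are both summed across the two sample files:
$ awk -v start=2 'NF{for(i=1; i<=NF; i++)a[FNR, i] = i>=start ? a[FNR, i]+$i : $i }END{ for(j=1; j<=FNR; j++){ s = ""; for(i=1; i<=NF; i++){ s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "") } print s } }' f1 f2
1 0 3
2 6 10
3 12 2
4 2 3
5 4 5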

How to loop over a file having multiple columns to execute an AWK script?

I have a file with multiple columns (greater than 1000). Each column has numbers 0, 1 or some other value. The tab-delimited file looks like:
0 0 0
0 0 0
1 2 0
1 0 0
1 1 0
I want to calculate the occurrence of each unique digit for each column in the file. How do I do that using AWK or shell?
P.S. To calculate the occurrence of each unique digit in the first column, I used the AWK code:
awk '{h[$1]++}; END {for (k in h) print k, h[k]}' file > output-file
It gives the results as:
0 2
1 3
which means 0 occurs twice in column 1 and 1 occurs thrice in column 1.
I want to do the same for a file having over 1000 columns.
You just need to make the keys for associative array h contain both column number, i, and column value, $i:
$ awk '{for (i=1;i<=NF;i++) h[i" "$i]++}; END {for (k in h) print k, h[k]}' file | sort -n
1 0 2
1 1 3
2 0 3
2 1 1
2 2 1
3 0 5
The last line above indicates that column 3 has the value 0 occurring 5 times.
In more detail:
for (i=1;i<=NF;i++) h[i" "$i]++
This loops over all columns from the first, i=1, to the last, i=NF. For each column, it increments the counter in h keyed by the column number and its value.
END {for (k in h) print k, h[k]}
This prints a table of the output.
sort -n
Because for (k in h) does not produce keys in any particular order, we put the output through sort.
With GNU awk 4.0 2D arrays:
Sample input matrix of n=3 columns containing integer values:
0 0 0
0 0 0
1 2 0
1 0 0
1 1 0
4 0 0
7 -1 -2
The output is a vector of the distinct data values occurring anywhere in the input, followed by a matrix of n=3 columns giving, for each value, its count in the corresponding column of the input matrix:
-1 0 1 0
-2 0 0 1
0 2 4 6
1 3 1 0
2 0 1 0
4 1 0 0
7 1 0 0
Code:
awk '
BEGIN {PROCINFO["sorted_in"] = "@ind_str_asc"}  # gawk: scan indices in sorted order, matching the output above
NR==1 {ncols=NF}
{for(i=1; i <=NF; ++i) ++c[$i][i-1]}            # c[value][column-1]: per-column counts
END{
for(i in c) {
printf("%d ", i)
for(j=0; j < ncols; ++j) {
printf("%d ", j in c[i]?c[i][j]: 0)
}
printf("\n")
}
}
'

How to subtract line by line from two different files

I have two files as below:
file1 (10 lines):
1 0 1 0 1 1 1 1 1
1 1 1 2 3 4 5 1 1
......
file2 (4 lines):
1 2 3 1 1 1 1 1 1
1 2 1 1 1 1 1 3 1
.......
I want to take, for each row in file1, the difference between every row of file2 and that row of file1, and save the result to a separate file file3$i, which may look like:
for 1st row of file1
file31 (4 lines) :
0 2 2 1 0 0 0 0 0
0 2 0 1 0 0 0 2 0
..........
To be more clear.
For the 1st row of file1 the output is like:
1st row of file2 - 1st row of file1
2nd row of file2 - 1st row of file1
3rd row of file2 - 1st row of file1
4th row of file2 - 1st row of file1
For the 2nd row of file1 the output is like:
1st row of file2 - 2nd row of file1
2nd row of file2 - 2nd row of file1
3rd row of file2 - 2nd row of file1
4th row of file2 - 2nd row of file1
and so on for the rest of the rows in file1.
I tried using awk but could not find a smart solution.
You can use this awk command:
awk 'FNR==NR {                  # first pass: slurp file2
    for(i=1; i<=NF; i++)
        a[FNR,i]=$i             # store every field of file2
    nr1=FNR                     # remember how many rows file2 has
    next
} {                             # second pass: one row of file1 at a time
    out="outfile" FNR           # a separate output file per row of file1
    for(r=1; r<=nr1; r++) {
        for(i=1; i<=NF; i++) {
            v = a[r,i] - $i     # row r of file2 minus the current row of file1
            printf "%s%s", (v>0?v:0), (i==NF)?ORS:OFS > out   # clamp negatives to 0
        }
    }
    close(out)
}' file2 file1
This command writes a separate output file for each row in file1, named outfile1, outfile2, outfile3, etc. Note that (v>0?v:0) clamps negative differences to 0, which matches the desired output.
Output:
cat outfile1
0 2 2 1 0 0 0 0 0
0 2 0 1 0 0 0 2 0
cat outfile2
0 1 2 0 0 0 0 0 0
0 1 0 0 0 0 0 2 0

Help with duplicating rows based on a field using awk

I have the following data set, whose 3rd field consists of 0s and 1s.
Input
1 2 1
2 4 0
3 3 1
4 1 1
5 0 0
I wish to expand the data set to the following format
Duplicate each row based on the 2nd field and
Replace only the "new" 1's (obtain after duplication) in the 3rd field by 0
How can I do this with AWK?
Thanks
Output
1 2 1
1 2 0
2 4 0
2 4 0
2 4 0
2 4 0
3 3 1
3 3 0
3 3 0
4 1 1
awk '{print; $3=0; for (i=1; i<$2; i++) print}' inputfile
If you want to actually skip records with a zero in the second field (as your example seems to show):
awk '{if ($2>0) print; $3=0; for (i=1; i<$2; i++) print}' inputfile
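A commented expansion of that second one-liner:
awk '{
    if ($2 > 0)               # skip records whose 2nd field is 0 (e.g. "5 0 0")
        print                 # print the original row once, 3rd field intact
    $3 = 0                    # every duplicate gets 0 in the 3rd field
    for (i = 1; i < $2; i++)
        print                 # emit $2 - 1 duplicates
}' inputfile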
