Using columns 4 and 2, I want to create a report like the output shown below. My code works fine, but I believe it can be made shorter :).
I have a doubt about the split part:
CNTLM = split ("20,30,40,60", LMT)
It works, but it would be better to use exactly the values "10,20,30,40" that appear in column 4.
4052538693,2910,04-May-2018-22,10
4052538705,2910,04-May-2018-22,10
4052538717,2910,04-May-2018-22,10
4052538729,2911,04-May-2018-22,20
4052538741,2911,04-May-2018-22,20
4052538753,2912,04-May-2018-22,20
4052538765,2912,04-May-2018-22,20
4052538777,2914,04-May-2018-22,10
4052538789,2914,04-May-2018-22,10
4052538801,2914,04-May-2018-22,30
4052539029,2914,04-May-2018-22,20
4052539041,2914,04-May-2018-22,20
4052539509,2915,04-May-2018-22,30
4052539521,2915,04-May-2018-22,30
4052539665,2915,04-May-2018-22,30
4052539677,2915,04-May-2018-22,10
4052539689,2915,04-May-2018-22,10
4052539701,2916,04-May-2018-22,40
4052539713,2916,04-May-2018-22,40
4052539725,2916,04-May-2018-22,40
4052539737,2916,04-May-2018-22,40
4052539749,2916,04-May-2018-22,40
4052539761,2917,04-May-2018-22,10
4052539773,2917,04-May-2018-22,10
Here is the code I use to get the desired output.
printf " Code 10 20 30 40 Total\n" > header
dd=`cat header | wc -L`
awk -F"," '
BEGIN {CNTLM = split ("20,30,40,60", LMT)
cmdsort = "sort -nr"
DASHES = sprintf ("%0*d", '$dd', _)
gsub (/0/, "-", DASHES)
}
{for (IX=1; IX<=CNTLM; IX++) if ($4 <= LMT[IX]) break
CNT[$2,IX]++
COLTOT[IX]++
LNC[$2]++
TOT++
}
END {
print DASHES
for (l in LNC)
{printf "%5d", l | cmdsort
for (IX=1; IX<=CNTLM; IX++) {printf "%9d", CNT[l,IX]+0 | cmdsort
}
printf " = %6d" RS, LNC[l] | cmdsort
}
close (cmdsort)
print DASHES
printf "Total"
for (IX=1; IX<=CNTLM; IX++) printf "%9d", COLTOT[IX]+0
printf " = %6d" RS, TOT
print DASHES
printf "PCT "
for (IX=1; IX<=CNTLM; IX++) printf "%9.1f", COLTOT[IX]/TOT*100
printf RS
print DASHES
}
' file
The output I got:
Code 10 20 30 40 Total
----------------------------------------------------
2917 2 0 0 0 = 2
2916 0 0 0 5 = 5
2915 2 0 3 0 = 5
2914 2 2 1 0 = 5
2912 0 2 0 0 = 2
2911 0 2 0 0 = 2
2910 3 0 0 0 = 3
----------------------------------------------------
Total 9 6 4 5 = 24
----------------------------------------------------
PCT 37.5 25.0 16.7 20.8
----------------------------------------------------
I would appreciate it if the code can be improved.
without the header and cosmetics...
$ awk -F, '{a[$2,$4]++; k1[$2]; k2[$4]}
END{for(r in k1)
{printf "%5s", r;
for(c in k2) {k1[r]+=a[r,c]; k2[c]+=a[r,c]; printf "%10d", a[r,c]+0}
printf " =%7d\n", k1[r]};
printf "%5s", "Total";
for(c in k2) {sum+=k2[c]; printf "%10d", k2[c]}
printf " =%7d\n", sum}' file | sort -nr
2917 2 0 0 0 = 2
2916 0 0 0 5 = 5
2915 2 0 3 0 = 5
2914 2 2 1 0 = 5
2912 0 2 0 0 = 2
2911 0 2 0 0 = 2
2910 3 0 0 0 = 3
Total 9 6 4 5 = 24
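On the split question: one way to use the exact values "10,20,30,40" is to map each value to a column index and look it up directly, instead of comparing against upper limits. The following is only a minimal sketch under that assumption; the array names VAL, COL, CNT and SEEN are made up for illustration, and the three input lines are taken from the sample data.

```shell
# Sketch: count column-4 values per code using an exact value -> column map.
result=$(printf '%s\n' \
    '4052538777,2914,04-May-2018-22,10' \
    '4052538801,2914,04-May-2018-22,30' \
    '4052539509,2915,04-May-2018-22,30' |
awk -F, '
BEGIN {
    # VAL[1..n] holds the exact values; COL maps a value to its column index
    n = split("10,20,30,40", VAL, ",")
    for (i = 1; i <= n; i++) COL[VAL[i]] = i
}
($4 in COL) { CNT[$2, COL[$4]]++; SEEN[$2] }   # ignore values not in the list
END {
    for (code in SEEN) {
        line = code
        for (i = 1; i <= n; i++) line = line " " (CNT[code, i] + 0)
        print line
    }
}' | sort -n)
echo "$result"
```

Rows whose column 4 is not one of the listed values are skipped here; whether that is the desired behaviour is an assumption.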
I have many files with three columns in a form of:
file1 | file2
1 0 1 | 1 0 2
2 3 3 | 2 3 7
3 6 2 | 3 6 0
4 1 0 | 4 1 3
5 2 4 | 5 2 1
First two columns are the same in each file. I want to calculate a sum of 3 columns from every file to receive something like this:
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
For two files,
awk 'FNR==NR { _a[FNR]=$3;} NR!=FNR { $3 += _a[FNR]; print; }' file*
works perfectly (I found this solution via Google). How can I change it to handle many files?
All you need is:
awk '{sum[FNR]+=$3} ARGIND==(ARGC-1){print $1, $2, sum[FNR]}' file*
The above uses GNU awk for ARGIND. With other awks, just add FNR==1{ARGIND++} at the start.
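A portable variant of the same idea might look like the sketch below. It counts files itself in a variable called fileno (a made-up name, chosen so it does not clash with gawk's built-in ARGIND) and assumes that every argument is a non-empty file. The three sample files are hypothetical.

```shell
# Hypothetical sample files in a scratch directory
dir=$(mktemp -d)
printf '1 0 1\n2 3 3\n' > "$dir/file1"
printf '1 0 2\n2 3 7\n' > "$dir/file2"
printf '1 0 4\n2 3 0\n' > "$dir/file3"

# fileno is incremented on the first line of each file;
# ARGC-1 is the number of file arguments, so the print rule
# fires only while reading the last file.
result=$(awk '
FNR == 1           { fileno++ }
                   { sum[FNR] += $3 }
fileno == ARGC - 1 { print $1, $2, sum[FNR] }
' "$dir"/file*)
echo "$result"
rm -rf "$dir"
```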
Since the first two columns are same in each file:
awk 'NR==FNR{b[FNR]=$1 FS $2;}{a[FNR]+=$3}END{for(i=1;i<=length(a);i++){print b[i] FS a[i];}}' file*
Array a holds the cumulative sum of the 3rd column across all files.
Array b stores the 1st and 2nd column values, read from the first file.
In the END block, we print the contents of b and a for each row.
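Some older awks do not accept length() on an array, so here is a sketch of the same approach that tracks the row count itself. The names key, sum and n are illustrative only, and the sample files are shortened copies of the ones in the question.

```shell
dir=$(mktemp -d)
printf '1 0 1\n2 3 3\n3 6 2\n' > "$dir/f1"
printf '1 0 2\n2 3 7\n3 6 0\n' > "$dir/f2"

result=$(awk '
NR == FNR { key[FNR] = $1 FS $2 }               # first file: remember columns 1 and 2
          { sum[FNR] += $3; if (FNR > n) n = FNR }  # every file: add column 3
END       { for (i = 1; i <= n; i++) print key[i] FS sum[i] }
' "$dir/f1" "$dir/f2")
echo "$result"
rm -rf "$dir"
```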
file1
$ cat f1
1 0 1
2 3 3
3 6 2
4 1 0
5 2 4
file2
$ cat f2
1 0 2
2 3 7
3 6 0
4 1 3
5 2 1
Output
$ awk -v start=3 'NF{for(i=1; i<=NF; i++)a[FNR, i] = i>=start ? a[FNR, i]+$i : $i }END{ for(j=1; j<=FNR; j++){ s = ""; for(i=1; i<=NF; i++){ s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "") } print s } }' f1 f2
1 0 3
2 3 10
3 6 2
4 1 3
5 2 5
More readable version
The variable start decides from which column summing starts. For example, if you set it to 2, columns 2, 3, and so on are summed across all files. Since all files have the same number of fields and rows, this works well.
awk -v start=3 '
NF{
for(i=1; i<=NF; i++)
a[FNR, i] = i>=start ? a[FNR, i]+$i : $i
}
END{
for(j=1; j<=FNR; j++)
{
s = "";
for(i=1; i<=NF; i++)
{
s = (s ? s OFS:"")((j,i) in a ? a[j,i] : "")
}
print s
}
}
' f1 f2
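To illustrate what start does, here is a small sketch run with start=2 on two shortened sample files. It follows the same logic as the script above, except that it records the row and column counts in variables (rows and cols, names chosen here) rather than reading FNR and NF in END.

```shell
dir=$(mktemp -d)
printf '1 0 1\n2 3 3\n' > "$dir/f1"
printf '1 0 2\n2 3 7\n' > "$dir/f2"

result=$(awk -v start=2 '
NF {
    for (i = 1; i <= NF; i++)
        a[FNR, i] = (i >= start) ? a[FNR, i] + $i : $i   # sum from column "start" on
    if (FNR > rows) rows = FNR
    if (NF > cols) cols = NF
}
END {
    for (j = 1; j <= rows; j++) {
        s = ""
        for (i = 1; i <= cols; i++)
            s = s (i > 1 ? OFS : "") a[j, i]
        print s
    }
}' "$dir/f1" "$dir/f2")
echo "$result"
rm -rf "$dir"
```

With start=2, column 1 is copied as is, while columns 2 and 3 are both summed across the files.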
I have a file with multiple columns (more than 1000). Each column contains the numbers 0, 1, or some other value. The tab-delimited file looks like:
0 0 0
0 0 0
1 2 0
1 0 0
1 1 0
I want to count the occurrences of each unique value in each column of the file. How do I do that using AWK or shell?
P.S. To count the occurrences of each unique value in the first column, I used this AWK code:
awk '{h[$1]++}; END {for (k in h) print k, h[k]}' file > output-file
It gives the results as :
0 2
1 3
which means 0 occurs twice in column 1 and 1 occurs thrice in column 1.
I want to do the same for a file having over 1000 columns.
You just need to make the keys for associative array h contain both column number, i, and column value, $i:
$ awk '{for (i=1;i<=NF;i++) h[i" "$i]++}; END {for (k in h) print k, h[k]}' file | sort -n
1 0 2
1 1 3
2 0 3
2 1 1
2 2 1
3 0 5
The last line above indicates that column 3 has the value 0 occurring 5 times.
In more detail:
for (i=1;i<=NF;i++) h[i" "$i]++
This loops over all columns from the first, i=1, to the last, i=NF. For each column, it updates the counter h for that column and its value.
END {for (k in h) print k, h[k]}
This prints one line per (column, value) pair, followed by its count.
sort -n
Because for (k in h) does not produce keys in any particular order, we put the output through sort.
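As a variation, the key can be built with awk's comma subscript (joined by SUBSEP) and split apart again in END, which also makes it easy to sort by column and then by value. This is only a sketch on the five-line sample from the question; the array name p is arbitrary.

```shell
# Tab-delimited sample from the question
result=$(printf '0\t0\t0\n0\t0\t0\n1\t2\t0\n1\t0\t0\n1\t1\t0\n' |
awk -F'\t' '
{ for (i = 1; i <= NF; i++) h[i, $i]++ }      # key is column SUBSEP value
END {
    for (k in h) {
        split(k, p, SUBSEP)                   # p[1] = column, p[2] = value
        print p[1], p[2], h[k]
    }
}' | sort -k1,1n -k2,2n)
echo "$result"
```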
With GNU awk 4.0 or later, using true multidimensional arrays (arrays of arrays)
Sample input: a matrix of n=3 columns containing integer values
0 0 0
0 0 0
1 2 0
1 0 0
1 1 0
4 0 0
7 -1 -2
The output is a vector of the data values that occur in the input (column 0), followed by a matrix of n=3 columns holding the count of each data value in the corresponding column of the input matrix
-1 0 1 0
-2 0 0 1
0 2 4 6
1 3 1 0
2 0 1 0
4 1 0 0
7 1 0 0
code
awk '
NR==1 {ncols=NF}
{for(i=1; i <=NF; ++i) ++c[$i][i-1]}
END{
for(i in c) {
printf("%d ", i)
for(j=0; j < ncols; ++j) {
printf("%d ", j in c[i]?c[i][j]: 0)
}
printf("\n")
}
}
'
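For awks without arrays of arrays, roughly the same table can be produced with ordinary SUBSEP-joined subscripts. The sketch below is one possible way, not the code above rewritten exactly; the array names c and vals are illustrative, and the result is piped through sort -n so the row order is predictable.

```shell
result=$(printf '%s\n' '0 0 0' '0 0 0' '1 2 0' '1 0 0' '1 1 0' '4 0 0' '7 -1 -2' |
awk '
{
    if (NF > ncols) ncols = NF
    for (i = 1; i <= NF; i++) { c[$i, i]++; vals[$i] }   # count value per column
}
END {
    for (v in vals) {
        line = v
        for (j = 1; j <= ncols; j++) line = line " " (c[v, j] + 0)
        print line
    }
}' | sort -n)
echo "$result"
```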
I have two files as below:
file1 (10 lines) :
1 0 1 0 1 1 1 1 1
1 1 1 2 3 4 5 1 1
......
file2 (4 lines ) :
1 2 3 1 1 1 1 1 1
1 2 1 1 1 1 1 3 1
.......
For each row in file1, I want to take the difference between every row of file2 and that row of file1, and save the result to a separate file file3$i, which may look like:
for 1st row of file1
file31 (4 lines) :
0 2 2 1 0 0 0 0 0
0 2 0 1 0 0 0 2 0
..........
To be more clear.
For the 1st row of file1 the output is like:
1st row of file2 - 1st row of file1
2nd row of file2 - 1st row of file1
3rd row of file2 - 1st row of file1
4th row of file2 - 1st row of file1
For the 2nd row of file1 the output is like:
1st row of file2 - 2nd row of file1
2nd row of file2 - 2nd row of file1
3rd row of file2 - 2nd row of file1
4th row of file2 - 2nd row of file1
and so on for the rest of the rows in file1.
I tried using awk but could not find a smart solution.
You can use this awk command:
awk 'FNR==NR {
for(i=1; i<=NF; i++)
a[FNR,i]=$i
nr1=FNR
next
} {
out="outfile" FNR
for(r=1; r<=nr1; r++) {
for(i=1; i<=NF; i++) {
v = a[r,i] - $i
printf "%s%s", (v>0?v:0), (i==NF)?ORS:OFS > out
}
}
close(out)
}' file2 file1
This command writes a separate output file for each row in file1, named outfile1, outfile2, outfile3, etc. Note that negative differences are printed as 0.
Output:
cat outfile1
0 2 2 1 0 0 0 0 0
0 2 0 1 0 0 0 2 0
cat outfile2
0 1 2 0 0 0 0 0 0
0 1 0 0 0 0 0 2 0
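If signed differences are wanted instead of clamping negatives to 0 (an assumption, since the sample in the question only shows non-negative values), a sketch along the same lines could build each output line first and then print it. The output names diff1.txt, diff2.txt are made up for this example.

```shell
dir=$(mktemp -d)
printf '1 0 1 0 1 1 1 1 1\n1 1 1 2 3 4 5 1 1\n' > "$dir/file1"
printf '1 2 3 1 1 1 1 1 1\n1 2 1 1 1 1 1 3 1\n' > "$dir/file2"

(cd "$dir" && awk '
FNR == NR {                       # first argument (file2): store every field
    for (i = 1; i <= NF; i++) a[FNR, i] = $i
    nr2 = FNR
    next
}
{                                 # second argument (file1): one output file per row
    out = "diff" FNR ".txt"
    for (r = 1; r <= nr2; r++) {
        line = ""
        for (i = 1; i <= NF; i++)
            line = line (i > 1 ? OFS : "") (a[r, i] - $i)
        print line > out
    }
    close(out)
}' file2 file1)

result1=$(cat "$dir/diff1.txt")
result2=$(cat "$dir/diff2.txt")
echo "$result2"
rm -rf "$dir"
```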
I have the following data set, in which the 3rd field consists of 0's and 1's.
Input
1 2 1
2 4 0
3 3 1
4 1 1
5 0 0
I wish to expand the data set to the following format:
Duplicate each row as many times as the value in the 2nd field, and
replace only the "new" 1's (obtained after duplication) in the 3rd field with 0.
How can I do this with AWK?
Thanks
Output
1 2 1
1 2 0
2 4 0
2 4 0
2 4 0
2 4 0
3 3 1
3 3 0
3 3 0
4 1 1
awk '{print; $3=0; for (i=1; i<$2; i++) print}' inputfile
If you want to actually skip records with a zero in the second field (as your example seems to show):
awk '{if ($2>0) print; $3=0; for (i=1; i<$2; i++) print}' inputfile
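The same behaviour can also be written without modifying $3, by printing the fields explicitly. A small sketch on the sample input from the question, assuming rows with 0 in the second field should be dropped:

```shell
result=$(printf '%s\n' '1 2 1' '2 4 0' '3 3 1' '4 1 1' '5 0 0' |
awk '$2 > 0 {
    print $1, $2, $3                                # original row, printed once
    for (i = 1; i < $2; i++) print $1, $2, 0        # extra copies get 0 in field 3
}')
echo "$result"
```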