Sum, count, distinct-count fields with awk - shell

I have a text file like this:
A B C D E
----------------------
x x e 2 10
y y g 1 8
z o e 2 9
o o q 1 10
p z e 3 22
x x e 1 11
z o a 1 24
y z b 1 25
I want to use awk to do the same thing as this SQL does:
select A,
B,
count(distinct C),
sum(D),
sum(case when E>20 then E else 0 END)
from test
group by A,B
output:
A B count(distinct C) sum(D) sum(case when E>20 then E else 0 END)
-------------------------------------------------------
o o 1 1 0
p z 1 3 22
x x 1 3 0
y y 1 1 0
y z 1 1 25
z o 2 3 24
Here is my solution, but the distinct part is not complete:
awk '
{
    idx4[$1"|"$2] += $4                                        # sum(D)
    idx5[$1"|"$2] = $5>20 ? idx5[$1"|"$2]+$5 : idx5[$1"|"$2]   # sum(E) only when E>20
}
END {
    for (i in idx4) print i, idx4[i], idx5[i]
}' OFS="\t" test
=============================================================================
I have now completed this after a few hours; here is my code:
awk '
{
    if (idx3[$1"|"$2, $3] == 0) {       # first time this C value appears in this A,B group
        idx3[$1"|"$2, $3] += 1
    }
    idx4[$1"|"$2] += $4
    idx5[$1"|"$2] = $5>20 ? idx5[$1"|"$2]+$5 : idx5[$1"|"$2]
}
END {
    for (j in idx3) {                   # one entry per (group, C value): tally per group
        split(j, idx, SUBSEP)
        count[idx[1]]++
    }
    for (i in idx4) {
        print i, count[i], idx4[i], idx5[i]
    }
}' OFS="\t" test
Scrutinizer has given more readable code below; I think that is better.

Try this (similar to your own solution):
awk '
NR<3 {
    next
}
{
    i = $1 OFS $2
    D[i] += $4
}
!A[i,$3]++ {
    C[i]++
}
$5>20 {
    E[i] += $5
}
END {
    for (i in D) print i, C[i], D[i], E[i]+0
}
' OFS='\t' infile
NR<3 is used to skip the two header lines. If they are not present in the input file you can leave that section out.
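With the sample input this yields the expected groups (columns are tab-separated; shown here with single spaces, and the for (i in D) order is unspecified, so the rows may come out in a different order):
o o 1 1 0
p z 1 3 22
x x 1 3 0
y y 1 1 0
y z 1 1 25
z o 2 3 24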

Please try this script; I have tested it and it outputs the result you expect.
awk '
{
    if (NR<3) { next }
    idx4[$1"|"$2] += $4
    idx5[$1"|"$2] = $5>20 ? idx5[$1"|"$2]+$5 : idx5[$1"|"$2]
    if (index(TR[$1"|"$2], $3) == 0) {
        TR[$1"|"$2] = TR[$1"|"$2] "|" $3   # running list of the distinct C values seen
        TRD[$1"|"$2] += 1                  # distinct count for this A,B group
    }
}
END {
    for (i in idx4) print i, TRD[i], idx4[i], idx5[i]+0
}' OFS="\t" test
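One caveat: index() does a plain substring search, so if one C value can be a substring of another (say b and ab), the test reports a false match and undercounts. Padding both the list and the candidate with the delimiter avoids that; a minimal sketch of the adjusted test, assuming values of $3 never contain |:
if ( index("|" TR[$1"|"$2] "|", "|" $3 "|") == 0 ) {
    TR[$1"|"$2] = TR[$1"|"$2] "|" $3
    TRD[$1"|"$2] += 1
}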

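With GNU awk 4.0 or newer you can instead use true multidimensional arrays for the distinct count and PROCINFO["sorted_in"] for sorted output: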
$ gawk '{ a[$1,$2]["C"][$3]; a[$1,$2]["D"]+=$4; a[$1,$2]["E"]+=($5>20 ? $5:0) }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for ( i in a) {
split(i, b, SUBSEP)
print b[1], b[2], length(a[i]["C"]), a[i]["D"], a[i]["E"]
}
}' OFS='\t' file
o o 1 1 0
p z 1 3 22
x x 1 3 0
y y 1 1 0
y z 1 1 25
z o 2 3 24

Related

Understanding AWK and CSV files

How can I write an AWK program that analyses a list of fields in CSV files, counts the number of occurrences of each different string in the specified field, and prints out the count of each string found? I have only coded in C and Java, so I am completely confused by the syntax of AWK. I understand the simplest of concepts; however, AWK is structured much differently. Any time you can spare is appreciated, thank you!
BEGIN {
FS = ""
}
{
for(i = 1; i <= NF; i++)
freq[$i]++
PROCINFO ["sorted_in"] = "#val_num_desc" #this got the desired result
}
END {
for {this in freq)
printf "%s\t%d\n", this, freq[this]
}
On a CSV file containing:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
I am able to obtain the result:
A 2
B 1
C 3
Q 1
D 1
E 2
F 1
, 12
G 1
W 1
Field1,Field2,Field3,Field4 1
Z 2
This is the desired result:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
Field1 1
Field2 1
F 1
Field3 1
G 1
Field4 1
W 1
There is an edit to my code which is commented.
Fixed your code:
$ awk '
BEGIN { # you need BEGIN block for FS
FS = ", *" # your data had ", " and "," seps
} # ... based on your sample output
{
for(i = 1; i <= NF; i++)
freq[$i]++
}
END {
for(this in freq) # fixed a parenthesis
printf "%s\t%d\n", this, freq[this]
}' file
Output (using GNU awk; other awks display the output in a different order):
A 2
B 1
C 3
Q 1
D 2
Field1 1
E 2
Field2 1
F 1
Field3 1
G 1
Field4 1
W 1
Z 2
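If you also want the counts in descending order, as in your desired output: in GNU awk your PROCINFO["sorted_in"] = "#val_num_desc" trick does work, and it only needs to be set once (e.g. in the END block, before the for loop). A portable alternative is to pipe the tab-separated output through sort; a sketch, reusing the fixed script above:
$ awk 'BEGIN { FS = ", *" }
       { for (i = 1; i <= NF; i++) freq[$i]++ }
       END { for (this in freq) printf "%s\t%d\n", this, freq[this] }' file |
  sort -k2,2nr -k1,1      # count descending, then key ascending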
AWK really isn't the right tool for this job. While AWK can interpret comma- or tab-separated data, it has no concept of field enclosures or escapes. So it could handle a simple example like:
Month,Day
January,Sunday
February,Monday
but would fail with this valid example:
Month,Day
January,"Sunday"
February,"Monday"
Because of that, I would recommend considering another language. Something like Python:
import csv
o = open('a.csv')
for m in csv.DictReader(o):
    print(m)
https://docs.python.org/library/csv.html
Or Ruby:
require 'csv'
CSV.table('a.csv').each do |m|
  p m
end
https://ruby-doc.org/stdlib/libdoc/csv/rdoc/CSV.html
Or even PHP:
<?php
$r = fopen('a.csv', 'r');
$a_head = fgetcsv($r);
// loop until fgetcsv() returns false, so the last row isn't dropped
// when the file has no trailing newline
while (($a_row = fgetcsv($r)) !== false) {
    $m_row = array_combine($a_head, $a_row);
    print_r($m_row);
}
https://php.net/function.fgetcsv

How to create a pivot table from a CSV file having subgroups and getting the count of the last values using shell script?

I want to group by the leading columns, then form subgroups, getting the count of the last column's values.
For example: main group A, subgroups D, J, P, and the count of P within each subgroup, as well as the total count over the last column.
I am able to form the groups, but the subgroups seem a little hard. Any help on how to get this is appreciated.
Input:
A,D,J,P
A,D,J,Q
A,D,K,P
A,D,K,P
A,E,J,Q
A,E,K,Q
A,E,J,Q
B,F,L,R
B,F,L,R
B,F,M,S
C,H,N,T
C,H,O,U
C,H,N,T
C,H,O,U
Output:
A D J P 1
      Q 1
    K P 2
A E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
    Total 14
Here's a different approach: a shell script that uses SQLite to calculate the group counts (requires SQLite 3.25 or newer because it uses window functions):
#!/bin/sh
file="$1"
sqlite3 -batch -noheader <<EOF
CREATE TABLE data(c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT);
.mode csv
.import "$file" data
.mode list
.separator " "
SELECT (CASE c1 WHEN lag(c1, 1) OVER (PARTITION BY c1 ORDER BY c1) THEN ' ' ELSE c1 END)
, (CASE c2 WHEN lag(c2, 1) OVER (PARTITION BY c1,c2 ORDER BY c1,c2) THEN ' ' ELSE c2 END)
, (CASE c3 WHEN lag(c3, 1) OVER (PARTITION BY c1,c2,c3 ORDER BY c1,c2,c3) THEN ' ' ELSE c3 END)
, c4
, count(*)
FROM data
GROUP BY c1, c2, c3, c4
ORDER BY c1, c2, c3, c4;
SELECT 'Total ' || count(*) FROM data;
EOF
Running this gives:
$ ./group.sh example.csv
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
Also a one-liner using datamash, though it doesn't include the fancy output format:
$ datamash -st, groupby 1,2,3,4 count 4 < example.csv | tr , ' '
A D J P 1
A D J Q 1
A D K P 2
A E J Q 2
A E K Q 1
B F L R 2
B F M S 1
C H N T 2
C H O U 2
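datamash doesn't emit the grand total here; assuming the file has no header line (as in the sample), a second command can append it:
$ printf 'Total %d\n' "$(wc -l < example.csv)"
Total 14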
Using Perl
Script
perl -0777 -lne '
s/^(.+?)$/$x++;$kv{$1}++/mge;       # count each line into %kv and the total into $x
foreach my $k (sort keys %kv)
{ $q=$c=$k;
  while(length($p) > 0)
  {
    last if $c=~/^$p/g;             # stop once $p is a prefix of the current key
    $q=substr($c,length($p)-1);     # keep the suffix after the shared prefix
    $p=~s/(.$)//;                   # otherwise shorten the previous key by one char
  }
  printf( "%9s\n", "$q $kv{$k}") ;  # right-align to the full record width
  $p=$k;
}
print "Total $x";
' anurag.txt
Output:
A,D,J,P 1
      Q 1
    K,P 2
  E,J,Q 2
    K,Q 1
B,F,L,R 2
    M,S 1
C,H,N,T 2
    O,U 2
Total 14
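Here is a plain awk take that keeps the first-appearance order of the input and, on each output line, blanks the leading fields that match the previous record: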
$ cat tst.awk
BEGIN { FS="," }
!($0 in cnt) { recs[++numRecs] = $0 }
{ cnt[$0]++ }
END {
for (recNr=1; recNr<=numRecs; recNr++) {
rec = recs[recNr]
split(rec,f)
newVal = 0
for (i=1; i<=NF; i++) {
if (f[i] != p[i]) {
newVal = 1
}
printf "%s%s", (newVal ? f[i] : " "), OFS
p[i] = f[i]
}
print cnt[rec]
tot += cnt[rec]
}
print "Total", tot+0
}
$ awk -f tst.awk file
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
I'll propose a multi-stage solution in the spirit of the unix toolset.
First, create a sorted, counted, de-delimited data format:
$ sort file | uniq -c | awk '{print $2,$1}' | tr ',' ' '
A D J P 1
A D J Q 1
A D K P 2
A E J Q 2
A E K Q 1
B F L R 2
B F M S 1
C H N T 2
C H O U 2
Now the task is removing the longest common prefix from consecutive lines:
... | awk 'NR==1 {p=$0}
           NR>1  {k=0
                  while (p ~ (t = substr($0, 1, ++k)));
                  gsub(/./," ",t); sub(/^ /,"",t)
                  p=$0; $0=t substr(p,k)}1'
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Whether that's easier to understand than a single script remains to be seen.
I didn't have an answer that produced exactly your example output, but I was close enough to dare posting one.
Now I do have an answer that produces exactly your example output... :-)
$ cat ABCD
A,D,J,P
A,D,J,Q
A,D,K,P
A,D,K,P
A,E,J,Q
A,E,K,Q
A,E,J,Q
B,F,L,R
B,F,L,R
B,F,M,S
C,H,N,T
C,H,O,U
C,H,N,T
C,H,O,U
$ awk '{a[$0]+=1}END{for(i in a) print i","a[i];print "Total",NR}' ABCD |\
sort | \
awk -F, '
/Total/{print;next}
{print a1==$1?" ":$1,a2==$2?" ":$2,a3==$3?" ":$3,a4==$4?" ":$4,$5
a1=$1;a2=$2;a3=$3;a4=$4}'
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K   1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
$
The first awk script iterates over every line; for each line we increment an element of array a indexed by the whole line. At the end (the END block) we loop over the indices of a, printing each index and its associated value, i.e. how many times that line occurs in the data. We also output the total number of lines processed, which awk maintains automatically in the variable NR (number of records).
The second awk script either prints the Total line and skips further processing, or it compares each field (split on commas) with the corresponding field of the previous line and outputs either the new field or a space accordingly.

AWK printing fields in multiline records

I have an input file with fields spread over several lines. In this file, the field pattern is repeated according to the query size.
ZZZZ
21293
YYYYY XXX WWWW VV
13242 MUTUAL BOTH NO
UUUUU TTTTTTTT SSSSSSSS RRRRR QQQQQQQQ PPPPPPPP
3 0 3 0
NNNNNN MMMMMMMMM LLLLLLLLL KKKKKKKK JJJJJJJJ
2 0 5 3
IIIIII HHHHHH GGGGGGG FFFFFFF EEEEEEEEEEE DDDDDDDDDDD
5 3 0 3
My desired output is one line per complete group of fields. Empty fields should be marked, for example with "X".
21293 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
12345 67890 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
I have been thinking about how I can get the desired output with awk/unix scripts but can't figure it out. Any ideas? Thank you very much!
This isn't really a great fit for awk's style of programming, which is based on fields that are delimited by a pattern, not fields with variable positions on the line. But it can be done.
When you process the first line in each pair, scan through it finding the positions of the beginning of each field name.
awk 'NR%3 == 1 {
delete fieldpos;
delete fieldlen;
lastspace = 1;
fieldindex = 0;
for (i = 1; i <= length(); i++) {
if (substr($0, i, 1) != " ") {
if (lastspace) {
fieldpos[fieldindex] = i;
if (fieldindex > 0) {
fieldlen[fieldindex-1] = i - fieldpos[fieldindex-1];
}
fieldindex++;
}
lastspace = 0;
} else {
lastspace = 1;
}
}
}
NR%3 == 2 {
for (i = 0; i < fieldindex; i++) {
if (i in fieldlen) {
f = substr($0, fieldpos[i], fieldlen[i]);
} else { # last field, go to end of line
f = substr($0, fieldpos[i]);
}
gsub(/^ +| +$/, "", f); # trim surrounding spaces
if (f == "") { f = "X" }
printf("%s ", f);
}
}
NR%15 == 14 { print "" } # print newline after 5 data blocks
'
Assuming your fields are separated by blank chars and not tabs, GNU awk's FIELDWIDTHS is designed to handle this sort of situation:
/^ZZZZ/ { if (rec!="") print rec; rec="" }
/^[[:upper:]]/ {
FIELDWIDTHS = ""
while ( match($0,/\S+\s*/) ) {
FIELDWIDTHS = (FIELDWIDTHS ? FIELDWIDTHS " " : "") RLENGTH
$0 = substr($0,RLENGTH+1)
}
next
}
NF {
for (i=1;i<=NF;i++) {
gsub(/^\s+|\s+$/,"",$i)
$i = ($i=="" ? "X" : $i)
}
rec = (rec=="" ? "" : rec " ") $0
}
END { print rec }
$ awk -f tst.awk file
2129 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
In other awks you'd use match()/substr(). Note that the above isn't perfect in that it truncates a char off 21293 - that's because I'm not convinced your input file is accurate and if it is you haven't told us why that number is longer than the string on the preceding line or how to deal with that.
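A rough portable sketch of that idea, measuring column offsets from each header line with match()/substr() instead of FIELDWIDTHS (the accumulation into one rec per ZZZZ block is left out for brevity, and the same caveat applies about data wider than its heading):
/^[[:upper:]]/ {
    n = 0; s = $0; off = 1
    while (match(s, /[^ ]+ */)) {
        start[++n] = off + RSTART - 1    # absolute start of this column
        width[n]   = RLENGTH             # heading text plus trailing blanks
        off += RSTART + RLENGTH - 1
        s = substr(s, RSTART + RLENGTH)
    }
    next
}
NF {
    line = ""
    for (i = 1; i <= n; i++) {
        f = (i < n) ? substr($0, start[i], width[i]) : substr($0, start[i])
        gsub(/^ +| +$/, "", f)           # trim, then mark empty fields
        line = line (i > 1 ? " " : "") (f == "" ? "X" : f)
    }
    print line
}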

replacing specific value (from another file) using awk

I have the following files.
File1
a b 1
c d 2
e f 3
File2
x l
y m
z n
I want to replace 1 by x and save the result in file3; next time, replace 1 by y and save in file4.
Then the files look like:
File3
a b x
c d 2
e f 3
File4
a b y
c d 2
e f 3
Once x, y, and z are done, replace 2 by l, m, and n in the same way.
I started with this, but it inserts rather than replaces:
awk -v r=1 -v c=3 -v val=x -F, '
BEGIN{OFS=" "}; NR != r; NR == r {$c = val; print}
' file1 >file3
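The insertion happens because of -F,: the input is space-separated, so with a comma field separator the whole line is one field, and assigning $c with c=3 appends two new fields instead of overwriting one. Dropping -F, (letting awk split on whitespace) should be all that's needed; a sketch:
awk -v r=1 -v c=3 -v val=x '
NR != r                         # pass through every row except row r
NR == r { $c = val; print }     # overwrite column c on row r
' file1 > file3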
Here's a GNU awk script (because it uses true multidimensional arrays and array ordering) that will do what you want:
#!/usr/bin/gawk -f
BEGIN { fcnt=3 }
FNR==NR { for(i=1;i<=NF;i++) f2[i][NR]=$i; next }
{
fout[FNR][1] = $0
ff = $NF
if(ff in f2) {
for( r in f2[ff]) {
$NF = f2[ff][r]
fout[FNR][fcnt++] = $0
}
}
}
END {
for(f=fcnt-1;f>=3;f--) {
for( row in fout ) {
if( fout[row][f] != "" ) out = fout[row][f]
else out = fout[row][1]
print out > "file" f
}
}
}
I made at least one major assumption about your input data:
The field number in file2 corresponds exactly to the value that needs to be replaced in file1. For example, x is field 1 in file2, and 1 is what needs replacing in the output files.
Here's the breakdown:
Set fcnt=3 in the BEGIN block.
FNR==NR - store the contents of File2 in the f2 array by (field number, line number).
Store the original file1 line in fout as (line number, 1) - where 1 is a special, available array position (because fcnt starts at 3).
Save off $NF as ff because it's going to be reset
Whenever ff is a field number in the first subscript of the f2 array, then reset $NF to the value from file2 and then assign the result to fout at (line number, file number) as $0 ( recomputed ).
In the END, loop over the fcnt in reverse order, and either set out to a replaced line value or an original line value in row order, then print out to the desired filename.
It could be run like gawk -f script.awk file2 file1 (notice the file order). I get the following output:
$ cat file[3-8]
a b x
c d 2
e f 3
a b y
c d 2
e f 3
a b z
c d 2
e f 3
a b 1
c d l
e f 3
a b 1
c d m
e f 3
a b 1
c d n
e f 3
This could be made more efficient for memory by only performing the lookup in the END block, but I wanted to take advantage of the $0 recompute instead of needing calls to split in the END.

awk skipping records. getline command

This is a task related to data compression using the Fibonacci binary representation.
What I have is this text file:
result.txt
a 20
b 18
c 18
d 15
e 7
This file is the result of scanning a text file and counting the appearances of each char, using awk.
Now I need to give each char the length of its Fibonacci-binary representation.
Since I'm new to Ubuntu and the terminal, I first wrote a program in Java that receives a number and prints the lengths of all Fibonacci codewords up to that number, and it works.
That is exactly what I'm trying to do here; the problem is that my awk version doesn't work...
The lengths of Fibonacci codewords themselves grow according to the Fibonacci sequence.
These are the rules:
f(1)=1 - there is 1 codeword of length 1.
f(2)=1 - there is 1 codeword of length 2.
f(3)=2 - there are 2 codewords of length 3.
f(4)=3 - there are 3 codewords of length 4.
and so on...
(I'm adding one more bit to each codeword, so the first two lengths will be 2 and 3.)
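By those rules, the five characters above would get codeword lengths 2, 3, 4, 4, 5; without the extra bit they would be 1, 2, 3, 3, 4, which is what the fib.awk output further down shows.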
This is the code I've made; its name is scr5:
{
a=1;
b=1;
len=2
print $1 , $2, len;
getline;
print $1 ,$2, len+1;
getline;
len=4;
for(i=1; i< num; i++){
c= a+b;
g=c;
while (c >= 1){
print $1 ,$2, len ;
if (getline<=0){
print "EOF"
exit;
}
c--;
i++;
}
a=b;
b=c;
len++;
}}
Now I run in the terminal:
n=5
awk -v num=$n -f scr5 a
And there are two problems:
1. It skips the third letter, c.
2. On the fourth letter, d, it prints the length of the first letter, 2, instead of length 3.
I guess that there is a problem with the getline command.
Thank you very much!
Search Google for getline and awk and you'll mostly find reasons to avoid getline completely! Often it's a sign you're not really doing things the "awk" way. Find an awk tutorial and work through the basics, and I'm sure you'll quickly see why your attempt using getline is not getting you off in the right direction.
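A quick illustration of the trap: each plain getline consumes the next input record, so the main pattern-action loop never sees it. For example:
$ seq 1 6 | awk '{ getline nxt; print $0, nxt }'
1 2
3 4
5 6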
In the script below, the BEGIN block is run once at the beginning before any input is read, and then the next block is automatically run once for each line of input --- without any need for getline.
Good luck!
$ cat fib.awk
BEGIN { prior_count = 0; count = 1; len = 1; remaining = count; }
{
    if (remaining == 0) {        # codewords of this length used up: move to the next
        temp = count;
        count += prior_count;    # Fibonacci step: f(n) = f(n-1) + f(n-2)
        prior_count = temp;
        remaining = count;       # f(n) codewords available at the new length
        ++len;
    }
    print $1, $2, len;
    --remaining;
}
$ cat fib.txt
a 20
b 18
c 18
d 15
e 7
f 0
g 0
h 0
i 0
j 0
k 0
l 0
m 0
$ awk -f fib.awk fib.txt
a 20 1
b 18 2
c 18 3
d 15 3
e 7 4
f 0 4
g 0 4
h 0 5
i 0 5
j 0 5
k 0 5
l 0 5
m 0 6
The above solution, in compressed form:
mawk 'BEGIN{ ___= __= _^=____=+_ } !_ { __+=(\
____=___+_*(_=___+=____))^!_ } $++NF = (_--<_)+__' fib.txt
a 20 1
b 18 2
c 18 3
d 15 3
e 7 4
f 0 4
g 0 4
h 0 5
i 0 5
j 0 5
k 0 5
l 0 5
m 0 6
