Sorting unique strings, creating columns and averaging

input file:
civil 4
posición 3
formación 7
posición 5
domingo 1
retrato 5
retrato 6
civil 6
formación 3
retrato 7
domingo 7
media 1
media 1
I want output as:
civil 4 domingo 1 formación 3 media 1 posición 3 retrato 5
civil 6 domingo 7 formación 7 media 1 posición 5 retrato 6
average# average# average# average# average# retrato 7
average#
so I can do sort -t"," to get the original input as
civil 4
civil 6
domingo 1
domingo 7
formación 3
formación 7
media 1
media 1
posición 3
posición 5
retrato 5
retrato 6
retrato 7
and something like awk '{x+=$insertcolumn} END { print x/NR }' to get the averages, but how do I get the column format in the middle step?
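The averaging piece on its own does not need a loop in the END block; a running sum divided by the line count is enough. A minimal sketch, assuming the value sits in column 2 of a whitespace-separated file:

```shell
# Average of the 2nd column: sum and count while reading, divide once at the end
printf '%s\n' 'civil 4' 'civil 6' 'domingo 1' |
awk '{ sum += $2; n++ } END { if (n) printf "%.2f\n", sum/n }'
# → 3.67
```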

$ cat tst.awk
BEGIN { nw=length("average"); vw=1 }
!seenCnt[$1]++ { keys[++numKeys]=$1 }
{
    vals[$1,seenCnt[$1]] = $2
    nw = (length($1) > nw ? length($1) : nw)
    vw = (length($2) > vw ? length($2) : vw)
    numRows = (seenCnt[$1] > numRows ? seenCnt[$1] : numRows)
}
END {
    for (rowNr=1; rowNr<=(numRows+1); rowNr++) {
        for (keyNr=1; keyNr<=numKeys; keyNr++) {
            key = keys[keyNr]
            name = val = ""
            if ( (key,rowNr) in vals ) {
                name = key
                val = vals[key,rowNr]
                sum[key] += vals[key,rowNr]
            }
            else if (key in sum) {
                name = "average"
                val = sum[key]/(rowNr-1)
                delete sum[key]
            }
            printf "%-*s %*s%s", nw, name, vw, val, (keyNr<numKeys?OFS:ORS)
        }
    }
}
$ sort file | awk -f tst.awk
civil 4 domingo 1 formación 3 media 1 posición 3 retrato 5
civil 6 domingo 7 formación 7 media 1 posición 5 retrato 6
average 5 average 4 average 5 average 1 average 4 retrato 7
average 6

Considering your input has comma-separated values (note this uses gawk's arrays of arrays):
Code
gawk <inputFile -F, 'BEGIN{max=0; maxl=0}$2 != ""{x=$1; a[x][0]+=$2; l=length(a[x]); a[x][l]=$2; if (l > max) max=l; l2=length($1); if (l2>maxl) maxl=l2}END{i=0; n=maxl+2; while (i<max){i++; for (j in a) {if (!a[j][i]) {printf("%"n"s %2s","",""); if (!b[j]) b[j]=a[j][0]/(i-1)} else {printf("%"n"s %2s",j,a[j][i]); if (i==max) b[j]=a[j][0]/i}}; print ""; }; print ""; for (j in a) {printf("%"maxl"s %.2f","avg",b[j])}; print ""}'
Explained version
BEGIN {
    max=0        # used to know how many lines to print
    maxl=0       # used to know how wide a column will be
}
$2 != "" {       # For all non-empty lines, do this block
    x=$1
    a[x][0]+=$2  # create the sum while reading input
                 # also used to make a[x] an array
    l=length(a[x])
    a[x][l]=$2   # appending to the array the new value
    if (l > max) max=l
    l2=length($1)
    if (l2>maxl) maxl=l2  # getting the longest word length
}
END {
    i=0
    n=maxl+2     # pretty print with additional spaces
    while (i<max) {
        i++      # skip 0-value which is the sum
        for (j in a) {
            if (!a[j][i]) {
                printf("%"n"s %2s","","")      # empty column
                if (!b[j]) b[j]=a[j][0]/(i-1)  # calculate average
            } else {
                printf("%"n"s %2s",j,a[j][i])  # show column
                if (i==max) b[j]=a[j][0]/i     # calculate average
            }
        }
        print "" # start next line
    }
    print ""     # skip a line
    for (j in a) {
        printf("%"maxl"s %.2f","avg",b[j])     # print averages
    }
    print ""     # end output with a newline
}
Input
civil,4
posición,3
formación,7
posición,5
domingo,1
retrato,5
retrato,6
civil,6
formación,3
retrato,7
domingo,7
media,1
media,1
Output
domingo 1 posición 3 media 1 retrato 5 civil 4 formación 7
domingo 7 posición 5 media 1 retrato 6 civil 6 formación 3
retrato 7
avg 4.00 avg 4.00 avg 1.00 avg 6.00 avg 5.00 avg 5.00
Edit for non-gawk
POSIX awk cannot use length() on an array, so we store the length in a separate array instead.
l=length(a[x])
a[x][l]=$2
if (l > max) max=l
Needs to be changed into
l[x]++
a[x][l[x]]=$2
if (l[x] > max) max=l[x]
awk one-liner
awk <inputFile -F, 'BEGIN{max=0; maxl=0}$2 != ""{x=$1; a[x][0]+=$2; l[x]++; a[x][l[x]]=$2; if (l[x] > max) max=l[x]; l2=length($1); if (l2>maxl) maxl=l2}END{i=0; n=maxl+2; while (i<max){i++; for (j in a) {if (!a[j][i]) {printf("%"n"s %2s","",""); if (!b[j]) b[j]=a[j][0]/(i-1)} else {printf("%"n"s %2s",j,a[j][i]); if (i==max) b[j]=a[j][0]/i}}; print ""; }; print ""; for (j in a) {printf("%"maxl"s %.2f","avg",b[j])}; print ""}'
(to test with awk semantics while you only have gawk, run it as gawk --posix; note that the a[x][0] arrays-of-arrays syntax is itself a gawk extension, so a strictly POSIX awk would need flat a[x,0]-style subscripts instead)
Bonus
Left as an exercise for the reader:
Replace the last for (...){print ...} loop to allow the output columns to be alphabetically sorted.
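One portable way to do the exercise (no gawk `PROCINFO["sorted_in"]` required) is to copy the keys into a numerically indexed array and insertion-sort them before the final print. A sketch with made-up values standing in for the averages array:

```shell
awk 'BEGIN {
    b["retrato"] = 6; b["civil"] = 5; b["domingo"] = 4   # stand-in for the avg array
    n = 0
    for (j in b) keys[++n] = j
    for (i = 2; i <= n; i++) {          # insertion sort of the key names
        k = keys[i]
        for (m = i - 1; m >= 1 && keys[m] > k; m--) keys[m+1] = keys[m]
        keys[m+1] = k
    }
    for (i = 1; i <= n; i++) printf "%s %s\n", keys[i], b[keys[i]]
}'
# → civil 5
#   domingo 4
#   retrato 6
```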

Related

Find linear trend up to the maximum value using awk

I have a datafile as below:
ifile.txt
-10 /
-9 /
-8 /
-7 3
-6 4
-5 13
-4 16
-3 17
-2 23
-1 26
0 29
1 32
2 35
3 38
4 41
5 40
6 35
7 30
8 25
9 /
10 /
Here "/" are the missing values. I would like to compute the linear trend up to the maximum value in the y-axis (i.e. up to the value "41" in 2nd column). So it should calculate the trend from the following data:
-7 3
-6 4
-5 13
-4 16
-3 17
-2 23
-1 26
0 29
1 32
2 35
3 38
4 41
The other (x, y) pairs are not considered because the y values after (4, 41) are less than 41.
The following script is working fine for all values:
awk '!/\//{sx+=$1; sy+=$2; c++;
sxx+=$1*$1; sxy+=$1*$2}
END {det=c*sxx-sx*sx;
print (det?(c*sxy-sx*sy)/det:"DIV0")}' ifile.txt
But I am not able to do it up to the maximum value.
For the given example the result will be 3.486
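The script in the question computes the ordinary least-squares slope, slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx*Sx). It can be sanity-checked on three collinear points whose slope is known to be 2:

```shell
# Least-squares slope on a tiny data set with a known answer (slope = 2)
printf '%s\n' '0 0' '1 2' '2 4' |
awk '{ sx+=$1; sy+=$2; sxx+=$1*$1; sxy+=$1*$2; c++ }
     END { det = c*sxx - sx*sx; print (det ? (c*sxy - sx*sy)/det : "DIV0") }'
# → 2
```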
Updated based on your comments. I assumed your trend calculations were good and used them:
$ awk '
$2!="/" {
    b1[++j]=$1                # buffer them up until or if used
    b2[j]=$2
    if(max=="" || $2>max) {   # once a bigger than current max found
        max=$2                # new champion
        for(i=1;i<=j;i++) {   # use all so far buffered values
            # print b1[i], b2[i]  # debug to see values used
            sx+=b1[i]         # Your code from here on
            sy+=b2[i]
            c++
            sxx+=b1[i]*b1[i]
            sxy+=b1[i]*b2[i]
        }
        j=0                   # buffer reset
        delete b1
        delete b2
    }
}
END {
    det=c*sxx-sx*sx
    print (det?(c*sxy-sx*sy)/det:"DIV0")
}' file
For data:
0 /
1 1
2 2
3 4
4 3
5 5
6 10
7 7
8 8
with the debug print uncommented, the program would output:
1 1
2 2
3 4
4 3
5 5
6 10
1.51429
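The run above can be reproduced as a one-shot pipeline (note that `delete b1` on a whole array is a common extension supported by gawk, mawk and BWK awk, though not strictly POSIX):

```shell
printf '%s\n' '0 /' '1 1' '2 2' '3 4' '4 3' '5 5' '6 10' '7 7' '8 8' |
awk '
$2 != "/" {
    b1[++j] = $1; b2[j] = $2          # buffer rows until a new max is seen
    if (max == "" || $2 > max) {
        max = $2
        for (i = 1; i <= j; i++) {    # fold the buffered rows into the sums
            sx += b1[i]; sy += b2[i]; sxx += b1[i]*b1[i]; sxy += b1[i]*b2[i]; c++
        }
        j = 0; delete b1; delete b2
    }
}
END { det = c*sxx - sx*sx; print (det ? (c*sxy - sx*sy)/det : "DIV0") }'
# → 1.51429
```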
You can update the accumulators only when $2 > max and save the intermediate rows in the meantime, for example using associative arrays:
awk '
$2 == "/" {next}
$2 > max {
    # update max if $2 > max
    max = $2;
    # move all elements of a1 to a and b1 to b
    for (k in a1) {
        a[k] = a1[k]; b[k] = b1[k]
    }
    # add the current row to a, b
    a[NR] = $1; b[NR] = $2;
    # reset a1, b1
    delete a1; delete b1;
    next;
}
# if $2 <= max, then set a1, b1
{ a1[NR] = $1; b1[NR] = $2 }
END{
    for (k in a) {
        #print k, a[k], b[k]
        sx += a[k]; sy += b[k]; sxx += a[k]*a[k]; sxy += a[k]*b[k]; c++
    }
    det=c*sxx-sx*sx;
    print (det?(c*sxy-sx*sy)/det:"DIV0")
}
' ifile.txt
#3.48601
Or calculate sx, sy etc directly instead of using arrays:
awk '
$2 == "/" {next}
$2 > max {
    # update max if $2 > max
    max = $2;
    # add the current row plus the cached values
    sx += $1+sx1; sy += $2+sy1; sxx += $1*$1+sxx1; sxy += $1*$2+sxy1; c += 1+c1
    # reset the cached variables
    sx1 = 0; sy1 = 0; sxx1 = 0; sxy1 = 0; c1 = 0;
    next;
}
# if $2 <= max, then calculate and cache the values
{ sx1 += $1; sy1 += $2; sxx1 += $1*$1; sxy1 += $1*$2; c1++ }
END{
    det=c*sxx-sx*sx;
    print (det?(c*sxy-sx*sy)/det:"DIV0")
}
' ifile.txt
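Fed the question's full ifile.txt (here as a here-doc), the cached-variables version reproduces the expected slope of 3.48601 end to end:

```shell
awk '
$2 == "/" { next }
$2 > max {
    max = $2
    sx += $1+sx1; sy += $2+sy1; sxx += $1*$1+sxx1; sxy += $1*$2+sxy1; c += 1+c1
    sx1 = sy1 = sxx1 = sxy1 = c1 = 0
    next
}
{ sx1 += $1; sy1 += $2; sxx1 += $1*$1; sxy1 += $1*$2; c1++ }
END { det = c*sxx - sx*sx; print (det ? (c*sxy - sx*sy)/det : "DIV0") }
' <<'EOF'
-10 /
-9 /
-8 /
-7 3
-6 4
-5 13
-4 16
-3 17
-2 23
-1 26
0 29
1 32
2 35
3 38
4 41
5 40
6 35
7 30
8 25
9 /
10 /
EOF
# → 3.48601
```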

Standard deviation of multiple files having different row sizes

I have a few files with different numbers of rows, but the number of columns in each file is the same. e.g.
ifile1.txt
1 1001 ? ?
2 1002 ? ?
3 1003 ? ?
4 1004 ? ?
5 1005 ? 0
6 1006 ? 1
7 1007 ? 3
8 1008 5 4
9 1009 3 11
10 1010 2 9
ifile2.txt
1 2001 ? ?
2 2002 ? ?
3 2003 ? ?
4 2004 ? ?
5 2005 ? 0
6 2006 6 12
7 2007 6 5
8 2008 9 10
9 2009 3 12
10 2010 5 7
11 2011 2 ?
12 2012 9 ?
ifile3.txt
1 3001 ? ?
2 3002 ? 6
3 3003 ? ?
4 3004 ? ?
5 3005 ? 0
6 3006 1 25
7 3007 2 3
8 3008 ? ?
In each file 1st column represents the index number and 2nd column as ID.
I would like to calculate the standard deviation for each index number from 3rd column onward.
The desired output:
1 ? ? ---- [computed from ?, ?, ?: no samples, so the answer is ?]
2 ? ? ---- [computed from ?, ?, 6: only one sample, so the answer is ?]
3 ? ?
4 ? ?
5 ? 0.00 ----- [the 0.00 is computed from 0, 0, 0: all values are the same]
6 3.54 12.01
7 2.83 1.15
8 2.83 4.24 ----- [the 2.83 is computed from 5, 9, ?]
9 0.00 0.71
10 2.12 1.41
11 ? ?
12 ? ?
I am trying to change the following script which works for mean values (Copied from Average of multiple files having different row sizes)
{
    c = NF
    if (r<FNR) r = FNR
    for (i=3;i<=NF;i++) {
        if ($i != "?") {
            s[FNR "," i] += $i
            n[FNR "," i] += 1
        }
    }
}
END {
    for (i=1;i<=r;i++) {
        printf("%s\t", i)
        for (j=3;j<=c;j++) {
            if (n[i "," j]) {
                printf("%.1f\t", s[i "," j]/n[i "," j])
            } else {
                printf("?\t")
            }
        }
        printf("\n")
    }
}
I understand that I need to modify the script with something like the snippet below, but I am not able to do that.
mean=s[i "," j]/n[i "," j]
for (i=1; i in array ; i++)
sqdif+=(array[i]-mean)**2
printf("%.1f\t", sqdif/(n[i "," j]-1)**0.5)
You need to save the original numbers in columns 3 to NF in order to calculate the standard deviation. One way is to concatenate them into the array values (see v in the code below) and later split to retrieve them in the final calculation of the END block, for example:
$ cat test.awk
{
    nMax = FNR > nMax ? FNR : nMax  # get the max FNR from all files
    for (j=3; j<=NF; j++) {
        if ($j == "?") continue
        v[FNR, j] = v[FNR, j] == "" ? $j : v[FNR, j] FS $j  # concatenate values of (FNR,j) in `v` using FS
        t[FNR, j] += $j             # calculate total for each (FNR,j)
    }
}
END {
    for (i=1; i<=nMax; i++) {
        printf("%d\t", i)
        for (j=3; j<=NF; j++) {
            if ((i,j) in t) {       # if (i,j) exists, split v into vals using default FS
                n = split(v[i,j], vals)
                if (n == 1) {       # print "?" if only 1 item in array vals
                    printf("?")
                } else {            # otherwise, calculate mean `e`, sum `s` and then std
                    e = t[i,j]/n
                    s = 0
                    for (x in vals) s += (vals[x]-e)**2
                    printf("%.2f", sqrt(s/(n-1)))
                }
            } else {                # print "?" if (i,j) does not exist
                printf("?")
            }
            printf(j==NF ? "\n" : "\t")
        }
    }
}
Result running the above code:
$ awk -f test.awk ifile*.txt
1 ? ?
2 ? ?
3 ? ?
4 ? ?
5 ? 0.00
6 3.54 12.01
7 2.83 1.15
8 2.83 4.24
9 0.00 0.71
10 2.12 1.41
11 ? ?
12 ? ?
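The END-block math above is the usual sample standard deviation, sqrt of the sum of squared deviations divided by (n-1). It can be checked in isolation on one column whose answer is known (values 2, 4, 6 have mean 4 and sample std 2):

```shell
printf '%s\n' 2 4 6 |
awk '{ v[NR] = $1; t += $1 }
     END { e = t/NR                               # mean
           for (i = 1; i <= NR; i++) s += (v[i]-e)^2
           printf "%.2f\n", sqrt(s/(NR-1)) }'
# → 2.00
```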

Count the occurrences of a number in all the columns in bash

I have a data set like this:
1 3 3 4 5 2 3 3
2 2 2 1 2 2 2 2
1 3 3 3 3 3 3 3
1 4 4 4 4 4 4 3
I would like to count the number of times the number "one" appears in each column, so I would like output like:
3 0 0 1 0 0 0 0
Does anyone know how to do it in bash?
Thank you very much!
Ana
Do it in awk. Iterate over the number of fields and, if the field is equal to 1, increment the array. Then at the end print the array.
awk '{ for (i = 1; i <= NF; ++i) { if ($i == 1) { ++c[i]; } } }
END{ for (i = 1; i <= NF; ++i) { printf "%d%s", c[i], i!=NF ? OFS : ORS; } }' file
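Run end to end on the question's data, the approach gives exactly the requested counts (an unset c[i] prints as 0 under %d):

```shell
printf '%s\n' '1 3 3 4 5 2 3 3' '2 2 2 1 2 2 2 2' '1 3 3 3 3 3 3 3' '1 4 4 4 4 4 4 3' |
awk '{ for (i = 1; i <= NF; ++i) if ($i == 1) ++c[i] }
     END { for (i = 1; i <= NF; ++i) printf "%d%s", c[i], (i != NF ? OFS : ORS) }'
# → 3 0 0 1 0 0 0 0
```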

AWK printing fields in multiline records

I have an input file with fields in several lines. In this file, the field pattern is repeated according to query size.
ZZZZ
21293
YYYYY XXX WWWW VV
13242 MUTUAL BOTH NO
UUUUU TTTTTTTT SSSSSSSS RRRRR QQQQQQQQ PPPPPPPP
3 0 3 0
NNNNNN MMMMMMMMM LLLLLLLLL KKKKKKKK JJJJJJJJ
2 0 5 3
IIIIII HHHHHH GGGGGGG FFFFFFF EEEEEEEEEEE DDDDDDDDDDD
5 3 0 3
My desired output is one line per total group of fields. Empty fields should be marked, e.g. with "X":
21293 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
12345 67890 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
I have been thinking about how can I get the desired output with awk/unix scripts but can't figure it out. Any ideas? Thank you very much!!!
This isn't really a great fit for awk's style of programming, which is based on fields that are delimited by a pattern, not fields with variable positions on the line. But it can be done.
When you process the first line in each pair, scan through it finding the positions of the beginning of each field name.
awk 'NR%3 == 1 {
    delete fieldpos;
    delete fieldlen;
    lastspace = 1;
    fieldindex = 0;
    for (i = 1; i <= length(); i++) {
        if (substr($0, i, 1) != " ") {
            if (lastspace) {
                fieldpos[fieldindex] = i;
                if (fieldindex > 0) {
                    fieldlen[fieldindex-1] = i - fieldpos[fieldindex-1];
                }
                fieldindex++;
            }
            lastspace = 0;
        } else {
            lastspace = 1;
        }
    }
}
NR%3 == 2 {
    for (i = 0; i < fieldindex; i++) {
        if (i in fieldlen) {
            f = substr($0, fieldpos[i], fieldlen[i]);
        } else {                    # last field, go to end of line
            f = substr($0, fieldpos[i]);
        }
        gsub(/^ +| +$/, "", f);     # trim surrounding spaces
        if (f == "") { f = "X" }
        printf("%s ", f);
    }
}
NR%15 == 14 { print "" }            # print newline after 5 data blocks
'
Assuming your fields are separated by blank chars and not tabs, GNU awk's FIELDWIDTHS is designed to handle this sort of situation:
/^ZZZZ/ { if (rec!="") print rec; rec="" }
/^[[:upper:]]/ {
    FIELDWIDTHS = ""
    while ( match($0,/\S+\s*/) ) {
        FIELDWIDTHS = (FIELDWIDTHS ? FIELDWIDTHS " " : "") RLENGTH
        $0 = substr($0,RLENGTH+1)
    }
    next
}
NF {
    for (i=1;i<=NF;i++) {
        gsub(/^\s+|\s+$/,"",$i)
        $i = ($i=="" ? "X" : $i)
    }
    rec = (rec=="" ? "" : rec " ") $0
}
END { print rec }
$ awk -f tst.awk file
2129 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
In other awks you'd use match()/substr(). Note that the above isn't perfect in that it truncates a char off 21293 - that's because I'm not convinced your input file is accurate and if it is you haven't told us why that number is longer than the string on the preceding line or how to deal with that.
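The match()/substr() fallback can be sketched like this: slice each data line at fixed offsets (which, in the full solution, would be derived from match() against the header line; they are hard-coded here for brevity), trim, and mark empties:

```shell
# Slice two fixed-width columns at offsets 1-5 and 7-9, trim trailing blanks
printf '%s\n' 'UUUUU TTT' '3     0  ' |
awk '{ f1 = substr($0, 1, 5); f2 = substr($0, 7, 3)
       gsub(/ +$/, "", f1); gsub(/ +$/, "", f2)
       print (f1 == "" ? "X" : f1) "|" (f2 == "" ? "X" : f2) }'
# → UUUUU|TTT
#   3|0
```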

awk skipping records. getline command

this is a task related to data compression using fibonacci binary representation.
what i have is this text file:
result.txt
a 20
b 18
c 18
d 15
e 7
this file is the result of scanning a text file and counting the appearances of each char in the file using awk.
now i need to give each char its fibonacci-binary representation length.
since i'm new to ubuntu and the terminal, i've written a program in java that receives a number and prints all the fibonacci codeword lengths up to that number, and it works.
this is exactly what i'm trying to do here. the problem is that my awk version doesn't work...
the lengths of the fibonacci codewords also grow like a fibonacci sequence.
these are the rules:
f(1)=1 - there is 1 codeword of length 1.
f(2)=1 - there is 1 codeword of length 2.
f(3)=2 - there are 2 codewords of length 3.
f(4)=3 - there are 3 codewords of length 4.
and so on...
(i'm adding one more bit to each codeword, so the first two lengths will be 2 and 3)
this is the code i've made: its name is scr5
{
    a=1;
    b=1;
    len=2
    print $1, $2, len;
    getline;
    print $1, $2, len+1;
    getline;
    len=4;
    for (i=1; i<num; i++) {
        c = a+b;
        g = c;
        while (c >= 1) {
            print $1, $2, len;
            if (getline <= 0) {
                print "EOF"
                exit;
            }
            c--;
            i++;
        }
        a=b;
        b=c;
        len++;
    }
}
now i write on the terminal:
n=5
awk -v num=$n -f scr5 a
and there are two problems:
1. it skips the third letter, c.
2. on the fourth letter, d, it prints the length of the first letter, 2, instead of length 3.
i guess there is a problem with the getline command.
thank you very much!
Search Google for getline and awk and you'll mostly find reasons to avoid getline completely! Often it's a sign you're not really doing things the "awk" way. Find an awk tutorial and work through the basics and I'm sure you'll see quickly why your attempt using getlines is not getting you off in the right direction.
In the script below, the BEGIN block is run once at the beginning before any input is read, and then the next block is automatically run once for each line of input --- without any need for getline.
Good luck!
$ cat fib.awk
BEGIN { prior_count = 0; count = 1; len = 1; remaining = count; }
{
    if (remaining == 0) {
        temp = count;
        count += prior_count;
        prior_count = temp;
        remaining = count;
        ++len;
    }
    print $1, $2, len;
    --remaining;
}
$ cat fib.txt
a 20
b 18
c 18
d 15
e 7
f 0
g 0
h 0
i 0
j 0
k 0
l 0
m 0
$ awk -f fib.awk fib.txt
a 20 1
b 18 2
c 18 3
d 15 3
e 7 4
f 0 4
g 0 4
h 0 5
i 0 5
j 0 5
k 0 5
l 0 5
m 0 6
The above solution, in compressed form:
mawk 'BEGIN{ ___= __= _^=____=+_ } !_ { __+=(\
____=___+_*(_=___+=____))^!_ } $++NF = (_--<_)+__' fib.txt
a 20 1
b 18 2
c 18 3
d 15 3
e 7 4
f 0 4
g 0 4
h 0 5
i 0 5
j 0 5
k 0 5
l 0 5
m 0 6
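Either version can be exercised as a one-shot pipeline; here the readable fib.awk logic from above, fed the first five letters inline:

```shell
printf '%s\n' 'a 20' 'b 18' 'c 18' 'd 15' 'e 7' |
awk 'BEGIN { prior_count = 0; count = 1; len = 1; remaining = count }
     { if (remaining == 0) {           # current length exhausted: next fib count
           temp = count; count += prior_count; prior_count = temp
           remaining = count; ++len
       }
       print $1, $2, len; --remaining }'
# → a 20 1
#   b 18 2
#   c 18 3
#   d 15 3
#   e 7 4
```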
