Sort values and output the indices of their sorted columns

I've got a file that looks like:
20 30 40
80 70 60
50 30 40
Each column represents a procedure. I want to know how the procedures did for each row. My ideal output would be
3 2 1
1 2 3
1 3 2
i.e. in row 1, the third column had the highest value, followed by the second, then the first (the order can also be reversed; it doesn't matter).
How would I do this?

I'd do it with some other Unix tools (read, cat, sort, cut, tr, sed, and bash of course):
while read line
do
    # number each value, sort numerically (descending) on the value,
    # then keep only the original positions
    cat -n <(echo "$line" | sed 's/ /\n/g') | sort -rn -k2 | cut -f1 | tr '\n' ' '
    echo
done < input.txt
The output looks like this:
3 2 1
1 2 3
1 3 2

Another solution using Python:
$ python
Python 2.7.6 (default, Jan 26 2014, 17:25:18)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> with open('file.txt') as f:
... lis=[x.split() for x in f]
...
>>> for each in lis:
... each = [i[0] + 1 for i in sorted(enumerate(each), key=lambda x: float(x[1]), reverse=True)]
... print ' '.join([str(item) for item in each])
...
3 2 1
1 2 3
1 3 2

Using GNU Awk version 4:
$ awk 'BEGIN{ PROCINFO["sorted_in"]="@val_num_desc" }
{
    split($0,a," ")
    for (i in a) printf "%s%s", i, OFS
    print ""
}' file
3 2 1
1 2 3
1 3 2
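Since the question allows the reversed order, ranking ascending instead only needs a different traversal mode (same gawk 4 requirement; a small variant of the above, not in the original answer):
$ awk 'BEGIN{ PROCINFO["sorted_in"]="@val_num_asc" }
{
    split($0,a," ")
    for (i in a) printf "%s%s", i, OFS
    print ""
}' file
1 2 3
3 2 1
2 3 1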

If you have GNU awk then you can do something like:
awk '{
    y = a = x = j = i = 0
    delete tmp; delete num; delete ind
    for (i = 1; i <= NF; i++) {
        num[$i, i] = i                  # key is "value SUBSEP column"
    }
    x = asorti(num)                     # sort the keys; num[1..x] now hold them
    for (y = 1; y <= x; y++) {
        split(num[y], tmp, SUBSEP)
        ind[++j] = tmp[2]               # recover the column number
    }
    for (a = x; a >= 1; a--) {
        printf "%s%s", ind[a], (a == 1 ? "\n" : " ")
    }
}' file
$ cat file
20 30 40
0.923913 0.913043 0.880435 0.858696 0.826087 0.902174 0.836957 0.880435
80 70 60
50 30 40
awk '{
    y = a = x = j = i = 0
    delete tmp; delete num; delete ind
    for (i = 1; i <= NF; i++) {
        num[$i, i] = i
    }
    x = asorti(num)
    for (y = 1; y <= x; y++) {
        split(num[y], tmp, SUBSEP)
        ind[++j] = tmp[2]
    }
    for (a = x; a >= 1; a--) {
        printf "%s%s", ind[a], (a == 1 ? "\n" : " ")
    }
}' file
3 2 1
1 2 6 8 3 4 7 5
1 2 3
1 3 2

A solution via Perl:
#!/usr/bin/perl
open(FH, '<', '/home/chidori/input.txt') or die "Can't open file: $!\n";
while (my $line = <FH>) {
    chomp($line);
    my @unsorted_array = split(/\s/, $line);
    my $count = scalar @unsorted_array;
    my @sorted_array = sort { $a <=> $b } @unsorted_array;
    my %hash = map { $_ => $count-- } @sorted_array;   # value => rank, largest value gets 1
    foreach my $value (@unsorted_array) {
        print "$hash{$value} ";
    }
    print "\n";
}

Related

How to normalize the values of specific columns of a csv with awk?

I have a csv with several variables and I would like to normalize only some specific columns using the standard deviation:
the value minus the mean of the variable, divided by the standard deviation of the variable.
The file is comma separated and the transformation needs to be done only with awk for the variables months_loan_duration and amount.
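In other words, the z-score of each value x in the column:
z = (x - mean) / sd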
The input would look like this but with a thousand rows:
checking_balance,months_loan_duration,credit_history,purpose,amount
< 0 DM,6,critical,radio/tv,1169.53
1 - 200 DM,48,repaid,radio/tv,5951.78
,12,critical,education,2096.23
And the output would be like this:
checking_balance,months_loan_duration,credit_history,purpose,amount
< 0 DM,-1.236,critical,radio/tv,-0.745
1 - 200 DM,2.248,repaid,radio/tv,0.95
,-0.738,critical,education,-0.417
So far I have tried the following unsuccessfully:
#! /usr/bin/awk -f
BEGIN { FS=","; OFS=","; numberColumn=NF }
NR!=1
{
    for (i=1; i <= numberColumn; i++) {
        total[i] += $i
        totalSquared[i] += $i^2
    }
    for (i=1; i <= numberColumn; i++) {
        avg[i] = total[i]/(NR-1)
        std[i] = sqrt((totalSquared[i]/(NR-1)) - avg[i]^2)
    }
    for (i=1; i <= numberColumn; i++) {
        norm[i] = (($i - avg[i])/std[i])
    }
}
{
    print $1, $norm[2], 3, 4, $norm[5]
}
It will be easier to read the file twice:
awk -F, -v OFS=, '
NR==FNR {                   # 1st pass: accumulate values
    if (FNR > 1) {
        sx2  += $2          # sum of col2
        sxx2 += $2 * $2     # sum of col2^2
        sx5  += $5          # sum of col5
        sxx5 += $5 * $5     # sum of col5^2
        n++                 # count of samples
    }
    next
}
FNR==1 {                    # 2nd pass, 1st line: calc means and stdevs
    ave2 = sx2 / n          # mean of col2
    var2 = sxx2 / (n - 1) - ave2 * ave2 * n / (n - 1)
    if (var2 < 0) var2 = 0  # avoid rounding error
    sd2 = sqrt(var2)        # stdev of col2
    ave5 = sx5 / n
    var5 = sxx5 / (n - 1) - ave5 * ave5 * n / (n - 1)
    if (var5 < 0) var5 = 0
    sd5 = sqrt(var5)
    print                   # print the header line
}
FNR>1 {
    if (sd2 > 0) $2 = ($2 - ave2) / sd2
    if (sd5 > 0) $5 = ($5 - ave5) / sd5
    print
}
' input_file.csv input_file.csv
Output:
checking_balance,months_loan_duration,credit_history,purpose,amount
< 0 DM,-0.704361,critical,radio/tv,-0.750328
1 - 200 DM,1.14459,repaid,radio/tv,1.13527
,-0.440225,critical,education,-0.384939
Please note the calculated values differ from your expected result.
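As a quick sanity check for months_loan_duration (values 6, 48, 12):
mean = (6 + 48 + 12) / 3 = 22
var  = ((6-22)^2 + (48-22)^2 + (12-22)^2) / (3-1) = (256 + 676 + 100) / 2 = 516
sd   = sqrt(516) ≈ 22.7156
z(6) = (6 - 22) / 22.7156 ≈ -0.704361
which is exactly the first value in the output above (the code computes the same sample variance via the streaming identity var = (Σx² - n·mean²)/(n-1)).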
Thousands of rows isn't all that big a file for awk: we might as well load it all in at once. Here I created a 23.6 mn row synthetic version of it (tested on both gawk and mawk).
While overall performance is similar to other solutions, this code avoids having to explicitly list the input file twice to perform its equivalent of 2-pass processing.
INPUT
rows = 23,622,127. | UTF8 chars = 799192890. | bytes = 799192890.
1 checking_balance,months_loan_duration,credit_history,purpose,amount
2 < 0 DM,889,critical,luna,758.61
3 ,150,critical,terra,1823.93
4 1 - 200 DM,883,repaid,stablecoin,2525.55
5 1 - 200 DM,65,repaid,terra,2405.67
6 < 0 DM,9,critical,luna,4059.34
7 < 0 DM,201,critical,stablecoin,5043
8 1 - 200 DM,549,repaid,terra,471.92
9 < 0 DM,853,critical,stablecoin,422.78
10 < 0 DM,659,critical,luna,684.94
CODE
# gawk profile, created Tue May 24 04:11:02 2022
'function abs(_) {
return \
+_<-_?-_:_
} BEGIN {
split(_____=(_=length(FS = RS = "^$"))+_,____,"")
}
END {
1 gsub("\n", ",&")
1 FS = "["(OFS= ",")"]"
1 $!_ = $!( __ = _)
1 __+= --NF
23622126 while ((_____+_) < (__-=_)) {
23622126 ____[___=_____] += ($__)^_
23622126 ____[ --___ ] += ($__)
23622126 ____[___ * _] += -_^!_
23622126 ____[___-=+_] += ($(__-=_+_^!_))
23622126 ____[ ++___ ] += ($__)^_
}
1 ___ = (__=-____[_+_+_])-_^!_
1 RS = -(abs((____[(_)]/___-(((NR=____[+_^!+_]/__)^_)*__/___)))^_^(_/-_)
___ = -(abs((____[_+_]/___-(((RT=____[_+_^!_]/__)^_)*__/___)))^_^(_/-_)
1 ORS = "\n"
1 gsub(ORS, "")
1 OFS = ","
1 print $(_^=_<_), $(__=++_), $++_, $++_, $++_
1 OFMT = "%."(__*__+!(__=NF-__-__))"f"
23622126 while (++_ <= __) {
23622126 print $_, (NR-$++_)/RS, $++_, $++_, (RT-$++_)/___
}
}'
OUTPUT
out9: 837MiB 0:00:28 [29.2MiB/s] [29.2MiB/s] [ <=> ]
in0: 762MiB 0:00:00 [2.95GiB/s] [2.95GiB/s] [======>] 100%
( pvE 0.1 in0 < "${f}" | LC_ALL=C mawk2 ; )
26.98s user 1.58s system 99% cpu 28.681 total
23622127 878032266 878032266 testfile_stdnorm_test_004.txt_out.txt
1 checking_balance,months_loan_duration,credit_history,purpose,amount
2 < 0 DM,1.2000,critical,luna,-1.2939
3 ,-1.2949,critical,terra,-0.6788
4 1 - 200 DM,1.1798,repaid,stablecoin,-0.2737
5 1 - 200 DM,-1.5818,repaid,terra,-0.3429
6 < 0 DM,-1.7709,critical,luna,0.6119
7 < 0 DM,-1.1227,critical,stablecoin,1.1798
8 1 - 200 DM,0.0522,repaid,terra,-1.4594
9 < 0 DM,1.0785,critical,stablecoin,-1.4878
ALTERNATE SOLUTION OPTIMIZED FOR SMALLER INPUTS (e.g. up to 10^6 (1 mn) rows)
# gawk profile, created Tue May 24 06:19:24 2022
# BEGIN rule(s)
BEGIN {
1 __ = (FS = RS = "^$") * (ORS = "")
}
# END rule(s)
END {
1 _ = $__
1 gsub("[\n][,]","\n_,",_)
1 sub("^.+amount\n","",_)+gsub("[,][0-9.+-]+[,\n]", "\3&\1", _)
1 _____ = "[^0-9.+-]+"
1 gsub("^" (_____) "|\1[^\1\3]+\3","",_)
1 _____ = __ = split(_,___,_____)
1048575 while (-(--__) < +__) {
1048575 ___["_"] += _=___[(__)]
1048575 ___["="] += _*_
1048575 ___["~"] += _=___[--__]
1048575 ___["^"] += _*_
1048575 ___[":"]++
}
1 _ = (__=___[":"])-(____ ^= _<_)
1 ++____
1 ___["}"] = -(abs((___["^"]/_)-(((___["{"] = ___["~"] / __)^____)*__/_)))^____^(-(_^(!_)))
1 ___[")"] = -(abs((___["="]/_)-(((___["("] = ___["_"] / __)^____)*__/_)))^____^(-(_^(!_)))
1 if (_ < _) {
for (_ in ___) {
print "debug", _, ___[_]
}
}
1 ____ = split($(_ < _), ______, ORS = "\n")
1 _ = index(FS = "[" (OFS = ",") "]", OFS)
1 print ______[_ ^ (! _)]
1048574 for (__ += __ ^= _ < _; __ < ____; __++) {
1048574 print sprintf("%.*s%s,%+.*f,%s,%s,%+.*f", ! __, $! _ = ______[__], $(_ ~ _), _ + _, (___["{"] - $_) / ___["}"], $++_, $(--_ + _), _ + _, (___["("] - $NF) / ___[")"])
}
}
# Functions, listed alphabetically
2 function abs(_)
{
2 return (+_ < -_ ? -_ : _)
}
PERFORMANCE OF SOLUTION # 2 : End-to-End 2.57 secs for 2^20 rows
rows = 1048575. | UTF8 chars = 39912117. | bytes = 39912117.
( pvE 0.1 in0 < "${f}" | LC_ALL=C mawk2 ; )
2.46s user 0.13s system 100% cpu 2.573 total

How to Pivot Data Using AWK

From:
DT X Y Z
10 75 0 3
20 100 1 6
30 125 2 9
To:
DT ID VALUE
10 X 75
20 Y 0
30 Z 3
10 X 100
20 Y 1
30 Z 6
10 X 125
20 Y 2
30 Z 9
Here's how it's done:
# my original dataset is separated by "," and has 280 cols
tempfile=dataset.csv;
col_count=`head -n1 $tempfile | tr -cd "," | wc -c`;
col_count=`expr $col_count + 1`;
for i in `seq 4 $col_count`; do
    echo $i;
    pt="{print \$"$i"}";
    col_name=`head -n 1 $tempfile | sed 's/ //g' | awk -F"," "$pt"`;
    awk -F"," -v header="DT,ID,$col_name" -f st.awk $tempfile | awk 'NR>1 {print substr($0,index($0,$1))",'"$col_name"'"}' | sed 's/ //g' >> New$tempfile;
done;
# file st.awk:
# the code below was found on some stackoverflow page, with some minor changes
BEGIN {
    # Parse headers into an assoc array h
    split(header, a, ",")
    for (i in a) {
        h[a[i]] = 2
    }
}
# Find the column numbers in the first line of a file
FNR==1 {
    split("", cols)                 # this will re-init cols
    for (i=1; i<=NF; i++) {
        if ($i in h) {
            cols[i] = 1
        }
    }
    next
}
# Print those columns on all other lines
{
    res = ""
    for (i=1; i<=NF; i++) {
        if (i in cols) {
            s = (res != "" ? "," : "")  # separator only between values
            res = res s $i
        }
    }
    if (res) {
        print res
    }
}
You can try this awk (MAWK Version 1.2).
Your data can be 5x5 or more:
mawk -v OFS='\t' '
NR==1 {
    nbfield = (NF-1)
    for (i=1; i<NF; i++)
        ID[i] = $(i+1)
    print $1 OFS "ID" OFS "VALUE"
    next
}
{
    numrecord = ((NR-1) % nbfield)
    numrecord = numrecord ? numrecord : nbfield
    for (i=0; i<=nbfield; i++)
        val[ID[i], numrecord] = $(i+1)
}
numrecord==nbfield {
    for (i=1; i<=nbfield; i++)
        for (j=1; j<=nbfield; j++)
            print val[ID[0], j] OFS ID[j] OFS val[ID[j], i]
}
' infile
Input:
-- ColOne ColTwo ColThr
RowOne A B C D E
RowTwo F G H I J
RowThr K L M N O
RowFor P Q R S T
RowFiv U V W X Y
Output:
RowNbr | ColNbr | RowColVal
------ | ------ | ---------
RowOne | ColOne | A
RowOne | ColTwo | B
RowOne | ColThr | C
RowTwo | ColOne | F
RowTwo | ColTwo | G
RowTwo | ColThr | H
RowThr | ColOne | K
RowThr | ColTwo | L
RowThr | ColThr | M
Pivot script:
# pivot a table
BEGIN { # before processing input lines, emit output header
    OFS = " | "                                     # set the output field-separator
    fmtOutDtl = "%6s | %-6s | %-9s" "\n"            # output format for all detail lines: InpRowHdr, InpColHdr, InpVal
    fmtOutHdr = "%6s | ColNbr | RowColVal" "\n"     # output format for the header line
    strOutDiv = "------ | ------ | ---------"       # the divider line
    print ""                                        # emit blank line before output
} # done with output header
NR == 1 { # when we are on the input header line / the first row
    FldCnt = ( NF - 1 )                             # number of columns to process: fields on this row, except the first val
    for( idxCol = 1; idxCol < NF; idxCol++ )        # scan col numbers after the first, ignoring the first val
        ColHds[ idxCol ] = $( idxCol + 1 )          # store the next col-val as this ColHdr
    printf( fmtOutHdr, "RowNbr" )                   # emit header line: RowNbr-header, input column headers
    print strOutDiv                                 # emit divider row after header line and before data lines
    next                                            # skip to the next input row
} # done with first input row
{ # for each body input row
    RecNbr = ( ( NR - 1 ) % FldCnt )                # ( NR - 1 ) Mod [number of fields]: zero-based / 0..[number_of_cols-1]
    RecNbr = RecNbr ? RecNbr : FldCnt               # promote from zero-based to one-based: 0 => [number of fields]
    for( idxCol = 0; idxCol <= FldCnt; idxCol++ )   # scan col numbers including the first
        Rws[ ColHds[ idxCol ], RecNbr ] = $( idxCol + 1 )   # store this row+col val in this Row position under this ColHdr
} # done with this body input row
RecNbr == FldCnt { # when we are on the last input row that we are processing (lines beyond FldCnt are not emitted)
    for( idxCol = 1; idxCol <= FldCnt; idxCol++ ) {     # scan col numbers after the first
        for( idxRow = 1; idxRow <= FldCnt; idxRow++ ) { # scan row numbers, up to the number of cols
            printf( fmtOutDtl \
                   ,Rws[ ColHds[ 0 ] , idxCol ] \
                   , ColHds[ idxRow ] \
                   ,Rws[ ColHds[ idxRow ] , idxCol ] )  # emit input rowHdr, colHdr, row+col val
        } # done scanning row numbers
        print ""                                    # emit a blank line after each input row
    } # done scanning col numbers
} # done with the last input row
END { # after processing input lines; do nothing
}
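Assuming the script above is saved as pivot.awk (the filename is my choice), it would be run as:
awk -f pivot.awk infile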

Compare consecutive columns of a file and obtain the number of matched elements

I want to compare consecutive columns of a file and return the number of matched elements. I would prefer to use shell scripting or awk. Here is a sample bash/AWK script that I am trying to use.
#!/bin/bash
for i in 3 4 5 6 7 8 9
do
for j in 3 4 5 6 7 8 9
do
`awk "$i == $j" phased.txt | wc -l`
done
done
I have a file of size 147189*828 (rows*columns) and I want to compare each pair of columns and return the number of matched elements in an 828*828 matrix (a similarity matrix).
This would be fairly easy in MATLAB, but it takes a long time to load huge files. I can compare two columns and return the number of matched elements with the following awk command:
awk '$3==$4' phased.txt | wc -l
but would need some help to do it for the entire file.
A snippet of the data that I'm working on is:
# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513
M rs4124251 0 0 A G 0 A
M rs6650104 0 A C T 0 0
M rs12184279 0 0 G A T 0
................................................................................
After comparing, I will compute a 6*6 matrix in this case, containing the matching percentage of these columns.
In bash, variables need a $ to be interpreted, so your awk "$i == $j" phased.txt | wc -l will be evaluated as awk "3 == 4" phased.txt | wc -l; then, because of your backticks (`), the shell will try to execute that as a command. To get awk to see $3 == $4, you need to add \$: awk "\$$i == \$$j" phased.txt | wc -l.
#!/bin/bash
for i in 3 4 5 6 7 8 9
do
for j in 3 4 5 6 7 8 9
do
awk "\$$i == \$$j" phased.txt | wc -l
done
done
Though you'll probably want to show which combination you're evaluating:
#!/bin/bash
for i in 3 4 5 6 7 8 9
do
for j in 3 4 5 6 7 8 9
do
echo "$i $j: $(awk "\$$i == \$$j" phased.txt | wc -l)"
done
done
You could actually just do the count in awk directly:
#!/bin/bash
for i in 3 4 5 6 7 8 9
do
for j in 3 4 5 6 7 8 9
do
echo "$i $j: $(awk "\$$i == \$$j {count++}; END{print count}" phased.txt)"
done
done
Finally, you could just do the whole thing in awk; it'll almost certainly be faster but to be honest it's not that much cleaner: [UNTESTED]
#!/usr/bin/env awk -f
{
for (i = 3; i <= 9; i++) {
for (j = 3; j <= 9; j++) {
if ($i == $j) {
counts[i, j]++
}
}
}
}
END {
for (i = 3; i <= 9; i++) {
for (j = 3; j <= 9; j++) {
printf "%d = %d: %d\n", i, j, counts[i, j]
}
}
}
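To cover all 828 columns and report the matching percentages the question asks for, here is an untested sketch in the same spirit; it assumes whitespace-separated fields, data columns starting at field 3, and a constant field count. The O(rows × cols²) pair loop gets slow at 828 columns, but the file is only read once:
#!/usr/bin/env awk -f
# Build a similarity matrix: percentage of rows in which
# columns i and j hold the same value.
NR == 1 { next }                      # skip the "# sampleID ..." header line
{
    if (nc == 0) nc = NF              # remember the column count
    for (i = 3; i <= nc; i++)
        for (j = i + 1; j <= nc; j++) # only i < j; the matrix is symmetric
            if ($i == $j) counts[i, j]++
    rows++
}
END {
    for (i = 3; i <= nc; i++) {
        for (j = 3; j <= nc; j++) {
            if (i == j)     pct = 100
            else if (i < j) pct = 100 * counts[i, j] / rows
            else            pct = 100 * counts[j, i] / rows
            printf "%s%.2f", (j > 3 ? OFS : ""), pct
        }
        print ""
    }
}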

Calculating sum of gradients with awk

I have a file that contains 4 columns such as:
A B C D
1 2 3 4
10 20 30 40
100 200 300 400
.
.
.
I can calculate gradient of columns B to D versus A such as following commands:
awk 'NR>1{print $0,($2-b)/($1-a)} {a=$1;b=$2}' file
How can I print sum of gradients as the 5th column in the file? The results should be:
A B C D sum
1 2 3 4 1+2+3+4=10
10 20 30 40 (20-2)/(10-1)+(30-3)/(10-1)+(40-4)/(10-1)=9
100 200 300 400 (200-20)/(100-10)+(300-30)/(100-10)+(400-40)/(100-10)=9
.
.
.
awk 'NR == 1 { print $0, "sum"; next } { if (NR == 2) { sum = $1 + $2 + $3 + $4 } else { t = $1 - a; sum = ($2 - b) / t + ($3 - c) / t + ($4 - d) / t } print $0, sum; a = $1; b = $2; c = $3; d = $4 }' file
Output:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
With ... | column -t:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
Update:
#!/usr/bin/awk -f
NR == 1 {
print $0, "sum"
next
}
{
sum = 0
if (NR == 2) {
for (i = 1; i <= NF; ++i)
sum += $i
} else {
t = $1 - a[1]
for (i = 2; i <= NF; ++i)
sum += ($i - a[i]) / t
}
print $0, sum
for (i = 1; i <= NF; ++i)
a[i] = $i
}
Usage:
awk -f script.awk file
If you apply the same logic to the first line of numbers as you do to the rest, taking the initial value of each column as 0, you get 9 as the result of the sum (as it was in your question originally). This approach uses a loop to accumulate the sum of the gradient from the second field up to the last one. It uses the fact that on the first time round, the uninitialised values in the array a evaluate to 0:
awk 'NR==1 { print $0, "sum"; next }
{
s = 0
for(i=2;i<=NF;++i) s += ($i-a[i])/($1-a[1]) # accumulate sum
for(i=1;i<=NF;++i) a[i] = $i # fill array to be used for next iteration
print $0, s
}' file
You can pack it all onto one line if you want, but remember to separate the statements with semicolons. It's also slightly shorter to use a single for loop with an if, as long as the denominator $1-a[1] is saved off before a[1] is overwritten:
awk 'NR==1{print $0,"sum";next}{s=0;d=$1-a[1];for(i=1;i<=NF;++i){if(i>1)s+=($i-a[i])/d;a[i]=$i}print $0,s}' file
Output:
A B C D sum
1 2 3 4 9
10 20 30 40 9
100 200 300 400 9

Grouping the rows of a text file based on 2 columns

I have a text file like this:
1 abc 2
1 rgt 2
1 yhj 2
3 gfk 4
5 kji 6
3 plo 4
3 vbn 4
5 olk 6
I want to group the rows on the basis of first and second column like this:
1 abc,rgt,yhj 2
3 gfk,plo,vbn 4
5 kji,olk 6
such that I can see what are the values of col2 for a particular pair of col1, col3.
How can I do this using shell script?
This should do it:
awk -F " " '{ a[$1" "$3]=a[$1" "$3]$2","; }END{ for (i in a)print i, a[i]; }' file.txt | sed 's/,$//g' | awk -F " " '{ tmp=$3;$3=$2;$2=tmp;print }' |sort
Just using awk:
#!/usr/bin/env awk -f
{
k = $1 "\x1C" $3
if (k in a2) {
a2[k] = a2[k] "," $2
} else {
a1[k] = $1
a2[k] = $2
a3[k] = $3
b[++i] = k
}
}
END {
for (j = 1; j <= i; ++j) {
k = b[j]
print a1[k], a2[k], a3[k]
}
}
One line:
awk '{k=$1"\x1C"$3;if(k in a2){a2[k]=a2[k]","$2}else{a1[k]=$1;a2[k]=$2;a3[k]=$3;b[++i]=k}}END{for(j=1;j<=i;++j){k=b[j];print a1[k],a2[k],a3[k]}}' file
Output:
1 abc,rgt,yhj 2
3 gfk,plo,vbn 4
5 kji,olk 6
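If you have gawk 4+, you can also keep everything in one process and sort by key inside awk, reusing the PROCINFO["sorted_in"] trick shown earlier (a sketch, not part of the original answers):
awk '{ k = $1 " " $3; a[k] = (k in a ? a[k] "," $2 : $2) }
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"   # iterate keys in numeric order of col1
    for (k in a) { split(k, p, " "); print p[1], a[k], p[2] }
}' file.txt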
