How to normalize the values of specific columns of a csv with awk? - bash

I have a csv with several variables and I would like to normalize only some specific columns using the standard deviation.
The value minus the mean of the variable divided by the standard deviation of the variable.
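In other words, each selected value x should become z = (x - mean of the column) / standard deviation of the column, i.e. the usual z-score.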
The file is comma separated, and the transformation needs to be done only with awk, applied to the variables months_loan_duration and amount.
The input would look like this but with a thousand rows:
checking_balance,months_loan_duration,credit_history,purpose,amount
< 0 DM,6,critical,radio/tv,1169.53
1 - 200 DM,48,repaid,radio/tv,5951.78
,12,critical,education,2096.23
And the output would be like this:
checking_balance,months_loan_duration,credit_history,purpose,amount
< 0 DM,-1.236,critical,radio/tv,-0.745
1 - 200 DM,2.248,repaid,radio/tv,0.95
,-0.738,critical,education,-0.417
So far I have tried the following unsuccessfully:
#! /usr/bin/awk -f
BEGIN{FS=","; OFS=",";numberColumn=NF}
NR!=1
{
    for(i=1;i <= numberColumn;i++)
    {
        total[i]+=$i;
        totalSquared[i]+=$i^2;
    }
    for (i=1;i <= numberColumn;i++)
    {
        avg[i]=total[i]/(NR-1);
        std[i]=sqrt((totalSquared[i]/(NR-1))-avg[i]^2);
    }
    for (i=1;i <= numberColumn;i++)
    {
        norm[i]=(($i-avg[i])/std[i])
    }
}
{
    print $1,$norm[2],3,4,$norm[5]
}

It will be easier to read the file twice:
awk -F, -v OFS=, '
NR==FNR {                      # 1st pass: accumulate values
    if (FNR > 1) {
        sx2  += $2             # sum of col2
        sxx2 += $2 * $2        # sum of col2^2
        sx5  += $5             # sum of col5
        sxx5 += $5 * $5        # sum of col5^2
        n++                    # count of samples
    }
    next
}
FNR==1 {                       # 2nd pass, 1st line: calc means and stdevs
    ave2 = sx2 / n             # mean of col2
    var2 = sxx2 / (n - 1) - ave2 * ave2 * n / (n - 1)
    if (var2 < 0) var2 = 0     # avoid rounding error
    sd2 = sqrt(var2)           # stdev of col2
    ave5 = sx5 / n
    var5 = sxx5 / (n - 1) - ave5 * ave5 * n / (n - 1)
    if (var5 < 0) var5 = 0
    sd5 = sqrt(var5)
    print                      # print the header line
}
FNR>1 {
    if (sd2 > 0) $2 = ($2 - ave2) / sd2
    if (sd5 > 0) $5 = ($5 - ave5) / sd5
    print
}
' input_file.csv input_file.csv
Output:
checking_balance,months_loan_duration,credit_history,purpose,amount
< 0 DM,-0.704361,critical,radio/tv,-0.750328
1 - 200 DM,1.14459,repaid,radio/tv,1.13527
,-0.440225,critical,education,-0.384939
Please note the calculated values differ from your expected result.

Thousands of rows isn't all that big a file for awk: you might as well load it all in at once. Here I created a 23.6-million-row synthetic version of it (tested on both gawk and mawk).
While overall performance is similar to the other solutions, this code avoids having to explicitly list the input file twice to perform its equivalent of 2-pass processing.
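The core trick is setting RS to a regex that can never match ("^$"), so gawk or mawk reads the entire file as a single record and all the work happens in the END block, where the two passes are simulated in memory. A stripped-down, readable sketch of that idea (not the profiled program shown below; the variable names are mine, and it assumes gawk/mawk and that the file fits in memory):
awk -v OFS=, '
BEGIN { RS = "^$" }                  # never-matching RS: the whole file becomes one record
END {
    n = split($0, line, "\n")        # break the slurped record back into lines
    for (i = 2; i <= n; i++) {       # pass 1 over the array: sums and sums of squares (header skipped)
        if (line[i] == "") continue
        split(line[i], f, ",")
        s2 += f[2]; q2 += f[2]^2
        s5 += f[5]; q5 += f[5]^2
        cnt++
    }
    m2 = s2 / cnt; sd2 = sqrt((q2 - cnt * m2^2) / (cnt - 1))
    m5 = s5 / cnt; sd5 = sqrt((q5 - cnt * m5^2) / (cnt - 1))
    print line[1]                    # header
    for (i = 2; i <= n; i++) {       # pass 2 over the array: normalize and print
        if (line[i] == "") continue
        split(line[i], f, ",")
        print f[1], (f[2] - m2) / sd2, f[3], f[4], (f[5] - m5) / sd5
    }
}' input_file.csv
The profiled program below applies the same slurp-and-END pattern, just heavily golfed.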
INPUT
rows = 23,622,127. | UTF8 chars = 799192890. | bytes = 799192890.
1 checking_balance,months_loan_duration,credit_history,purpose,amount
2 < 0 DM,889,critical,luna,758.61
3 ,150,critical,terra,1823.93
4 1 - 200 DM,883,repaid,stablecoin,2525.55
5 1 - 200 DM,65,repaid,terra,2405.67
6 < 0 DM,9,critical,luna,4059.34
7 < 0 DM,201,critical,stablecoin,5043
8 1 - 200 DM,549,repaid,terra,471.92
9 < 0 DM,853,critical,stablecoin,422.78
10 < 0 DM,659,critical,luna,684.94
CODE
# gawk profile, created Tue May 24 04:11:02 2022
'function abs(_) {
return \
+_<-_?-_:_
} BEGIN {
split(_____=(_=length(FS = RS = "^$"))+_,____,"")
}
END {
1 gsub("\n", ",&")
1 FS = "["(OFS= ",")"]"
1 $!_ = $!( __ = _)
1 __+= --NF
23622126 while ((_____+_) < (__-=_)) {
23622126 ____[___=_____] += ($__)^_
23622126 ____[ --___ ] += ($__)
23622126 ____[___ * _] += -_^!_
23622126 ____[___-=+_] += ($(__-=_+_^!_))
23622126 ____[ ++___ ] += ($__)^_
}
1 ___ = (__=-____[_+_+_])-_^!_
1 RS = -(abs((____[(_)]/___-(((NR=____[+_^!+_]/__)^_)*__/___)))^_^(_/-_)
___ = -(abs((____[_+_]/___-(((RT=____[_+_^!_]/__)^_)*__/___)))^_^(_/-_)
1 ORS = "\n"
1 gsub(ORS, "")
1 OFS = ","
1 print $(_^=_<_), $(__=++_), $++_, $++_, $++_
1 OFMT = "%."(__*__+!(__=NF-__-__))"f"
23622126 while (++_ <= __) {
23622126 print $_, (NR-$++_)/RS, $++_, $++_, (RT-$++_)/___
}
}'
OUTPUT
out9: 837MiB 0:00:28 [29.2MiB/s] [29.2MiB/s] [ <=> ]
in0: 762MiB 0:00:00 [2.95GiB/s] [2.95GiB/s] [======>] 100%
( pvE 0.1 in0 < "${f}" | LC_ALL=C mawk2 ; )
26.98s user 1.58s system 99% cpu 28.681 total
23622127 878032266 878032266 testfile_stdnorm_test_004.txt_out.txt
1 checking_balance,months_loan_duration,credit_history,purpose,amount
2 < 0 DM,1.2000,critical,luna,-1.2939
3 ,-1.2949,critical,terra,-0.6788
4 1 - 200 DM,1.1798,repaid,stablecoin,-0.2737
5 1 - 200 DM,-1.5818,repaid,terra,-0.3429
6 < 0 DM,-1.7709,critical,luna,0.6119
7 < 0 DM,-1.1227,critical,stablecoin,1.1798
8 1 - 200 DM,0.0522,repaid,terra,-1.4594
9 < 0 DM,1.0785,critical,stablecoin,-1.4878
ALTERNATE SOLUTION OPTIMIZED FOR SMALLER INPUTS (e.g. up to 10^6 (1 mn) rows)
# gawk profile, created Tue May 24 06:19:24 2022
# BEGIN rule(s)
BEGIN {
1 __ = (FS = RS = "^$") * (ORS = "")
}
# END rule(s)
END {
1 _ = $__
1 gsub("[\n][,]","\n_,",_)
1 sub("^.+amount\n","",_)+gsub("[,][0-9.+-]+[,\n]", "\3&\1", _)
1 _____ = "[^0-9.+-]+"
1 gsub("^" (_____) "|\1[^\1\3]+\3","",_)
1 _____ = __ = split(_,___,_____)
1048575 while (-(--__) < +__) {
1048575 ___["_"] += _=___[(__)]
1048575 ___["="] += _*_
1048575 ___["~"] += _=___[--__]
1048575 ___["^"] += _*_
1048575 ___[":"]++
}
1 _ = (__=___[":"])-(____ ^= _<_)
1 ++____
1 ___["}"] = -(abs((___["^"]/_)-(((___["{"] = ___["~"] / __)^____)*__/_)))^____^(-(_^(!_)))
1 ___[")"] = -(abs((___["="]/_)-(((___["("] = ___["_"] / __)^____)*__/_)))^____^(-(_^(!_)))
1 if (_ < _) {
for (_ in ___) {
print "debug", _, ___[_]
}
}
1 ____ = split($(_ < _), ______, ORS = "\n")
1 _ = index(FS = "[" (OFS = ",") "]", OFS)
1 print ______[_ ^ (! _)]
1048574 for (__ += __ ^= _ < _; __ < ____; __++) {
1048574 print sprintf("%.*s%s,%+.*f,%s,%s,%+.*f", ! __, $! _ = ______[__], $(_ ~ _), _ + _, (___["{"] - $_) / ___["}"], $++_, $(--_ + _), _ + _, (___["("] - $NF) / ___[")"])
}
}
# Functions, listed alphabetically
2 function abs(_)
{
2 return (+_ < -_ ? -_ : _)
}
PERFORMANCE OF SOLUTION # 2 : End-to-End 2.57 secs for 2^20 rows
rows = 1048575. | UTF8 chars = 39912117. | bytes = 39912117.
( pvE 0.1 in0 < "${f}" | LC_ALL=C mawk2 ; )
2.46s user 0.13s system 100% cpu 2.573 total

Related

sum by year and insert missing entries with 0

I have a report for year-month entries like below
201703 5
201708 10
201709 20
201710 40
201711 80
201712 100
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201902 10
I need to sum the year-month entries by year and print the total after all the months for that particular year. The report can have missing entries for any month(s).
For those months a dummy value (0) should be inserted.
Required output:
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
I can get the summary for each year using the command below.
awk ' { c=substr($1,0,4); if(c!=p) { print p,s ;s=0} s=s+$2 ; p=c ; print } ' ym.dat
But how do I insert entries for the missing months?
Also, the last entry should not exceed the current (system time) year-month; i.e. for this specific example, dummy values should not be inserted for 201904, 201905, etc. It should just stop at 201903.
You may use this awk script mmyy.awk:
{
    rec[$1] = $2;
    yy=substr($1, 1, 4)
    mm=substr($1, 5, 2) + 0
    ys[yy] += $2
}
NR == 1 {
    fm = mm
    fy = yy
}
END {
    for (y=fy; y<=cy; y++)
        for (m=1; m<=12; m++) {
            # print previous years sums
            if (m == 1 && y-1 in ys)
                print y-1, ys[y-1]
            if (y == fy && m < fm)
                continue;
            else if (y == cy && m > cm)
                break;
            # print year month with values or 0 if entry is missing
            k = sprintf("%d%02d", y, m)
            printf "%d%02d %d\n", y, m, (k in rec ? rec[k] : 0)
        }
    print y-1, ys[y-1]
}
Then call it as:
awk -v cy=$(date '+%Y') -v cm=$(date '+%m') -f mmyy.awk file
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
With GNU awk for strftime():
$ cat tst.awk
NR==1 {
begDate = $1
endDate = strftime("%Y%m")
}
{
val[$1] = $NF
year = substr($1,1,4)
}
year != prevYear { prt(); prevYear=year }
END { prt() }
function prt( mth, sum, date) {
if (prevYear != "") {
for (mth=1; mth<=12; mth++) {
date = sprintf("%04d%02d", prevYear, mth)
if ( (date >= begDate) && (date <=endDate) ) {
print date, val[date]+0
sum += val[date]
delete val[date]
}
}
print prevYear, sum+0
}
}
$ awk -f tst.awk file
201703 5
201704 0
201705 0
201706 0
201707 0
201708 10
201709 20
201710 40
201711 80
201712 100
2017 255
201801 0
201802 0
201803 25
201804 50
201805 50
201806 150
201807 300
201808 200
201809 0
201810 0
201811 0
201812 0
2018 775
201901 0
201902 10
201903 0
2019 10
With other awks you'd just pass in endDate using awk -v endDate=$(date +'%Y%m') '...'
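For instance, a minimal sketch of that portable variant (the same logic as tst.awk above, with the strftime() call replaced by the -v assignment, otherwise unchanged):
awk -v endDate=$(date +'%Y%m') '
NR==1 { begDate = $1 }
{
    val[$1] = $NF
    year = substr($1,1,4)
}
year != prevYear { prt(); prevYear = year }
END { prt() }
function prt( mth, sum, date) {
    if (prevYear != "") {
        for (mth=1; mth<=12; mth++) {
            date = sprintf("%04d%02d", prevYear, mth)
            if ( (date >= begDate) && (date <= endDate) ) {
                print date, val[date]+0
                sum += val[date]
                delete val[date]
            }
        }
        print prevYear, sum+0
    }
}' file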
Perl to the rescue!
perl -lane '$start ||= $F[0];
$Y{substr $F[0], 0, 4} += $F[1];
$YM{$F[0]} = $F[1];
END { for $y (sort keys %Y) {
for $m (1 .. 12) {
$m = sprintf "%02d", $m;
next if "$y$m" lt $start;
print "$y$m ", $YM{$y . $m} || 0;
last if $y == 1900 + (localtime)[5]
&& (localtime)[4] < $m;
}
print "$y ", $Y{$y} || 0;
}
}' -- file
-n reads the input line by line
-l removes newlines from input and adds them to output
-a splits each line on whitespace into the @F array
substr extracts the year from the YYYYMM date. Hashes %Y and %YM use dates as keys and the counts as values. That's why the year hash uses += which adds the value to the already accumulated one.
The END block is evaluated after the input has been exhausted.
It just iterates over the years stored in the hash; the range 1 .. 12 is used for the months so that zeroes can be inserted for missing ones (the || operator supplies the 0).
next and $start skip the months before the start of the report.
last is responsible for skipping the rest of the current year.
The following awk script will do what you expect. The idea is:
store data in an array
print and sum only when the year changes
This gives:
# function that prints the year starting
# at month m1 and ending at m2
function print_year(m1,m2, s,str) {
s=0
for(i=(m1+0); i<=(m2+0); ++i) {
str=y sprintf("%0.2d",i);
print str, a[str]+0; s+=a[str]
}
print y,s
}
# This works for GNU awk, replace for posix with a call as
# awk -v stime=$(date "+%Y%m") -f script.awk file
BEGIN{ stime=strftime("%Y%m") }
# initializer on first record
(NR==1){ y=substr($1,1,4); m1=substr($1,5) }
# print intermediate year
(substr($1,1,4) != y) {
print_year(m1,12)
y=substr($1,1,4); m1="01";
delete a
}
# set array value and keep track of last month
{a[$1]=$2; m2=substr($1,5)}
# check if entry is still valid (past stime or not)
($1 > stime) { exit }
# print all missing years full
# print last year upto system time month
END {
for (;y<substr(stime,1,4)+0;y++) { print_year(m1,12); m1=1; m2=12; }
print_year(m1,substr(stime,5))
}
Nice question, btw. Friday afternoon brain fryer. Time to head home.
In awk. The optional endtime and its value are brought in as arguments:
$ awk -v arg1=201904 -v arg2=100 '         # optional parameters
function foo(ym,v) {
    while(p<ym){
        y=substr(p,1,4)                    # get year from previous round
        m=substr(p,5,2)+0                  # get month
        p=y+(m==12) sprintf("%02d",m%12+1) # December magic
        if(m==12)
            print y,s[y]                   # print the sums (delete maybe?)
        print p, (p==ym?v:0)               # print yyyymm and 0/$2
    }
}
{
    s[substr($1,1,4)]+=$2                  # sums in array, year index
}
NR==1 {                                    # handle first record
    print
    p=$1
}
NR>1 {
    foo($1,$2)
}
END {
    if(arg1)
        foo(arg1,arg2)
    print y=substr($1,1,4),s[y]+arg2
}' file
Tail from output:
2018 775
201901 0
201902 10
201903 0
201904 100
2019 110

How to Pivot Data Using AWK

From:
DT X Y Z
10 75 0 3
20 100 1 6
30 125 2 9
To:
DT ID VALUE
10 X 75
20 Y 0
30 Z 3
10 X 100
20 Y 1
30 Z 6
10 X 125
20 Y 2
30 Z 9
It's done; here is what I ended up using:
#my original dataset is separated by "," and has 280 cols
tempfile=dataset.csv;
col_count=`head -n1 $tempfile | tr -cd "," | wc -c`;
col_count=`expr $col_count + 1`;
for i in `seq 4 $col_count`; do
echo $i;
pt="{print \$"$i"}";
col_name=`head -n 1 $tempfile | sed s'/ //'g | awk -F"," "$pt"`;
awk -F"," -v header="DT,ID,$col_name" -f st.awk $tempfile | awk 'NR>1 {print substr($0,index($0,$1))",'"$col_name"'"}' | sed 's/ //g' >> New$tempfile;
done;
# file st.awk:
# the code below was found on some stackoverflow page, with some minor changes
BEGIN {
    # Parse headers into an assoc array h
    split(header, a, ",")
    for(i in a) {
        h[a[i]]=2
    }
}
# Find the column numbers in the first line of a file
FNR==1{
    split("", cols) # This will re-init cols
    for(i=1;i<=NF;i++) {
        if($i in h) {
            cols[i]=1
        }
    }
    next
}
# Print those columns on all other lines
{
    res = ""
    for(i=1;i<=NF;i++) {
        if(i in cols) {
            s = res ? OFS : ""
            res = res "," $i
        }
    }
    if (res) {
        print res
    }
}
You can try this awk (MAWK Version 1.2)
Your data can be 5x5 or more
mawk -v OFS='\t' '
NR==1 {
    nbfield=(NF-1)
    for(i=1;i<NF;i++)
        ID[i]=$(i+1)
    print $1 OFS "ID" OFS "VALUE"
    next
}
{
    numrecord=((NR-1)%nbfield)
    numrecord = numrecord ? numrecord : nbfield
    for(i=0;i<=nbfield;i++)
        val[ID[i],numrecord]=$(i+1)
}
numrecord==nbfield {
    for(i=1;i<=nbfield;i++)
        for(j=1;j<=nbfield;j++)
            print val[ID[0],j] OFS ID[j] OFS val[ID[j],i]
}
' infile
Input:
-- ColOne ColTwo ColThr
RowOne A B C D E
RowTwo F G H I J
RowThr K L M N O
RowFor P Q R S T
RowFiv U V W X Y
Output:
RowNbr | ColNbr | RowColVal
------ | ------ | ---------
RowOne | ColOne | A
RowOne | ColTwo | B
RowOne | ColThr | C
RowTwo | ColOne | F
RowTwo | ColTwo | G
RowTwo | ColThr | H
RowThr | ColOne | K
RowThr | ColTwo | L
RowThr | ColThr | M
Pivot script:
# pivot a table
BEGIN {                                              # before processing input lines, emit output header
    OFS = " | "                                      # set the output field-separator
    fmtOutDtl = "%6s | %-6s | %-9s" "\n"             # set the output format for all detail lines: InpRowHdr, InpColHdr, InpVal
    fmtOutHdr = "%6s | ColNbr | RowColVal" "\n"      # set the output format for the header line
    strOutDiv = "------ | ------ | ---------"        # set the divider line
    print ""                                         # emit blank line before output
}                                                    # done with output header
NR == 1 {                                            # when we are on the input header line / the first row
    FldCnt = ( NF - 1 )                              # number of columns to process is number of fields on this row, except for the first val
    for( idxCol = 1; idxCol < NF; idxCol++ )         # scan col numbers after the first, ignoring the first val
        ColHds[ idxCol ] = $( idxCol + 1 )           # store the next col-val as this ColHdr
    printf( fmtOutHdr, "RowNbr" )                    # emit header line: RowNbr-header, input column headers
    print strOutDiv                                  # emit divider row after header line and before data lines
    next                                             # skip to the next input row
}                                                    # done with first input row
{                                                    # for each body input row
    RecNbr = ( ( NR - 1 ) % FldCnt )                 # get RecNum for this row: ( RecNum - 1 ) Mod [number of fields]: zero-based / 0..[number_of_cols-1]
    RecNbr = RecNbr ? RecNbr : FldCnt                # promote from zero-based to one-based: 0 => [number of fields]: one -based / 1..[number_of_cols ]
    for( idxCol = 0; idxCol <= FldCnt; idxCol++ )    # scan col numbers including the first
        Rws[ ColHds[ idxCol ], RecNbr ] = $( idxCol + 1 )  # store this row+col val in this Row position under this ColHdr
}                                                    # done with this body input row
RecNbr == FldCnt {                                   # when we are on the last input row that we are processing (lines beyond FldCnt are not emitted)
    for( idxCol = 1; idxCol <= FldCnt; idxCol++ ) {  # scan col numbers after the first
        for( idxRow = 1; idxRow <= FldCnt; idxRow++ ) {  # scan row numbers after the first, up to number of cols
            printf( fmtOutDtl \
                   ,Rws[ ColHds[ 0 ] , idxCol ] \
                   , ColHds[ idxRow ] \
                   ,Rws[ ColHds[ idxRow ] , idxCol ] )  # emit input rowHdr, colHdr, row+col val
        }                                            # done scanning row numbers
        print ""                                     # emit a blank line after each input row
    }                                                # done scanning col numbers
}                                                    # done with the last input row
END {                                                # after processing input lines
}                                                    # do nothing
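Assuming the commented script above is saved as pivot.awk (a filename of my choosing, the answer doesn't name it), it runs against the whitespace-separated sample like so:
awk -f pivot.awk infile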

Calculating sum of gradients with awk

I have a file that contains 4 columns such as:
A B C D
1 2 3 4
10 20 30 40
100 200 300 400
.
.
.
I can calculate the gradient of columns B to D versus A with commands like the following (shown here for column B):
awk 'NR>1{print $0,($2-b)/($1-a)}{a=$1;b=$2}' file
How can I print the sum of the gradients as the 5th column in the file? The result should be:
A B C D sum
1 2 3 4 1+2+3+4=10
10 20 30 40 (20-2)/(10-1)+(30-3)/(10-1)+(40-4)/(10-1)=9
100 200 300 400 (200-20)/(100-10)+(300-30)/(100-10)+(400-40)/(100-10)=9
.
.
.
awk 'NR == 1 { print $0, "sum"; next } { if (NR == 2) { sum = $1 + $2 + $3 + $4 } else { t = $1 - a; sum = ($2 - b) / t + ($3 - c) / t + ($4 - d) / t } print $0, sum; a = $1; b = $2; c = $3; d = $4 }' file
Output:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
With ... | column -t:
A B C D sum
1 2 3 4 10
10 20 30 40 9
100 200 300 400 9
Update:
#!/usr/bin/awk -f
NR == 1 {
    print $0, "sum"
    next
}
{
    sum = 0
    if (NR == 2) {
        for (i = 1; i <= NF; ++i)
            sum += $i
    } else {
        t = $1 - a[1]
        for (i = 2; i <= NF; ++i)
            sum += ($i - a[i]) / t
    }
    print $0, sum
    for (i = 1; i <= NF; ++i)
        a[i] = $i
}
Usage:
awk -f script.awk file
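Thanks to the #!/usr/bin/awk -f shebang, the script can also be made executable and run directly (assuming awk is installed at /usr/bin/awk):
chmod +x script.awk
./script.awk file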
If you apply the same logic to the first line of numbers as you do to the rest, taking the initial value of each column as 0, you get 9 as the result of the sum (as it was in your question originally). This approach uses a loop to accumulate the sum of the gradient from the second field up to the last one. It uses the fact that on the first time round, the uninitialised values in the array a evaluate to 0:
awk 'NR==1 { print $0, "sum"; next }
{
    s = 0
    for(i=2;i<=NF;++i) s += ($i-a[i])/($1-a[1])   # accumulate sum
    for(i=1;i<=NF;++i) a[i] = $i                  # fill array to be used for next iteration
    print $0, s
}' file
You can pack it all onto one line if you want but remember to separate the statements with semicolons. It's also slightly shorter to only use a single for loop with an if:
awk 'NR==1{print$0,"sum";next}{s=0;for(i=1;i<=NF;++i){if(i>1)s+=($i-a[i])/($1-a[1]);a[i]=$i}print$0,s}' file
Output:
A B C D sum
1 2 3 4 9
10 20 30 40 9
100 200 300 400 9

Sort values and output the indices of their sorted columns

I've got a file that looks like:
20 30 40
80 70 60
50 30 40
Each column represents a procedure. I want to know how the procedures did for each row. My ideal output would be
3 2 1
1 2 3
1 3 2
i.e. in row 1, the third column had the highest value, followed by the second, and the first had the smallest (the order can be reversed, it doesn't matter).
How would I do this?
I'd do it with some other Unix tools (read, cat, sort, cut, tr, sed, and bash of course):
while read line
do
cat -n <(echo "$line" | sed 's/ /\n/g') | sort -r -k +2 | cut -f1 | tr '\n' ' '
echo
done < input.txt
The output looks like this:
3 2 1
1 2 3
1 3 2
Another solution using Python:
$ python
Python 2.7.6 (default, Jan 26 2014, 17:25:18)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> with open('file.txt') as f:
... lis=[x.split() for x in f]
...
>>> for each in lis:
... each = [i[0] + 1 for i in sorted(enumerate(each), key=lambda x:x[1], reverse=True)]
... print ' '.join([str(item) for item in each])
...
3 2 1
1 2 3
1 3 2
Using Gnu Awk version 4:
$ awk 'BEGIN{ PROCINFO["sorted_in"]="#val_num_desc" }
{
split($0,a," ")
for (i in a) printf "%s%s", i,OFS
print ""
}' file
3 2 1
1 2 3
1 3 2
If you have GNU awk then you can do something like:
awk '{
y = a = x = j = i = 0;
delete tmp;
delete num;
delete ind;
for(i = 1; i <= NF; i++) {
num[$i, i] = i
}
x = asorti(num)
for(y = 1; y <= x; y++) {
split(num[y], tmp, SUBSEP)
ind[++j] = tmp[2]
}
for(a = x; a >= 1; a--) {
printf "%s%s", ind[a],(a==1?"\n":" ")
}
}' file
$ cat file
20 30 40
0.923913 0.913043 0.880435 0.858696 0.826087 0.902174 0.836957 0.880435
80 70 60
50 30 40
awk '{
y = a = x = j = i = 0;
delete tmp;
delete num;
delete ind;
for(i = 1; i <= NF; i++) {
num[$i, i] = i
}
x = asorti(num)
for(y = 1; y <= x; y++) {
split(num[y], tmp, SUBSEP)
ind[++j] = tmp[2]
}
for(a = x; a >= 1; a--) {
printf "%s%s", ind[a],(a==1?"\n":" ")
}
}' file
3 2 1
1 2 6 8 3 4 7 5
1 2 3
1 3 2
Solution via perl
#!/usr/bin/perl
open(FH,'<','/home/chidori/input.txt') or die "Can't open file$!\n";
while(my $line=<FH>){
    chomp($line);
    my @unsorted_array=split(/\s/,$line);
    my $count=scalar @unsorted_array;
    my @sorted_array = sort { $a <=> $b } @unsorted_array;
    my %hash=map{$_ => $count--} @sorted_array;
    foreach my $value(@unsorted_array){
        print "$hash{$value} ";
    }
    print "\n";
}

Grouping the rows of a text file based on 2 columns

I have a text file like this:
1 abc 2
1 rgt 2
1 yhj 2
3 gfk 4
5 kji 6
3 plo 4
3 vbn 4
5 olk 6
I want to group the rows on the basis of first and second column like this:
1 abc,rgt,yhj 2
3 gfk,plo,vbn 4
5 kji,olk 6
so that I can see the values of col2 for a particular pair of col1 and col3.
How can I do this using a shell script?
This should do it :
awk -F " " '{ a[$1" "$3]=a[$1" "$3]$2","; }END{ for (i in a)print i, a[i]; }' file.txt | sed 's/,$//g' | awk -F " " '{ tmp=$3;$3=$2;$2=tmp;print }' |sort
Just using awk:
#!/usr/bin/env awk -f
{
    k = $1 "\x1C" $3
    if (k in a2) {
        a2[k] = a2[k] "," $2
    } else {
        a1[k] = $1
        a2[k] = $2
        a3[k] = $3
        b[++i] = k
    }
}
END {
    for (j = 1; j <= i; ++j) {
        k = b[j]
        print a1[k], a2[k], a3[k]
    }
}
One line:
awk '{k=$1"\x1C"$3;if(k in a2){a2[k]=a2[k]","$2}else{a1[k]=$1;a2[k]=$2;a3[k]=$3;b[++i]=k}}END{for(j=1;j<=i;++j){k=b[j];print a1[k],a2[k],a3[k]}}' file
Output:
1 abc,rgt,yhj 2
3 gfk,plo,vbn 4
5 kji,olk 6
