AWK - How to do the sorting?

In awk, how can I do the following? Given this input:
1303361997;15;67.067014
1303361997;5;51.529837
1303361997;14;47.036197
1303361997;3;44.064681
1303361997;6;37.632831
1303361997;23;24.990078
1303361997;24;26.750984
1303361998;15;67.074100
1303361998;5;51.522981
1303361998;14;47.028185
1303361998;3;44.056715
1303361998;6;37.638584
1303361998;23;24.987800
1303361998;24;26.756648
When a number is absent from the second column, its slot should be replaced by zero in the output file.
The first field of each output line is the value of the first column. The value in the second column determines the position of the third-column value in the output line. The first column may begin with different values each time. Desired output, sorted by the first and second columns:
1303361997;0;0;44.064681;0;51.529837;37.632831;0;0;0;0;0;0;0;47.036197;67.067014;0;0;0;0;0;0;0;24.990078;26.750984;
1303361998;0;0;44.056715;0;51.522981;37.638584;0;0;0;0;0;0;0;47.028185;67.074100;0;0;0;0;0;0;0;24.987800;26.756648;

$ cat tst.awk
BEGIN { FS=";" }
NR == 1 {
    for (i=1; i<=2; i++) {
        min[i] = max[i] = $i
    }
}
{
    val[$1,$2] = $3
    keys[$1]    # referencing the element is enough to create the key
    for (i=1; i<=2; i++) {
        min[i] = ($i < min[i] ? $i : min[i])
        max[i] = ($i > max[i] ? $i : max[i])
    }
}
END {
    for (r=min[1]; r<=max[1]; r++) {
        if (r in keys) {
            printf "%d", r
            for (c=1; c<=max[2]; c++) {
                printf ";%s", ((r,c) in val ? val[r,c] : 0)
            }
            print ";"
        }
    }
}
$
$ cat file
1303361997;15;67.067014
1303361997;5;51.529837
1303361997;14;47.036197
1303361997;3;44.064681
1303361997;6;37.632831
1303361997;23;24.990078
1303361997;24;26.750984
1303361998;15;67.074100
1303361998;5;51.522981
1303361998;14;47.028185
1303361998;3;44.056715
1303361998;6;37.638584
1303361998;23;24.987800
1303361998;24;26.756648
$
$ awk -f tst.awk file
1303361997;0;0;44.064681;0;51.529837;37.632831;0;0;0;0;0;0;0;47.036197;67.067014;0;0;0;0;0;0;0;24.990078;26.750984;
1303361998;0;0;44.056715;0;51.522981;37.638584;0;0;0;0;0;0;0;47.028185;67.074100;0;0;0;0;0;0;0;24.987800;26.756648;
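Note that the END loop above walks every integer from min[1] to max[1], which could get slow if the first-column timestamps are sparse. Here's a sketch of an alternative that only visits keys that actually occur, using GNU awk's sorted array traversal (gawk-specific, and untested beyond the sample above):

$ cat tst2.awk
BEGIN { FS=";" }
{
    val[$1,$2] = $3
    keys[$1]                                 # creates the key as a side effect
    max2 = ($2 > max2 ? $2 : max2)
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"   # gawk-only: numeric key order
    for (r in keys) {
        printf "%d", r
        for (c=1; c<=max2; c++)
            printf ";%s", ((r,c) in val ? val[r,c] : 0)
        print ";"
    }
}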

Related

How to Split a Bash Array into Multiple columns in a Markdown table?

I am trying to split a Bash array into multiple columns in order to display as a table in a Markdown file.
I have searched around for a quick one-liner to do this using Bash, AWK and other languages. I know about the column command, but I can't save its output to a variable or file (it goes to stdout). I know you can loop over the array, extracting values into separate chunks, but there must be a quicker, more efficient way.
keywords.md
awk
accessibility
bash
behat
c++
cache
d3.js
dates
engineering
elasticsearch
...
columns.sh
data="$(sort "keywords.md")"   # read contents of file
data=($data)                   # split contents into an array
table="||||||\n"               # create markdown table header
table="${table}|---|---|---|---|---|"
numColumns=5
# split data into five columns and append to $table variable
I am trying to get this result.
||||||
|---|---|---|---|---|
|awk|bash|c++|d3.js|engineering
|accessibility|behat|cache|dates|elasticsearch
(screenshot of the result from the column command omitted)
Here's the general approach:
$ cat tst.awk
BEGIN {
    numCols = (numCols ? numCols : 5)
    OFS = "|"
}
{
    colNr = (NR - 1) % numCols + 1
    if ( colNr == 1 ) {
        numRows++
    }
    vals[numRows,colNr] = $0
}
END {
    hdr2 = OFS
    for (colNr=1; colNr<=numCols; colNr++) {
        hdr2 = hdr2 "---" OFS
    }
    hdr1 = hdr2
    gsub(/-/,"",hdr1)
    print hdr1 ORS hdr2
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        printf "|"
        for (colNr=1; colNr<=numCols; colNr++) {
            val = vals[rowNr,colNr]
            printf "%s%s", val, (colNr<numCols ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
||||||
|---|---|---|---|---|
|awk|accessibility|bash|behat|c++
|cache|d3.js|dates|engineering|elasticsearch
But it obviously doesn't output the columns in the order you asked for in your question, as I don't understand how you arrive at that order.
Here's a perl version that prints out the values going down by column like in your sample desired output:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;

my $ncolumns = 5;

# Read the list of values.
my @data;
while (<>) {
    chomp;
    push @data, $_;
}

# Partition the data up into rows, added down by column
my @columns;
my $nrows = @data / $ncolumns;
#@data = sort { $a cmp $b } @data;
while (@data) {
    my @c = splice @data, 0, $nrows;
    for my $n (0 .. $#c) {
        push @{$columns[$n]}, $c[$n];
    }
}

# And print them out
say '|' x ($ncolumns + 1);
say '|', join('|', ('---') x $ncolumns), '|';
for my $r (0 .. $nrows - 1) {
    my @row;
    for my $c (0 .. $ncolumns - 1) {
        my $item = $columns[$r]->[$c];
        push @row, $item if defined $item;
    }
    push @row, ('') x $ncolumns;
    say '|', join('|', @row[0 .. $ncolumns - 1]);
}
Usage:
$ ./table.pl keywords.md
||||||
|---|---|---|---|---|
|awk|bash|c++|d3.js|engineering
|accessibility|behat|cache|dates|elasticsearch
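For completeness, the same down-by-column layout can be sketched in awk as well (a rough equivalent of the perl version above, untested beyond the sample data, and it assumes the item count divides evenly into the columns like the 10-keyword sample):

awk -v numCols=5 '
    { vals[NR] = $0 }
    END {
        numRows = int((NR + numCols - 1) / numCols)
        hdr = sep = ""
        for (c=1; c<=numCols; c++) { hdr = hdr "|"; sep = sep "|---" }
        print hdr "|"
        print sep "|"
        for (r=1; r<=numRows; r++) {
            row = ""
            for (c=1; c<=numCols; c++)
                row = row "|" vals[(c-1)*numRows + r]   # walk down each column
            print row
        }
    }' keywords.md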

Reading 3 Delimited Files passed as Parameters to an AWK Script, parsing the 3 Files, and storing them in 3 different arrays

File 1:
A|sam|2456|8901
B|kam|5678|9000
C|pot|4567|8000
File 2:
X|ter|2456|8901
Y|mar|5678|9000
Z|poi|4567|8000
File 3:
Column1|Column2|Column3|Coumn4
Now I want these 3 files to be passed as parameters to the GNU awk script as below:
awk -f script.awk file1 file2 file3
The script I have written is able to handle only 2 files, but not the 3rd file. Please help.
script.awk
BEGIN { # set up field separator and sorting:
    FS=OFS="|"
    PROCINFO["sorted_in"]="@ind_str_asc"
}
# skip header lines
FNR == 1 { next }
# store first file
(FNR==NR) {
    f1[$5]=$0
    # skip processing of other rules and
    # read the next line from input
    next
}
# store second file
{
    f2[$5]=$0
    if( ! ($5 in f1)) {
        f1[$5] = ""
    }
}
END {
    for( k in f1) {
        split( f1[k], arr1, "|")
        for( c = 1; c <= length( f1[ k ] ); c++ ) {
            print arr1[c]
        }
    }
    for( k in f2) {
        split( f2[k], arr2, "|")
        for( c = 1; c <= length( f2[ k ] ); c++ ) {
            print arr2[c]
        }
    }
}
My objective is to read the 3rd file in the same code and print it in a similar way to how printing is handled in the above code.
Note: it would be good if the answer could keep a similar code structure to the above and just add the reading and printing of the 3rd file.
Your existing code is more complicated than it has to be. It could be written as just:
BEGIN { # set up field separator and sorting:
    ...
}
# skip header lines
FNR == 1 { next }
ARGIND==1 { f1[$5]=$0; next }
ARGIND==2 { f2[$5]=$0; f1[$5] }
END {
    ...
}
I assume you can see the obvious extension to add a 3rd file. The above requires GNU awk for ARGIND and PROCINFO[] which you're already using.
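For reference, that extension might look like this (a sketch only; it keeps the $5 key from your original code and assumes the 3rd file follows the same layout):

BEGIN { # set up field separator and sorting:
    FS=OFS="|"
    PROCINFO["sorted_in"]="@ind_str_asc"
}
# skip header lines
FNR == 1 { next }
ARGIND==1 { f1[$5]=$0; next }
ARGIND==2 { f2[$5]=$0; f1[$5]; next }
ARGIND==3 { f3[$5]=$0; f1[$5] }
END {
    # print f1, f2 and f3 here, as in your original END block
    ...
}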
You can use ARGV array to process multiple files like this:
function disp() {
    for (i=1; i<=NF; i++)
        print FILENAME " :: " FNR " :: " $i
    print ""
}
BEGIN { # set up field separator and sorting:
    FS=OFS="|"
    PROCINFO["sorted_in"]="@ind_str_asc"
}
# process first file
ARGV[1] == FILENAME {
    disp()
}
# process second file
ARGV[2] == FILENAME {
    disp()
}
# process third file
ARGV[3] == FILENAME {
    disp()
}
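Invoked as awk -f script.awk file1 file2 file3, each field is printed with its file name and record number; for the first line of File 1 the output would look roughly like this (a hypothetical run, assuming the sample files above are saved as file1, file2 and file3):

file1 :: 1 :: A
file1 :: 1 :: sam
file1 :: 1 :: 2456
file1 :: 1 :: 8901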

Aggregating a CSV file in a bash script

I have a CSV file with multiple lines. Each line has the same number of columns. What I need to do is group those lines by a few specified columns and aggregate data from the other columns. Example of input file:
proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2
For the above example I need to group lines by the first two columns. From the 3rd column I need the min value, from the 4th column the max value, and the 5th column should hold the sum. So, for that input file I need this output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
I need to process it in bash (I can use awk or sed as well).
With bash and sort:
#!/bin/bash
# create associative arrays
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
declare -A p ds de # date start and date end
declare -A -i sum # set integer attribute
# function to convert 5-Jun-2011 to 20110605
date2num() { local d m y; IFS="-" read -r d m y <<< "$1"; printf "%d%.2d%.2d\n" $y ${month2num[$m]} $d; }
# read all columns to variables p1 p2 d1 d2 s
while IFS="," read -r p1 p2 d1 d2 s; do
    # if associative array is still empty for this entry
    # fill with current strings/value
    if [[ -z ${p[$p1,$p2]} ]]; then
        p[$p1,$p2]="$p1,$p2"
        ds[$p1,$p2]="$d1"
        de[$p1,$p2]="$d2"
        sum[$p1,$p2]="$s"
        continue
    fi
    # compare strings, set new strings and sum value
    if [[ ${p[$p1,$p2]} == "$p1,$p2" ]]; then
        [[ $(date2num "$d1") < $(date2num ${ds[$p1,$p2]}) ]] && ds[$p1,$p2]="$d1"
        [[ $(date2num "$d2") > $(date2num ${de[$p1,$p2]}) ]] && de[$p1,$p2]="$d2"
        sum[$p1,$p2]=sum[$p1,$p2]+s
    fi
done < file
# print contents of all associative arrays, keyed by associative array p
for i in "${!p[@]}"; do echo "${p[$i]},${ds[$i]},${de[$i]},${sum[$i]}"; done
Usage: ./script.sh | sort
Output to stdout:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
See: help declare, help read and of course man bash
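As a quick sanity check of date2num (a hypothetical interactive session, assuming the function and the month2num array have been loaded into the current shell):

$ date2num 5-May-2011
20110505
$ date2num 15-Oct-2017
20171015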
With awk + sort
awk -F',|-' '
BEGIN{
    A["Jan"]="01"
    A["Feb"]="02"
    A["Mar"]="03"
    A["Apr"]="04"
    A["May"]="05"
    A["Jun"]="06"
    A["Jul"]="07"
    A["Aug"]="08"
    A["Sep"]="09"
    A["Oct"]="10"
    A["Nov"]="11"
    A["Dec"]="12"
}
{
    B[$1","$2]=B[$1","$2]+$9
    z=sprintf("%.2d",$3)
    y=sprintf("%s",$5 A[$4] z)
    if(!start[$1$2])
    {
        end[$1$2]=0
        start[$1$2]=99999999
    }
    if (y < start[$1$2])
    {
        start[$1$2]=y
        C[$1","$2]=$3"-"$4"-"$5
    }
    x=sprintf("%.2d",$6)
    w=sprintf("%s",$8 A[$7] x)
    if(w > end[$1$2] )
    {
        end[$1$2]=w
        D[$1","$2]=$6"-"$7"-"$8
    }
}
END{
    for (i in B) print i "," C[i] "," D[i] "," B[i]
}
' infile | sort
Extended GNU awk solution:
awk -F, 'function parse_date(d_str){
    split(d_str, d, "-");
    t = mktime(sprintf("%d %d %d 00 00 00", d[3], m[d[2]], d[1]));
    return t
}
BEGIN{ m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4; m["May"]=5; m["Jun"]=6;
       m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12;
}
{
    k=$1 SUBSEP $2;
    if (k in a){
        if (parse_date(a[k]["min"]) > parse_date($3)) { a[k]["min"]=$3 }
        if (parse_date(a[k]["max"]) < parse_date($4)) { a[k]["max"]=$4 }
    } else {
        a[k]["min"]=$3; a[k]["max"]=$4
    }
    a[k]["sum"]+= $5
}
END{
    for (i in a) {
        split(i, j, SUBSEP);
        print j[1], j[2], a[i]["min"], a[i]["max"], a[i]["sum"]
    }
}' OFS=',' file
The output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
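Note the a[k]["min"] arrays-of-arrays syntax requires gawk 4.0+. Here's a sketch of the same idea with plain SUBSEP-style keys that should also run on a POSIX awk, with no mktime dependency (string comparison is safe here because the dates are normalized to fixed-width YYYYMMDD):

awk -F, -v OFS=',' '
    BEGIN {
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", mon, " ")
        for (i in mon) m[mon[i]] = i        # month name -> number
    }
    function d2n(s,  d) {                   # 5-May-2011 -> 20110505
        split(s, d, "-")
        return sprintf("%04d%02d%02d", d[3], m[d[2]], d[1])
    }
    {
        k = $1 OFS $2
        if (!(k in min) || d2n($3) < d2n(min[k])) min[k] = $3
        if (!(k in max) || d2n($4) > d2n(max[k])) max[k] = $4
        sum[k] += $5
    }
    END {
        for (k in min) print k, min[k], max[k], sum[k]
    }' file | sort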

Need an Algorithm to create auto code from a CSV file

I have "N" columns in a CSV file, say Hardware, Sensors, Statistics (1, 2, 3, ..., N), as shown below. (Table omitted in the original post.)
Each column has a unique XML code that I need to generate with respect to the above table content.
<Hardware A>
<Sensors sen1>
<Stat1>Mean</Stat1>
<Stat2>Avg</Stat2>
<Stat3>Slope</Stat3>
</Sensors sen1>
<Sensors sen2>
<Stat1>Min</Stat1>
<Stat2>Max</Stat2>
<Stat3>Mean</Stat3>
</Sensors sen2>
....
....
</Hardware A>
I need to generate code similar to the above with respect to the table. Can anybody suggest an algorithm to implement this structure using a shell script?
It'd be something like this in awk (untested obviously since you didn't provide testable sample input/output):
BEGIN { FS=","; fmt="%s %s>\n" }
NR==1 {
    for (i=1; i<=NF; i++) {
        tagName[i] = $i
    }
    next
}
$1 != "" {
    if (prev != "") {
        printf "</"fmt, tagName[1], prev
    }
    printf "<"fmt, tagName[1], $1
    prev = $1
}
{
    printf " <"fmt, tagName[2], $2
    for (i=3; i<=NF; i++) {
        printf "  <%s>%s</%s>\n", tagName[i], $i, tagName[i]
    }
    printf " </"fmt, tagName[2], $2
}
END {
    if (prev != "") {
        printf "</"fmt, tagName[1], prev
    }
}
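Since no sample input was provided, here's a hypothetical CSV reconstructed from the XML in the question to illustrate the expected shape (the column names and the blank first field on continuation rows are assumptions):

$ cat file
Hardware,Sensors,Stat1,Stat2,Stat3
A,sen1,Mean,Avg,Slope
,sen2,Min,Max,Mean
$ awk -f tst.awk file
<Hardware A>
 <Sensors sen1>
  <Stat1>Mean</Stat1>
  <Stat2>Avg</Stat2>
  <Stat3>Slope</Stat3>
 </Sensors sen1>
 <Sensors sen2>
  <Stat1>Min</Stat1>
  <Stat2>Max</Stat2>
  <Stat3>Mean</Stat3>
 </Sensors sen2>
</Hardware A>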

How to use awk (or anything else) to count the number of shared x values between 2 different y values in a CSV file consisting of columns x and y?

Let me be specific. We have a CSV file consisting of 2 columns, x and y, like this:
x,y
1h,a2
2e,a2
4f,a2
7v,a2
1h,b6
4f,b6
4f,c9
7v,c9
...
And we want to count how many shared x values two y values have, which means we want to get this:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
And b6,a2,2 should not show up. Does anyone know how to do this with awk? Or anything else?
Thanks ahead!
Try this executable awk script:
#!/usr/bin/awk -f
BEGIN { FS=OFS="," }
NR==1 { print "y1" OFS "y2" OFS "share" }
NR>1  { last=a[$1]; a[$1]=(last!="" ? last "," : "") $2 }
END {
    for(i in a) {
        cnt = split(a[i], arr, FS)
        if( cnt>1 ) {
            for(k=1; k<cnt; k++) {
                for(j=k+1; j<=cnt; j++) {
                    if( arr[k] != arr[j] ) {
                        key=arr[k] OFS arr[j]
                        if(out[key]=="") { order[++ocnt]=key }
                        out[key]++
                    }
                }
            }
        }
    }
    for(i=1; i<=ocnt; i++) {
        print order[i] OFS out[order[i]]
    }
}
When put into a file called awko and made executable, running it like awko data yields:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
I'm assuming the file is sorted by y values in the second column, as in the question (after the header). If it works for you, I'll add some explanations tomorrow.
Additionally, for anyone who wants more test data, here's a silly executable awk script for generating data similar to what's in the question. It makes about 10K lines when run as ./gen.awk.
#!/usr/bin/awk -f
function randInt(max) {
    return( int(rand()*max)+1 )
}
BEGIN {
    a[1]="a"; a[2]="b"; a[3]="c"; a[4]="d"; a[5]="e"; a[6]="f"
    a[7]="g"; a[8]="h"; a[9]="i"; a[10]="j"; a[11]="k"; a[12]="l"
    a[13]="m"; a[14]="n"; a[15]="o"; a[16]="p"; a[17]="q"; a[18]="r"
    a[19]="s"; a[20]="t"; a[21]="u"; a[22]="v"; a[23]="w"; a[24]="x"
    a[25]="y"; a[26]="z"
    print "x,y"
    for(i=1; i<=26; i++) {
        amultiplier = randInt(1000) # vary this to change the output size
        r = randInt(amultiplier)
        anum = 1
        for(j=1; j<=amultiplier; j++) {
            if( j == r ) { anum++; r = randInt(amultiplier) }
            print a[randInt(26)] randInt(5) "," a[i] anum
        }
    }
}
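A hypothetical session tying the two scripts together (the file names are assumptions):

$ chmod +x gen.awk awko
$ ./gen.awk > data
$ ./awko data | head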
I think if you can get the input into a form like this, it's easy:
1h a2 b6
2e a2
4f a2 b6 c9
7v a2 c9
In fact, you don't even need the x value. You can convert this:
a2 b6
a2
a2 b6 c9
a2 c9
Into this:
a2,b6
a2,b6
a2,c9
a2,c9
That output can be sorted and piped to uniq -c to get approximately the output you want, so we only need to think much about how to get from your input to the first and second states. Once we have those, the final step is easy.
Step one:
sort /tmp/values.csv \
| awk '
    BEGIN { FS="," }
    {
        if (x != $1) {
            if (x) print values
            x = $1
            values = $2
        } else {
            values = values " " $2
        }
    }
    END { print values }
'
Step two:
| awk '
    {
        for (i = 1; i < NF; ++i) {
            for (j = i+1; j <= NF; ++j) {
                print $i "," $j
            }
        }
    }
'
Step three:
| sort | awk '
    BEGIN {
        combination = $0
        print "y1,y2,share"
    }
    {
        if (combination == $0) {
            count = count + 1
        } else {
            if (count) print combination "," count
            count = 1
            combination = $0
        }
    }
    END { print combination "," count }
'
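Putting it all together, steps two and three can also be collapsed into sort | uniq -c, as mentioned above; here's a sketch of the whole pipeline (assuming the x,y header line has been stripped from values.csv first):

sort /tmp/values.csv \
| awk '
    BEGIN { FS="," }
    {
        if (x != $1) {
            if (x) print values
            x = $1
            values = $2
        } else {
            values = values " " $2
        }
    }
    END { print values }
' \
| awk '
    {
        for (i = 1; i < NF; ++i)
            for (j = i+1; j <= NF; ++j)
                print $i "," $j
    }
' \
| sort | uniq -c \
| awk 'BEGIN { print "y1,y2,share" } { print $2 "," $1 }'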
This awk script does the job:
BEGIN { FS=OFS="," }
NR==1 { print "y1","y2","share" }
NR>1  { ++seen[$1,$2]; ++x[$1]; ++y[$2] }
END {
    for (y1 in y) {
        for (y2 in y) {
            if (y1 != y2 && !(y2 SUBSEP y1 in c)) {
                for (i in x) {
                    if (seen[i,y1] && seen[i,y2]) {
                        ++c[y1,y2]
                    }
                }
            }
        }
    }
    for (key in c) {
        split(key, a, SUBSEP)
        print a[1],a[2],c[key]
    }
}
Loop through the input, recording both the original elements and the combinations. Once the file has been processed, look at each pair of y values. The if statement does two things: it prevents equal y values from being compared and it saves looping through the x values twice for every pair. Shared values are stored in c.
Once the shared values have been aggregated, the final output is printed.
This bash script using sed does the trick:
#!/bin/bash
echo y1,y2,share
x=$(wc -l < file)
b=$(echo "$x - 2" | bc)
index=0
for i in $(eval echo "{2..$b}")
do
    var_x_1=$(sed -n ''"$i"p'' file | sed 's/,.*//')
    var_y_1=$(sed -n ''"$i"p'' file | sed 's/.*,//')
    a=$(echo "$i + 1" | bc)
    for j in $(eval echo "{$a..$x}")
    do
        var_x_2=$(sed -n ''"$j"p'' file | sed 's/,.*//')
        var_y_2=$(sed -n ''"$j"p'' file | sed 's/.*,//')
        if [ "$var_x_1" = "$var_x_2" ] ; then
            array[$index]=$var_y_1,$var_y_2
            index=$(echo "$index + 1" | bc)
        fi
    done
done
counter=1
for (( k=1; k<$index; k++ ))
do
    if [ ${array[k]} = ${array[k-1]} ] ; then
        counter=$(echo "$counter + 1" | bc)
    else
        echo ${array[k-1]},$counter
        counter=1
    fi
    if [ "$k" = $(echo "$index-1"|bc) ] && [ $counter = 1 ]; then
        echo ${array[k]},$counter
    fi
done
