Can text be sorted twice? - bash

I have an awk array that aggregates bytes uploaded and downloaded. I can sort the output by either bytes down or bytes up and pipe that to head for the top talkers; is it possible to output two sorts using different keys?
zgrep ^1 20211014T00*.gz|awk '{print$3,$11,$6,$(NF-7)}'| awk 'NR>1{bytesDown[$1 " " $2]+=$3;bytesUp[$1 " " $2]+=$4} END {for(i in bytesDown) print bytesDown[i], bytesUp[i], i}'|sort -rn|head
Rather than parsing the source again to get the top uploads, I would like to be able to output the array again to "sort -rnk2|head".
I can see how I'd do it with a scratch file but is it possible/desirable to do it in memory? It's a bash shell on a 2 CPU Linux VM with 4GB of memory.

Your question isn't clear and there's no sample input/output to test with, but this MAY be what you're trying to do:
zgrep '^1' 20211014T00*.gz |
awk '
    NR > 1 {
        key = $3 " " $11
        bytesDown[key] += $6
        bytesUp[key] += $(NF-7)
    }
    END {
        cmd = "sort -rn | head"
        for ( key in bytesDown ) {
            print bytesDown[key], bytesUp[key], key | cmd
        }
        close(cmd)

        cmd = "sort -rnk2 | head"
        for ( key in bytesDown ) {
            print bytesDown[key], bytesUp[key], key | cmd
        }
        close(cmd)
    }
'
or, if you only need the single top talker in each direction rather than the top 10, it can be written more concisely and efficiently without calling sort or head at all:
zgrep '^1' 20211014T00*.gz |
awk '
    NR > 1 {
        key = $3 " " $11
        bytesdown[key] += $6
        bytesup[key] += $(NF-7)
        if ( NR == 2 ) {
            max_bytesdown_key = key
            max_bytesup_key = key
        }
        else {
            if ( bytesdown[key] > bytesdown[max_bytesdown_key] ) {
                max_bytesdown_key = key
            }
            if ( bytesup[key] > bytesup[max_bytesup_key] ) {
                max_bytesup_key = key
            }
        }
    }
    END {
        print bytesdown[max_bytesdown_key], bytesup[max_bytesdown_key], max_bytesdown_key
        print bytesdown[max_bytesup_key], bytesup[max_bytesup_key], max_bytesup_key
    }
'

Bash allows you to do that with process substitutions. It's not clear what you expect it to do with the data; printing both results to standard output is unlikely to be useful, so I send each to a separate file for later inspection.
zgrep ^1 20211014T00*.gz |
awk '{print$3,$11,$6,$(NF-7)}' |
awk 'NR>1{bytesDown[$1 " " $2]+=$3;bytesUp[$1 " " $2]+=$4}
END {for(i in bytesDown) print bytesDown[i], bytesUp[i], i}' |
tee >(sort -rn | head >first) |
sort -rnk2 | head >second
The double Awks could easily be refactored to a single Awk script.
Something like this?
awk 'NR>1{bytesDown[$3 " " $11]+=$6;bytesUp[$3 " " $11]+=$(NF-7)}
END { for(i in bytesDown) print bytesDown[i], bytesUp[i], i }'
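Putting those together, a sketch of the full pipeline with the single Awk feeding both sorted outputs (same field positions as the original command; untested for lack of sample data):
zgrep ^1 20211014T00*.gz |
awk 'NR>1{bytesDown[$3 " " $11]+=$6;bytesUp[$3 " " $11]+=$(NF-7)}
     END {for(i in bytesDown) print bytesDown[i], bytesUp[i], i}' |
tee >(sort -rn | head >first) |
sort -rnk2 | head >second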

Related

convert table into comma separated in text file using bash

I have a text file like this:
+------------------+------------+----------+
| col_name | data_type | comment |
+------------------+------------+----------+
| _id | bigint | |
| starttime | string | |
+------------------+------------+----------+
How can I get a result like this using bash:
(_id bigint, starttime string )
So, just the column names and types.
# remove first 3 lines
sed -e '1,3d' < columnnames.txt > clean.txt
# remove first character from each line
sed -i 's/^.//' clean.txt
# remove last character from each line
sed -i 's/.$//' clean.txt
# remove certain characters
sed -i 's/[+|-]//g' clean.txt
# remove last line
sed -i '$ d' clean.txt
so this is what i have so far, if there is a better implementation let me know!
Something similar, using only awk:
awk -F ' *[|]' 'BEGIN {printf("(")} NR>3 && NF>1 {printf("%s%s%s", NR>4 ? "," : "", $2, $3)} END {printf(" )\n")}' columnnames.txt
# Set the field separator to vertical bar surrounded by any number of spaces.
# BEGIN and END blocks print the opening and closing parens
# The line between skips the header lines and any line starting with '+'
$ awk -F"[[:space:]]*[|][[[:space:]]*" '
BEGIN { printf "%s", "( "}
NR > 3 && $0 !~ /^[+]/ { printf("%s%s %s", c, $2, $3); c = ", " }
END { print " )" }' file
( _id bigint, starttime string )
$ awk -F'[| ]+' 'NR>3 && NF>1{v=v s $2" "$3; s=", "} END{print "("v")"}' file
(_id bigint, starttime string)
I would do this:
cat input.txt \
| tail -n +4 \
| awk -F'[^a-zA-Z_]+' '{ for(i=1;i<=NF;i++) { printf "%s ", $i }}'
It's a little bit shorter.
Another way to implement Diego Torres Milano's solution as a stand-alone awk program:
tableconvert
#!/usr/bin/env -S awk -f
BEGIN {
    FS = "[[:space:]]*[|][[:space:]]*"
    printf "%s", "( "
}
{
    if (FNR <= 3 || match($0, /^[+]/))
        next
    else {
        printf("%s%s %s", c, $2, $3)
        c = ", "
    }
}
END {
    print " )"
}
Make tableconvert an executable:
chmod +x tableconvert
Run tableconvert on intablefile.txt
./tableconvert intablefile.txt
( _id bigint, starttime string )
An added bonus is that using FNR instead of NR allows the awk program to process multiple input files given as arguments:
./tableconvert infile1.txt infile2.txt infile3.txt ...
A variation on the other answers: use awk with the field separator set to '|' with optional spaces on either side, take fields 2 and 3 as the wanted fields in each record, and format the output as described in the question, with the closing " )" provided in the END rule:
$ awk -F' *\\| *' '
NR>3 && $1~/^[+]/{exit} # exit condition first line w/^+
NR==4{$1=$1; printf "(%s %s", $2,$3} # 1st data record is 4
NR>4{$1=$1; printf ", %s %s", $2,$3} # process all remainng records
END{print " )"} # output closing " )"
' table
(_id bigint, starttime string )
(note: if you don't want the space before the closing ")", just remove it from the print in the END rule)
Rather than using a BEGIN rule, the first record of interest (NR==4) is used to provide the opening "(". Look things over and let me know if you have questions.

UNIX group by two values

I have a file with the following lines (values are separated by ";"):
dev_name;dev_type;soft
name1;ASR1;11.1
name2;ASR1;12.2
name3;ASR1;11.1
name4;ASR3;15.1
I know how to group them by one value, like the count of all ASRx, but how can I group them by two values, for example:
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
another awk
$ awk -F';' 'NR>1 {a[$2]; b[$3]; c[$2,$3]++}
END {for(k in a) {print k;
for(p in b)
if(c[k,p]) print "\t*"p,"-",c[k,p]}}' file
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
$ cat tst.awk
BEGIN { FS=";"; OFS=" - " }
NR==1 { next }
$2 != prev { prt(); prev=$2 }
{ cnt[$3]++ }
END { prt() }
function prt( soft) {
    if ( prev != "" ) {
        print prev
        for (soft in cnt) {
            print " *" soft, cnt[soft]
        }
        delete cnt
    }
}
$ awk -f tst.awk file
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
Or if you like pipes....
$ tail -n +2 file | cut -d';' -f2- | sort | uniq -c |
awk -F'[ ;]+' '{print ($3!=prev ? $3 ORS : "") " *" $4 " - " $2; prev=$3}'
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
try something like
awk -F ';' '
    NR==1 { next }
    { aRaw[$2 "-" $3]++ }
    END {
        n = asorti(aRaw, aVal)
        for (Val = 1; Val <= n; Val++) {
            split(aVal[Val], aTmp, /-/)
            if (aTmp[1] != Last) { Last = aTmp[1]; print Last }
            print " " aTmp[2] " " aRaw[aVal[Val]]
        }
    }
' YourFile
The key here is to use both fields in one array. The END part, which presents the values, takes more work than the aggregation itself.
Using Perl
$ cat bykub.txt
dev_name;dev_type;soft
name1;ASR1;11.1
name2;ASR1;12.2
name3;ASR1;11.1
name4;ASR3;15.1
$ perl -F";" -lane ' $kv{$F[1]}{$F[2]}++ if $.>1;END { while(($x,$y) = each(%kv)) { print $x;while(($p,$q) = each(%$y)){ print "\t\*$p - $q" }}}' bykub.txt
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
$
Yet Another Solution, this one using the always useful GNU datamash to count the groups:
$ datamash -t ';' --header-in -sg 2,3 count 3 < input.txt |
awk -F';' '$1 != curr { curr = $1; print $1 } { print "\t*" $2 " - " $3 }'
ASR1
*11.1 - 2
*12.2 - 1
ASR3
*15.1 - 1
I don't want to encourage lazy questions, but I wrote a solution, and I'm sure someone can point out improvements. I love posting answers on this site because I learn so much. :)
Apart from the calls to sed and sort, this uses only shell built-ins. That means using read, which is slow. If your file is large, I'd recommend rewriting the loop in awk or perl, but this will get the job done.
sed 1d groups | # strip the header
sort -t';' -k2,3 > group.srt # pre-sort to collect groupings
declare -i ctr=0 # initialize integer record counter
IFS=';' read x lastA lastB < group.srt # priming read for comparators
printf "$lastA\n\t*$lastB - " # priming print (assumes at least one record)
while IFS=';' read x a b                  # loop through the file
do  if [[ "$lastA" < "$a" ]]              # on every MAJOR change
    then printf "$ctr\n$a\n\t*$b - "      # print total, new MAJOR header and MINOR header
         lastA="$a"                       # update the MAJOR comparator
         lastB="$b"                       # update the MINOR comparator
         ctr=1                            # reset the counter
    elif [[ "$lastB" < "$b" ]]            # on every MINOR change
    then printf "$ctr\n\t*$b - "          # print total and MINOR header
         lastB="$b"                       # update the MINOR comparator here too
         ctr=1                            # reset the counter
    else (( ctr++ ))                      # otherwise increment
    fi
done < group.srt                          # feed read from sorted file
printf "$ctr\n" # print final group total at EOF

Add info to output -- obtained from a shell command execution

I have files containing indented lines such as:
table 't'
    field 'abc'
    field 'def' and #enabled=true
    field 'ghi'
table 'u'
I want to transform it to:
table 't'
    field 'abc' [info about ABC]
    field 'def' [info about DEF] and #enabled=true
    field 'ghi' [info about GHI]
table 'u'
where the string between brackets is obtained by calling a shell script (get-info, which fetches the definition of the terms 'abc', 'def' and 'ghi').
I tried with AWK (via the cmd | getline output mechanism):
awk '$1 == "field" {
$2 = substr($2, 2, length($2) - 2)
cmd = "get-info \"" $2 "\" 2>&1 | head -n 1" # results or error
while (cmd | getline output) {
print $0 " [" output "]";
}
close(cmd)
next
}
// { print $0 }'
but it does not respect the indentation!
How could I fulfil my wish?
Your version loses the indentation because assigning to $2 makes awk rebuild $0 with OFS (a single blank) between the fields. It looks like what you're trying to do would be:
$1 == "field" {
cmd = "get-info \"" substr($2,2,length($2)-2) "\" 2>&1" # results or error
if ( (cmd | getline output) > 0 ) {
sub(/^[[:space:]]*[^[:space:]]+[[:space:]]+[^[:space:]]+/,"& ["output"]")
}
close(cmd)
}
{ print }
Note you don't need the head -1, just don't read the output in a loop.
e.g.:
$ cat tst.awk
$1 == "field" {
cmd = "echo \"--->" substr($2,2,length($2)-2) "<---\" 2>&1"
if ( (cmd | getline output) > 0 ) {
sub(/^[[:space:]]*[^[:space:]]+[[:space:]]+[^[:space:]]+/,"& ["output"]")
}
close(cmd)
}
{ print }
$ awk -f tst.awk file
table 't'
    field 'abc' [--->abc<---]
    field 'def' [--->def<---] and #enabled=true
    field 'ghi' [--->ghi<---]
table 'u'
This is a rare occasion where use of getline is probably appropriate but make sure you read and understand all of the getline caveats at http://awk.info/?tip/getline if you're considering using getline again.

Sorting dates by groups

Here is a sample of my data with 4 columns and comma delimiter.
1,A,2009-01-01,2009-07-15
1,A,2009-07-10,2009-07-12
2,B,2009-01-01,2009-07-15
2,B,2009-07-10,2010-12-15
3,C,2009-01-01,2009-07-15
3,C,2009-07-15,2010-12-15
3,C,2010-12-15,2014-07-07
4,D,2009-06-01,2009-07-15
4,D,2009-07-21,2012-12-15
5,E,2011-04-23,2012-10-19
The first 2 columns are grouped. I want the minimum date from the third column, and the maximum date from the fourth column, for each group.
Then I will pick the first line for each first 2 column combination.
Desired output
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19
I have tried the following code, but it is not working. I get close, but not the max date.
cat exam |sort -t, -nk1 -k2,3 -k4,4r |sort -t, -uk1,2
Would prefer an easy one-liner like above.
sort datafile |
awk -F, -v OFS=, '
{key = $1 FS $2}
key != prev {prev = key; min[key] = $3}
{max[key] = ($4 > max[key]) ? $4 : max[key]}
END {for (key in min) print key, min[key], max[key]}
' |
sort
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19
When you pre-sort, you are guaranteed that the minimum col3 date will occur on the first line of a new group. Then you just need to find the maximum col4 date.
The final sort is required because iterating over the keys of an awk hash is unordered. You can do this sorting in (g)awk with:
END {
    n = asorti(min, sortedkeys)
    for (i=1; i<=n; i++)
        print sortedkeys[i], min[sortedkeys[i]], max[sortedkeys[i]]
}
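Plugging that END rule into the command above gives something like the following sketch (asorti requires GNU awk); the trailing sort can then be dropped:
sort datafile |
awk -F, -v OFS=, '
    { key = $1 FS $2 }
    key != prev { prev = key; min[key] = $3 }
    { max[key] = ($4 > max[key]) ? $4 : max[key] }
    END {
        n = asorti(min, sortedkeys)
        for (i = 1; i <= n; i++)
            print sortedkeys[i], min[sortedkeys[i]], max[sortedkeys[i]]
    }
'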
#!/usr/bin/awk -f
BEGIN { FS = OFS = "," }
{
    key = $1 FS $2
    if (!(key in a)) {
        a[key] = $3
        b[key] = $4
        keys[++k] = key
    } else {
        if ($3 < a[key]) a[key] = $3
        if ($4 > b[key]) b[key] = $4
    }
}
END {
    for (i = 1; i <= k; ++i) {
        key = keys[i]
        print key, a[key], b[key]
    }
}
Usage:
awk -f script.awk file
Output:
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19
You want a "one liner" ?
paste -d, \
<(cat exam|sort -t, -nk1,2 -k4 |cut -d, -f1-3) \
<(cat exam|sort -t, -nk1,2 -k4r |cut -d, -f4 ) |
uniq -w4
The key idea is to sort the data once by field 3 ascending and, independently, by field 4 descending. Then you just have to merge corresponding lines (cut and paste). Finally uniq keeps only the first row for each pair of identical first two columns. That is the weak point here, since uniq -w4 assumes the first 4 characters are enough for the comparison; you either have to adjust that to your needs, or normalize those two columns to a fixed width when using your actual data.
EDIT: A probably better option is to replace uniq by a simple awk filter:
paste -d, \
<(cat exam|sort -t, -nk1,2 -k4 |cut -d, -f1-3) \
<(cat exam|sort -t, -nk1,2 -k4r |cut -d, -f4 ) |
awk -F , '$1","$2 != last { print; last=$1","$2 }'
On my system (GNU Linux Debian Wheezy), both produce the same result:
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19

Doing math on the linux command line

I have a log file from a web server which looks like this:
1908 462
232 538
232 520
232 517
My task is to total column 1 and column 2 in a bash script. My desired output is:
2604 2037
I know of awk or sed which could go a long way to solving my problem but I can't fathom how to actually do it. I've trawled examples on Google but haven't turned up anything useful. Can someone point me in the right direction please?
awk '{a += $1; b += $2} END { print a " " b }' foo.log
(Note the complete lack of error checking.)
EDIT :
Ok, here's a version with error checking:
awk 'BEGIN { ok = 1 } { if (/^ *[0-9]+ +[0-9]+ *$/) { a += $1; b += $2 } else { ok = 0; exit 1 } } END { if (ok) print a, b }' foo.log
If you don't want to accept leading or trailing blanks, delete the two " *"s in the if statement.
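That stricter version would then be:
awk 'BEGIN { ok = 1 } { if (/^[0-9]+ +[0-9]+$/) { a += $1; b += $2 } else { ok = 0; exit 1 } } END { if (ok) print a, b }' foo.log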
But this is big enough that it probably shouldn't be a one-liner:
#!/usr/bin/awk -f
BEGIN {
    ok = 1
}
{
    if (/^ *[0-9]+ +[0-9]+ *$/) {
        a += $1
        b += $2
    }
    else {
        ok = 0
        exit 1
    }
}
END {
    if (ok) print a, b
}
There's still no overflow or underflow checking, and it assumes that there will be no signs. The latter is easy enough to fix; the former would be more difficult. (Note that awk uses floating-point internally; if the sum is big enough, it could quietly lose precision.)
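For example, accepting an optional leading sign on each number is just a change to the regular expression (a sketch; overflow remains unchecked):
awk 'BEGIN { ok = 1 } { if (/^ *[-+]?[0-9]+ +[-+]?[0-9]+ *$/) { a += $1; b += $2 } else { ok = 0; exit 1 } } END { if (ok) print a, b }' foo.log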
Try
awk '{a+=$1;b+=$2} END {print a, b}' file
Here is a non-awk alternative for you:
echo $( cut -f 1 -d " " log_file | tr '\n' + | xargs -I '{}' echo '{}'0 | bc ) $( cut -f 2 -d " " log_file | tr '\n' + | xargs -I '{}' echo '{}'0 | bc )
Make sure you replace log_file with your own file, and that the file does not contain any blank lines. If it does, filter them out first with a command like the following:
grep -v "^\s*$" log_file
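For example, the same pipeline with the filter applied first (a sketch):
echo $( grep -v "^\s*$" log_file | cut -f 1 -d " " | tr '\n' + | xargs -I '{}' echo '{}'0 | bc ) $( grep -v "^\s*$" log_file | cut -f 2 -d " " | tr '\n' + | xargs -I '{}' echo '{}'0 | bc )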
These might work for you:
sed ':a;N;s/ \(\S*\)\n\(\S*\) /+\2 \1+/;$!ba;s/ /\n/p;d' file | bc | paste -sd' '
or
echo $(cut -d' ' -f1 file | paste -sd+ | bc) $(cut -d' ' -f2 file| paste -sd+ |bc)
