How to define new commands or macros in awk - shell

I would like to define a new command that wraps an existing awk command, such as print. However, I do not want to use a function:
#wrap command with function
function warn(text) { print text > "/dev/stderr" }
NR%1e6 == 0 {
warn("processed rows: "NR)
}
Instead, I would like to define a new command that can be invoked without brackets:
#wrap command with new command ???
define warn rest... { print rest... > "/dev/stderr" }
NR%1e6 == 0 {
warn "processed rows: "NR
}
One solution I can imagine is using a preprocessor and maybe setting up the shebang of the awk script to invoke this preprocessor followed by awk. However, I was hoping for a pure awk solution.
Note: The solution should also work in mawk, which I use, because it is much faster than vanilla GNU/awk.
Update: The discussion revealed that gawk (GNU/awk) can be quite fast and mawk is not required.

You cannot do this within any awk, and you cannot do it robustly outside of awk without writing an awk language parser. By that point you may as well write your own awk-like command, which would no longer really be awk, since it would not behave the same as any other command by that name.
It is odd that you refer to GNU awk as "vanilla" when it has many more useful features than any other currently available awk while mawk is simply a stripped down awk optimized for speed which is only necessary in very rare circumstances.
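For illustration only, the fragile text-substitution route could be sketched in a few lines of shell. The awk_with_warn wrapper below is a hypothetical helper, not a feature of any awk. It uses sed to rewrite only those lines that begin with warn, so a warn inside a string, after a semicolon, or spread over several lines would not be handled. That limitation is why a robust version would need a real parser.

```shell
# Hypothetical wrapper (name and behaviour are assumptions for illustration).
# It rewrites lines of the form "warn EXPR" into a print to stderr and then
# passes the rewritten program to awk. It does not parse awk.
awk_with_warn() {
    prog=$1
    shift
    expanded=$(printf '%s\n' "$prog" |
        sed 's|^\([[:space:]]*\)warn[[:space:]][[:space:]]*\(.*\)$|\1print \2 > "/dev/stderr"|')
    awk "$expanded" "$@"
}

# Demo: report every second input row on standard error.
printf 'a\nb\nc\nd\n' | awk_with_warn '
NR % 2 == 0 {
    warn "processed rows: " NR
}'
```

The demo writes one line to standard error for rows 2 and 4 and nothing to standard output.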

Looking at Mawk's source I see that commands are special and cannot be added at runtime. From kw.c:
keywords[] =
{
{ "print", PRINT },
{ "printf", PRINTF },
{ "do", DO },
{ "while", WHILE },
{ "for", FOR },
{ "break", BREAK },
{ "continue", CONTINUE },
{ "if", IF },
{ "else", ELSE },
{ "in", IN },
{ "delete", DELETE },
{ "split", SPLIT },
{ "match", MATCH_FUNC },
{ "BEGIN", BEGIN },
{ "END", END },
{ "exit", EXIT },
{ "next", NEXT },
{ "nextfile", NEXTFILE },
{ "return", RETURN },
{ "getline", GETLINE },
{ "sub", SUB },
{ "gsub", GSUB },
{ "function", FUNCTION },
{ (char *) 0, 0 }
};
You could add a new command by patching Mawk's C code.

I created a shell wrapper script called cppawk which combines the C preprocessor (from GCC) with Awk.
BSD licensed, it comes with a man page, regression tests and simple install instructions.
Normally, the C preprocessor creates macros that look like functions; but using certain control flow tricks, which work in Awk also much as they do in C, we can pull off minor miracles of syntactic sugar:
function __warn(x)
{
print x
return 0
}
#define warn for (__w = 1; __w; __w = __warn(__x)) __x =
NR % 5 == 0 {
warn "processed rows: "NR
}
Run:
$ cppawk -f warn.cwk
a
b
c
d
e
processed rows: 5
f
g
h
i
j
processed rows: 10
k
Because the entire for trick is in a single line of code, we could use the __LINE__ symbol to make the hidden variables quasi-unique:
function __warn(x)
{
print x
return 0
}
#define xcat(a, b, c) a ## b ## c
#define cat(a, b, c) xcat(a, b, c)
#define uq(sym) cat(__, __LINE__, sym)
#define warn for (uq(w) = 1; uq(w); uq(w) = __warn(uq(x))) uq(x) =
NR % 5 == 0 {
warn "processed rows: "NR
}
The expansion is:
$ cppawk --prepro-only -f warn.cwk
# 1 "<stdin>"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "<stdin>"
function __warn(x)
{
print x
return 0
}
NR % 5 == 0 {
for (__13w = 1; __13w; __13w = __warn(__13x)) __13x = "processed rows: "NR
}
The uq() macro interpolated 13 into the variables because warn is called on line 13.
Hope you like it.
PS, maybe don't do this, but find some less hacky way of using cppawk.
You can use C99/GNUC variadic macros, for instance:
#define warn(...) print __VA_ARGS__ >> "/dev/stderr"
NR % 5 == 0 {
warn("processed rows:", NR)
}
We made a humble print wrapper which redirects to standard error. It seems like nothing, yet you can't do that with an Awk function: not without making it a one-argument function and passing the value of an expression which catenates everything.

Related

Compress ranges of ranges of numbers in bash

I have a csv file named "ranges.csv", which contains:
start_range,stop_range
9702220000,9702220999
9702222000,9702222999
9702223000,9702223999
9750000000,9750000999
9750001000,9750001999
9750002000,9750002999
I am trying to combine consecutive ranges, that is, rows whose start_range equals the previous row's stop_range plus 1, and output the result in another csv file named "ranges2.csv". So the output will be:
9702220000,9702220999
9702222000,9702223999
9750000000,9750002999
Moreover, I need to know how many original ranges each compressed range contains (example: for the new range 9750000000,9750002999 I need to know that before the compression there were 3 ranges). This information will help me to create a new csv file named "ranges3.csv" which should contain only the range with the most ranges inside it (the most comprehensive area):
9750000000,9750002999
I was thinking about something like this:
if (stop_range = start_range-1)
new_stop_range = start_range-1
But I am not very smart and I am new to bash scripting.
I know how to output the results in another file but the function for what I need gives me headaches.
I think this does the trick:
#!/bin/bash
awk '
BEGIN { FS = OFS = ","}
NR == 2 {
start = $1; stop = $2; i = 0  # i counts merges so far; ++i below turns it into the group size
}
NR > 2 {
if ($1 == (stop + 1)) {
i++;
stop = $2
} else {
if (++i > max) {
maxr = start "," stop;
max = i
}
start = $1
i = 0
}
stop = $2
}
END {
if (++i > max) {
maxr = start "," stop;
}
print maxr
}
' ranges.csv
Assuming your ranges are sorted, then this code gives you the merged ranges only:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){print b,e; b=e="" }
($1==e+1){ e=$2; next }
{ b=$1; e=$2 }
END { print b,e }' file
Below you get the same but with the range count:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){print b,e,c; b=e=c="" }
($1==e+1){ e=$2; c++; next }
{ b=$1; e=$2; c=1 }
END { print b,e,c }' file
If you want the largest one, you can sort on the third column. I don't want to make a rule to give the range with the most counts, as there might be multiple.
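As a rough sketch of the sort-based approach, the pipeline below merges the ranges, appends the count as a third column, sorts on that column in descending order and keeps the first line. The merge_ranges function name and the temporary file are assumptions made for this example. Unlike the one-liners above, it skips the header line explicitly, and if several runs tie for the maximum it reports only one of them.

```shell
# Merge adjacent ranges and append the number of original ranges in each run.
# Assumes the input is sorted and has a one-line header, as in the question.
merge_ranges() {
    awk 'BEGIN { FS = OFS = "," }
         FNR == 1 { next }                        # skip the header line
         c && $1 == e + 1 { e = $2; c++; next }   # row extends the current run
         c { print b, e, c }                      # run ended: emit it
         { b = $1; e = $2; c = 1 }                # start a new run
         END { if (c) print b, e, c }' "$1"
}

# Demo with the sample data from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
start_range,stop_range
9702220000,9702220999
9702222000,9702222999
9702223000,9702223999
9750000000,9750000999
9750001000,9750001999
9750002000,9750002999
EOF
merge_ranges "$tmp" | sort -t, -k3,3nr | head -n 1 | cut -d, -f1,2
rm -f "$tmp"
```

For the sample data this prints 9750000000,9750002999. Redirecting the output of merge_ranges to ranges2.csv and the final pipeline to ranges3.csv would produce the two files the question asks for.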
If you really only want all the ranges with the maximum merge:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){
a[c] = a[c] (a[c]?ORS:"") b OFS e
m=(c>m?c:m)
b=e=c=""
}
($1==e+1){ e=$2; c++; next }
{ b=$1; e=$2; c=1 }
END { a[c] = a[c] (a[c]?ORS:"") b OFS e
m=(c>m?c:m)
print a[m]
}' file

How do I make a list of missing integers from a sequence using bash

I have a file let's say files_190911.csv whose contents are as follows.
EDR_MPU023_09_20190911080534.csv.gz
EDR_MPU023_10_20190911081301.csv.gz
EDR_MPU023_11_20190911083544.csv.gz
EDR_MPU023_14_20190911091405.csv.gz
EDR_MPU023_15_20190911105513.csv.gz
EDR_MPU023_16_20190911105911.csv.gz
EDR_MPU024_50_20190911235332.csv.gz
EDR_MPU024_51_20190911235400.csv.gz
EDR_MPU024_52_20190911235501.csv.gz
EDR_MPU024_54_20190911235805.csv.gz
EDR_MPU024_55_20190911235937.csv.gz
EDR_MPU025_24_20190911000050.csv.gz
EDR_MPU025_25_20190911000155.csv.gz
EDR_MPU025_26_20190911000302.csv.gz
EDR_MPU025_29_20190911000624.csv.gz
I want to make a list of the missing sequence numbers from those using a bash script.
Every MPUXXX has its own sequence, so there are multiple series of sequences in that file.
The datetime for each missing entry is taken from the previous existing entry in its sequence.
From the sample above, the result will be like this.
EDR_MPU023_12_20190911083544.csv.gz
EDR_MPU023_13_20190911083544.csv.gz
EDR_MPU024_53_20190911235501.csv.gz
EDR_MPU025_27_20190911000302.csv.gz
EDR_MPU025_28_20190911000302.csv.gz
It would be simpler if there were only a single sequence.
So I can use something like this.
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}'
But I know this can't be used for multiple sequence.
EDITED (Thanks @Cyrus!)
AWK is your friend:
#!/usr/bin/awk -f
BEGIN {
FS="[^0-9]+"
last_seq = 0;
next_serial = 0;
}
{
cur_seq = $2;
cur_serial = $3;
if (cur_seq != last_seq) {
last_seq = cur_seq;
ts = $4
prev = cur_serial;
} else {
if (cur_serial == next_serial) {
ts = $4;
} else {
for (i = next_serial; i < cur_serial; i++) {
print "EDR_MPU" last_seq "_" i "_" ts ".csv.gz"
}
ts = $4;  # remember this timestamp so a later gap uses the latest previous entry
}
}
next_serial = cur_serial + 1;
}
And then you do:
$ < files_190911.csv awk -f script.awk
EDR_MPU023_12_20190911083544.csv.gz
EDR_MPU023_13_20190911083544.csv.gz
EDR_MPU024_53_20190911235501.csv.gz
EDR_MPU025_27_20190911000302.csv.gz
EDR_MPU025_28_20190911000302.csv.gz
The FS assignment makes awk split each line on the regex, so the numeric parts of the file name become fields. The rest of the program detects holes in the sequences and prints them with the appropriate timestamp.
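To see what that field splitting produces, a small sketch can print the fields for one file name. The split_name function and the mpu/serial/ts labels are made up for this illustration. The separator is written here as [^0-9]+, meaning one or more non-digit characters. Because the name starts with non-digits, $1 is empty and the useful values land in $2, $3 and $4.

```shell
# Hypothetical helper: show which fields the numeric parts of a name land in.
split_name() {
    printf '%s\n' "$1" |
        awk -F'[^0-9]+' '{ print "mpu=" $2, "serial=" $3, "ts=" $4 }'
}

split_name EDR_MPU023_09_20190911080534.csv.gz
# prints: mpu=023 serial=09 ts=20190911080534
```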

golang flag hidden options from print defaults

This is my current code:
package main
import (
"flag"
)
var loadList = ""
var threads = 50
var skip = 0
func main() {
//defaults variables
flag.StringVar(&loadList, "f", "", "load links list file (required)")
flag.IntVar(&threads,"t", 50, "run `N` attempts in parallel threads")
flag.IntVar(&skip, "l", 0, "skip first `n` lines of input")
flag.Parse()
flag.PrintDefaults()
}
And this is the output:
-f string
load links list file (required)
-l n
skip first n lines of input
-t N
run N attempts in parallel threads (default 50)
I want to hide -l and -t from the PrintDefaults output. How can I do this?
There might be multiple ways of doing this. An easy one would be to use VisitAll:
func VisitAll(fn func(*Flag))
In the function you pass you can decide whether or not to output a flag based on any of the exported fields of Flag.
Example:
flag.VisitAll(func(f *flag.Flag) {
if f.Name == "l" || f.Name == "t" {
return
}
fmt.Println("Flag: ", f)
})
Run it at: https://play.golang.org/p/rsrKgWeAQf

awk: next is illegal inside a function

I have a short shell function to convert human readable byte units into an integer of bytes, so, e.g.,
10m to 10000000
4kb to 4000
1kib to 1024
2gib to 2147483648
Here is the code:
dehumanise() {
for v in "$@"
do
echo $v | awk \
'BEGIN{IGNORECASE = 1}
function printpower(n,b,p) {printf "%u\n", n*b^p; next}
/[0-9]$/{print $1;next};
/K(B)?$/{ printpower($1, 10, 3)};
/M(B)?$/{ printpower($1, 10, 6)};
/G(B)?$/{ printpower($1, 10, 9)};
/T(B)?$/{ printpower($1, 10, 12)};
/Ki(B)?$/{printpower($1, 2, 10)};
/Mi(B)?$/{printpower($1, 2, 20)};
/Gi(B)?$/{printpower($1, 2, 30)};
/Ti(B)?$/{printpower($1, 2, 40)}'
done
}
I found the code also somewhere on the internet and I am not so confident with awk. The function worked fine until I re-installed my MacBook a few days ago. Now it throws an error
awk: next is illegal inside a function at source line 2 in function printpower
context is
function printpower(n,b,p) {printf "%u\n", n*b^p; >>> next} <<<
As far as I understand, next is used in awk to directly end the record. Hence in this case it would end the awk statement as it only has one input.
I tried to move the next statement simply behind printpower(...);next.
But this causes the function to give no output at all.
Could someone please help me repair the awk statement?
# awk --version
awk version 20121220
macOS awk version
solved
The no output thing was probably an issue with the macOS awk version. I installed and replaced it with gawk:
brew install gawk
brew link --overwrite gawk
Now it works fine without the next statement.
Software design fundamentals - avoid inversion of control. In this case you don't want some subordinate function suddenly taking charge of your whole processing control flow and IT deciding "screw you all, I'm deciding to jump to the next record". So yes, don't put next inside a function! Having said that, POSIX doesn't say you cannot use next in a function but neither does it explicitly say you can so some awk implementations (apparently the one you are using) have decided to disallow it while gawk and some other awks allow it.
You also have gawk-specific code in your script (IGNORECASE) so it will ONLY work with gawk anyway.
Here's how to really write your script to work in any awk:
awk '
{ $0=tolower($0); b=p=0 }
/[0-9]$/ { b = 1; p = 1 }
/kb?$/ { b = 10; p = 3 }
/mb?$/ { b = 10; p = 6 }
/gb?$/ { b = 10; p = 9 }
/tb?$/ { b = 10; p = 12 }
/kib$/ { b = 2; p = 10 }
/mib$/ { b = 2; p = 20 }
/gib$/ { b = 2; p = 30 }
/tib$/ { b = 2; p = 40 }
p { printf "%u\n", $1*b^p }
'
You can add ; next after every p assignment in the main body if you like but it won't affect the output, just improve the efficiency which would matter if your input was thousands of lines long.
As the message says, you can't use next in a function. You have to place it after each function call:
/KB?$/ { printpower($1, 10, 3); next; }
/MB?$/ { printpower($1, 10, 6); next; }
...
But you can just let awk test the remaining patterns (no next anywhere) if you don't mind the extra CPU cycles. Note that the parentheses around B are redundant and I have removed them.
$ dehumanise 1000MiB 19Ki
1048576000
19456
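Putting these pieces together, a complete version of the shell function might look like the sketch below. It keeps next in the main rules rather than in the function, and lowercases the input instead of relying on IGNORECASE. Two details are my own assumptions rather than part of the answers above: the trailing b is optional on the binary units so that input such as 19Ki still matches, and the format is "%.0f" rather than "%u" to avoid depending on how a given awk build handles large integers.

```shell
dehumanise() {
    for v in "$@"; do
        printf '%s\n' "$v" | awk '
            # Helper that only prints; flow control stays in the main rules.
            function printpower(n, b, p) { printf "%.0f\n", n * b ^ p }
            { $0 = tolower($0) }                 # portable stand-in for IGNORECASE
            /[0-9]$/ { print $1; next }
            /kb?$/   { printpower($1, 10, 3);  next }
            /mb?$/   { printpower($1, 10, 6);  next }
            /gb?$/   { printpower($1, 10, 9);  next }
            /tb?$/   { printpower($1, 10, 12); next }
            /kib?$/  { printpower($1, 2, 10);  next }
            /mib?$/  { printpower($1, 2, 20);  next }
            /gib?$/  { printpower($1, 2, 30);  next }
            /tib?$/  { printpower($1, 2, 40);  next }'
    done
}

dehumanise 1000MiB 19Ki 10m 4kb
```

The demo call prints 1048576000, 19456, 10000000 and 4000, one per line.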
You could use a control variable in your function and check the value of the variable to decide to use next in the main routine.
# MAIN
{
myfunction(test)
if (result == 1) next
# result is not 1, just continue
# more statements
}
function myfunction(a) {
# default result is 0
result = 0
# some test
if ($1 ~ /searchterm/) {
result = 1
}
}

My awk user function isn't working in a bash script

I am trying to write an awk script in Ubuntu as a non-admin user. It takes four command-line arguments and stores them in variables. Those variables are then passed to a function I wrote, which computes their average, and the script prints it.
Here is my script:
#!/usr/bin/gawk -f
BEGIN{
one = ARGV[1];
two = ARGV[2];
three = ARGV[3];
four = ARGV[4];
function average_funct(one, two, three, four)
{
total = one + two;
total = total + three;
total = total + four;
average = total / 4;
return average;
}
print("The average of these numbers is " average_funct(one, two, three, four));
}
To run it I have been using this:
./myaverage4 2 7 4 3
Which results in this error message:
gawk: ./myaverage4:9: function average_funct(one, two, three, four)
gawk: ./myaverage4:9: ^ syntax error
gawk: ./myaverage4:15: return average;
gawk: ./myaverage4:15: ^ `return' used outside function context
If anyone could help me figure out the problem that would be awesome.
You can't declare a function inside the BEGIN section or any other action block. Move it outside of all action blocks.
function foo() { ... }
BEGIN { foo() }
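Applied to the script from the question, a minimal sketch could look as follows. It is wrapped in a shell function named average4, which is an assumption made so the example can be run directly. The awk program itself could equally be saved in a file with the gawk shebang.

```shell
average4() {
    awk '
        # The function is defined at the top level, outside any action block.
        function average_funct(a, b, c, d) { return (a + b + c + d) / 4 }
        BEGIN {
            print "The average of these numbers is " \
                average_funct(ARGV[1], ARGV[2], ARGV[3], ARGV[4])
        }
    ' "$@"
}

average4 2 7 4 3
# prints: The average of these numbers is 4
```

Since the program has only a BEGIN rule, awk never tries to open the numeric arguments as input files.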
I assume you have some reason for writing your code the way you did rather than using the more obvious approach, which adapts to any number of arguments:
function average_funct(arr, total, cnt)
{
for (cnt=1; cnt in arr; cnt++) {
total += arr[cnt]
}
return (--cnt ? total / cnt : 0)
}
BEGIN {
print "The average of these numbers is", average_funct(ARGV)
}
