awk: next is illegal inside a function - bash

I have a short shell function to convert human-readable byte units into an integer number of bytes, e.g.,
10m to 10000000
4kb to 4000
1kib to 1024
2gib to 2147483648
Here is the code:
dehumanise() {
    for v in "$@"
    do
        echo $v | awk \
        'BEGIN{IGNORECASE = 1}
        function printpower(n,b,p) {printf "%u\n", n*b^p; next}
        /[0-9]$/{print $1;next};
        /K(B)?$/{ printpower($1, 10, 3)};
        /M(B)?$/{ printpower($1, 10, 6)};
        /G(B)?$/{ printpower($1, 10, 9)};
        /T(B)?$/{ printpower($1, 10, 12)};
        /Ki(B)?$/{printpower($1, 2, 10)};
        /Mi(B)?$/{printpower($1, 2, 20)};
        /Gi(B)?$/{printpower($1, 2, 30)};
        /Ti(B)?$/{printpower($1, 2, 40)}'
    done
}
I found the code somewhere on the internet and I am not very confident with awk. The function worked fine until I re-installed my MacBook a few days ago. Now it throws an error:
awk: next is illegal inside a function at source line 2 in function printpower
context is
function printpower(n,b,p) {printf "%u\n", n*b^p; >>> next} <<<
As far as I understand it, next is used in awk to stop processing the current record and move on to the next one. Hence, in this case, it should simply end the awk program, since there is only one line of input.
I tried moving the next statement behind each function call instead: printpower(...); next.
But this causes the function to give no output at all.
Could someone please help me repair the awk statement?
# awk --version
awk version 20121220
(the stock macOS awk version)
Solved
The missing output was probably an issue with the macOS awk version. I installed gawk and replaced the system awk with it:
brew install gawk
brew link --overwrite gawk
Now it works fine without the next statement.

Software design fundamentals: avoid inversion of control. In this case, you don't want some subordinate function suddenly taking charge of your whole processing control flow and deciding "screw you all, I'm jumping to the next record". So yes, don't put next inside a function! Having said that, POSIX doesn't say you cannot use next in a function, but neither does it explicitly say you can, so some awk implementations (apparently the one you are using) have decided to disallow it, while gawk and some other awks allow it.
You also have gawk-specific code in your script (IGNORECASE) so it will ONLY work with gawk anyway.
Here's how to really write your script to work in any awk:
awk '
{ $0=tolower($0); b=p=0 }
/[0-9]$/ { b = 1;  p = 1 }
/kb?$/   { b = 10; p = 3 }
/mb?$/   { b = 10; p = 6 }
/gb?$/   { b = 10; p = 9 }
/tb?$/   { b = 10; p = 12 }
/kib$/   { b = 2;  p = 10 }
/mib$/   { b = 2;  p = 20 }
/gib$/   { b = 2;  p = 30 }
/tib$/   { b = 2;  p = 40 }
p { printf "%u\n", $1*b^p }
'
You can add ; next after every p assignment in the main body if you like, but it won't affect the output; it would only improve efficiency, which would matter if your input were thousands of lines long.
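Dropped back into the shell function from the question, that looks like this (a sketch; only the awk body changes, the surrounding loop is the question's own):
dehumanise() {
    for v in "$@"
    do
        echo "$v" | awk '
            { $0=tolower($0); b=p=0 }
            /[0-9]$/ { b = 1;  p = 1 }
            /kb?$/   { b = 10; p = 3 }
            /mb?$/   { b = 10; p = 6 }
            /gb?$/   { b = 10; p = 9 }
            /tb?$/   { b = 10; p = 12 }
            /kib$/   { b = 2;  p = 10 }
            /mib$/   { b = 2;  p = 20 }
            /gib$/   { b = 2;  p = 30 }
            /tib$/   { b = 2;  p = 40 }
            p { printf "%u\n", $1*b^p }
        '
    done
}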

As the message says, you can't use next in a function. You have to place it after each function call:
/KB?$/ { printpower($1, 10, 3); next; }
/MB?$/ { printpower($1, 10, 6); next; }
...
But you can just let awk test the remaining patterns (no next anywhere) if you don't mind the extra CPU cycles. Note that the parentheses around B are redundant and I have removed them.
$ dehumanise 1000MiB 19Ki
1048576000
19456

You could use a control variable in your function, and check the value of that variable in the main routine to decide whether to call next:
# MAIN
{
    myfunction(test)
    if (result == 1) next
    # result is not 1, just continue
    # more statements
}

function myfunction(a) {
    # default result is 0
    result = 0
    # some test
    if ($1 ~ /searchterm/) {
        result = 1
    }
}
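Applied to the question's script, that pattern might look like this (a sketch; handled is an illustrative flag name, and only two of the suffix rules are shown):
function printpower(n,b,p) {
    printf "%u\n", n*b^p
    handled = 1              # set a flag instead of calling next
}
{ handled = 0 }              # reset the flag for every input line
/[0-9]$/ { print $1; next }
/K(B)?$/ { printpower($1, 10, 3) }
/M(B)?$/ { printpower($1, 10, 6) }
# ...the remaining suffix rules from the question...
handled { next }             # the main body, not the function, skips the record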

Related

How can I find out how many lines are between a number and the next occurrence of the same number in a file?

I have large files that each store results from very long calculations. Here's an example of a file where there are results for five time steps; there are problems with the output at the third, fourth, and fifth time steps.
(Please note that I have been lazy and have used the same numbers to represent the results at each time step in my example. In reality, the numbers would be unique at each time step.)
3
i = 1, time = 1.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770
3
i = 2, time = 1.500, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770
3
i = 3, time = 2.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 (<--Problem: calculation stopped and some numbers are missing)
3
i = 4, time = 2.500, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709 (Problem: calculation stopped and entire row is missing below)
3
i = 5, time = 3.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770 sdffs (<--Problem: rarely, additional characters can be printed but I figured out how to identify the longest lines in the file and don't have this problem this time)
The problem is that the calculations can fail (and then need to be restarted) while a result is being written to the file. That means that when I try to use the results, I have problems.
My question is: how can I find out when something has gone wrong and the results file has been messed up? The most common problem is that there are not "3" lines of results below the header (the header being the line containing i = ...). If I could find a problem line, I could then delete that time step.
Here is an example of error output I get when trying to use a messed-up file:
Traceback (most recent call last):
File "/mtn/storage/software/languages/anaconda/Anaconda3-2018.12/lib/python3.7/site-packages/aser/io/extxyz.py", line 593, in read_xyz
nentss = int(line)
ValueError: invalid literal for int() with base 10: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pythonPostProcessingCode.py", line 25, in <module>
path = read('%s%s' % (filename, fileext) , format='xyz', index=':') # <--This line tells me that Python cannot read in a particular time step because the formatting is messed up.
I am not experienced with scripting/Awk, etc, so if anyone thinks I have not used appropriate question tags, a heads-up would be welcome. Thank you.
The header plus 330 result lines mean 331 lines of text, and so:
awk 'BEGIN { RS="i =" } { split($0,bits,"\n");if (length(bits)-1==331) { print RS$0 } }' file > newfile
Explanation:
awk 'BEGIN {
    RS="i ="
}
{
    split($0,bits,"\n");
    if (length(bits)-1==331) {
        print RS$0
    }
}' file > newfile
Before processing any lines from the file called file, set the record separator to "i =". Then, for each record, use split() to split the record ($0) into an array bits, using a newline as the separator. Where the length of the array bits, less 1, is 331, print the record separator followed by the record, redirecting the output to a new file called newfile.
It sounds like this is what you want:
$ cat tst.awk
/^ i =/ {
    prt()
    expNumLines = prev + 1
    actNumLines = 2
    rec = prev RS $0
    next
}
NF == 4 {
    rec = rec RS $0
    actNumLines++
}
{ prev = $0 }
END { prt() }

function prt() {
    if ( (actNumLines == expNumLines) && (rec != "") ) {
        print "-------------"
        print rec
    }
}
$ awk -f tst.awk file
-------------
3
i = 3, time = 2.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
-------------
3
i = 5, time = 3.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
Just change the prt() function to do whatever it is you want to do with valid records.
This answer is not really bash-related, but may be of interest if performance is an issue, since you seem to handle very large files.
Considering that you can compile some very basic C programs, you may build this code:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Constants are hardcoded to make the program more readable
// but they could be passed as program arguments
const char separator[] = "i =";
const unsigned int requiredlines = 331;

int main(void) {
    char* buffer[331] = { NULL, };
    ssize_t buffersizes[331] = { 0, };
    size_t n = requiredlines + 1; // Ignore lines until the separator is found
    char* line = NULL;
    size_t len = 0;
    ssize_t nbread;
    size_t i;

    // Iterate through all lines
    while ((nbread = getline(&line, &len, stdin)) != -1) {
        // If the separator is found:
        // - print the record (if valid)
        // - reset the record (always)
        if (strstr(line, separator)) {
            if (n == requiredlines) {
                for (i = 0; i < requiredlines; ++i) printf("%s", buffer[i]);
            }
            n = 0;
        }
        // Add the line to the buffer, unless too many lines have been read
        // (in which case we may discard lines until the separator is found again)
        if (n < requiredlines) {
            if (buffersizes[n] > nbread) {
                strncpy(buffer[n], line, nbread);
                buffer[n][nbread] = '\0';
            } else {
                free(buffer[n]);
                buffer[n] = line;
                buffersizes[n] = nbread + 1;
                line = NULL;
                len = 0;
            }
        }
        ++n;
    }
    // Don't forget about the last record, if valid
    if (n == requiredlines) {
        for (i = 0; i < requiredlines; ++i) printf("%s", buffer[i]);
    }

    free(line);
    for (i = 0; i < requiredlines; ++i) free(buffer[i]);
    return 0;
}
The program can be compiled like this:
gcc -c prog.c && gcc -o prog prog.o
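If you don't need the intermediate object file, a single compile step works just as well (the -O2 optimization flag is optional):
gcc -O2 -o prog prog.c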
Then it may be executed like this:
./prog < infile > outfile
To simplify the code, it reads from stdin and outputs to stdout, but that’s more than enough in Bash considering all the options at your disposal to redirect streams. If need be, the code can be adapted to read/write directly from/to files.
I have tested it on a generated file with 10 million lines and compared it to the awk-based solution.
(time awk 'BEGIN { RS="i =" } { split($0,bits,"\n");if (length(bits)-1==331) { printf "%s",RS$0 } }' infile) > outfile
real 0m24.655s
user 0m24.357s
sys 0m0.279s
(time ./prog < infile) > outfile
real 0m1.414s
user 0m1.291s
sys 0m0.121s
With this example it runs approximately 18 times faster than the awk solution. Your mileage may vary (different data, different hardware) but I guess it should always be significantly faster.
I should mention that the awk solution is impressively fast (for a scripted solution, that is). I first tried to code the solution in C++, and it had performance similar to the awk version's; sometimes it was even slower.
It's a little bit more difficult to spread the match of a record across 2 lines to try and incorporate the i = ..., but I don't think you actually need to. It looks like a new record can be distinguished by the occurrence of a line with only one column. If that is the case, you could do something like:
awk -v c=330 -v p=1 '
function pr(n) {
    if (n - p == c) printf "%s", buf
    buf = ""
}
NF == 1 { pr(NR); p = NR; c = $1 }
{ buf = sprintf("%s%s\n", buf, $0) }
END { pr(NR+1) }' input-file
In the above, whenever a line with a single column is seen, the expectation is that that many lines will follow in the record. If that number is not matched, the record is not printed. To avoid that logic, just remove the c = $1 assignment from the NF == 1 rule. The only reason you need -v c=330 is to enable the removal of that assignment; if you want the single-column line to be the line count of the record, you can omit -v c=330.
You can use csplit to grab each record in separate files, if you are allowed to write intermediate files.
csplit -k infile '/i = /' {*}
Then you can see which records are complete and which ones are not using wc -l xx* (note: xx is the default prefix of split files).
Then you can do whatever you want with those records, including listing files that have exactly 331 lines:
wc -l xx* | sed -n 's/^ *331 \(xx.*\)/\1/p'
Should you want to build a new file with all valid records, simply concatenate them:
wc -l xx* | sed -n 's/^ *331 \(xx.*\)/\1/p' | xargs cat > newfile
You can also, among other uses, archive failed records:
wc -l xx* | sed -e '$d' -e '/^ *331 /d' -e 's/^ *[0-9]* //' | xargs cat > failures
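When you are done with the intermediate files, you can remove them (assuming nothing else in the directory matches the default xx prefix):
rm xx*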

How do I make a list of missing integers from a sequence using bash

I have a file, let's say files_190911.csv, whose contents are as follows:
EDR_MPU023_09_20190911080534.csv.gz
EDR_MPU023_10_20190911081301.csv.gz
EDR_MPU023_11_20190911083544.csv.gz
EDR_MPU023_14_20190911091405.csv.gz
EDR_MPU023_15_20190911105513.csv.gz
EDR_MPU023_16_20190911105911.csv.gz
EDR_MPU024_50_20190911235332.csv.gz
EDR_MPU024_51_20190911235400.csv.gz
EDR_MPU024_52_20190911235501.csv.gz
EDR_MPU024_54_20190911235805.csv.gz
EDR_MPU024_55_20190911235937.csv.gz
EDR_MPU025_24_20190911000050.csv.gz
EDR_MPU025_25_20190911000155.csv.gz
EDR_MPU025_26_20190911000302.csv.gz
EDR_MPU025_29_20190911000624.csv.gz
I want to make a list of the missing sequence numbers using a bash script.
Every MPUXXX has its own sequence, so there are multiple series of sequences in that file.
The datetime for a missing entry should be taken from the previous sequence number.
From the sample above, the result will be like this:
EDR_MPU023_12_20190911083544.csv.gz
EDR_MPU023_13_20190911083544.csv.gz
EDR_MPU024_53_20190911235501.csv.gz
EDR_MPU025_27_20190911000302.csv.gz
EDR_MPU025_28_20190911000302.csv.gz
It would be simpler if there were only a single sequence.
So I can use something like this.
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}'
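For a single sequence that works; for example, on a toy input of one ascending number per line:
$ printf '1\n2\n5\n7\n' | awk '{for(i=p+1; i<$1; i++) print i} {p=$1}'
3
4
6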
But I know this can't be used for multiple sequences.
EDITED (Thanks @Cyrus!)
AWK is your friend:
#!/usr/bin/awk -f
BEGIN {
    FS = "[^0-9]*";
    last_seq = 0;
    next_serial = 0;
}
{
    cur_seq = $2;
    cur_serial = $3;
    if (cur_seq != last_seq) {
        last_seq = cur_seq;
        ts = $4;
        prev = cur_serial;
    } else {
        if (cur_serial == next_serial) {
            ts = $4;
        } else {
            for (i = next_serial; i < cur_serial; i++) {
                print "EDR_MPU" last_seq "_" i "_" ts ".csv.gz"
            }
        }
    }
    next_serial = cur_serial + 1;
}
And then you do:
$ < files_190911.csv awk -f script.awk
EDR_MPU023_12_20190911083544.csv.gz
EDR_MPU023_13_20190911083544.csv.gz
EDR_MPU024_53_20190911235501.csv.gz
EDR_MPU025_27_20190911000302.csv.gz
EDR_MPU025_28_20190911000302.csv.gz
The assignment to FS splits each line on runs of non-digit characters, so the numeric parts become fields. The rest of the program detects holes in the sequences and prints the missing names with the appropriate timestamp.
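For example, one of the question's filenames splits like this (shown with gawk; an FS regex that can also match the empty string is handled differently across awk implementations, and this script relies on it matching the runs of non-digit characters):
$ echo 'EDR_MPU023_09_20190911080534.csv.gz' | gawk -F '[^0-9]*' '{ print $2, $3, $4 }'
023 09 20190911080534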

My awk user function isn't working in a bash script

I am trying to write an awk script in Ubuntu as a non-admin user. It takes four command-line arguments and stores them in variables. Those variables are then passed to a function I made, which computes the average and prints it.
Here is my script:
#!/usr/bin/gawk -f

BEGIN{
    one = ARGV[1];
    two = ARGV[2];
    three = ARGV[3];
    four = ARGV[4];

    function average_funct(one, two, three, four)
    {
        total = one + two;
        total = total + three;
        total = total + four;
        average = total / 4;
        return average;
    }

    print("The average of these numbers is " average_funct(one, two, three, four));
}
To run it I have been using this:
./myaverage4 2 7 4 3
Which results in this error message:
gawk: ./myaverage4:9: function average_funct(one, two, three, four)
gawk: ./myaverage4:9: ^ syntax error
gawk: ./myaverage4:15: return average;
gawk: ./myaverage4:15: ^ `return' used outside function context
If anyone could help me figure out the problem that would be awesome.
You can't declare a function inside the BEGIN section or any other action block. Move it outside of all action blocks.
function foo() { ... }
BEGIN { foo() }
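Rearranged that way, the original script works. Here is a minimal sketch of the fixed version (keeping the question's four-argument design, slightly simplified):
#!/usr/bin/gawk -f

function average_funct(one, two, three, four)
{
    return (one + two + three + four) / 4
}

BEGIN {
    print("The average of these numbers is " average_funct(ARGV[1], ARGV[2], ARGV[3], ARGV[4]))
}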
I assume you have some reason for writing your code the way you did rather than in the more obvious way, which adapts to any number of arguments:
function average_funct(arr, total, cnt)
{
    for (cnt=1; cnt in arr; cnt++) {
        total += arr[cnt]
    }
    return (--cnt ? total / cnt : 0)
}

BEGIN {
    print "The average of these numbers is", average_funct(ARGV)
}
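Since ARGV[0] holds the program name, the numeric arguments sit in ARGV[1] through ARGV[4], which is why the loop starts at index 1. With the question's invocation, this prints:
$ ./myaverage4 2 7 4 3
The average of these numbers is 4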

Expand range of numbers in file

I have a file with delimited integers which I've extracted from elsewhere and dumped into a file. Some lines contain a range, as per the below:
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3-9,10 have problems
Cars 1-5,5-10 are in the depot
Trains 1-10 are on time
Any way to expand the ranges in the text file so that each individual number is returned, with the , delimiter preserved? The text on either side of the integers could be anything, and I need it preserved.
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
I guess this can be done relatively easily with awk, if not with any other scripting language. Any help very much appreciated.
You haven't tagged the question with perl, but I'd recommend it in this case:
perl -pe 's/(\d+)-(\d+)/join(",", $1..$2)/ge' file
This substitutes all occurrences of one or more digits, followed by a hyphen, followed by one or more digits. It uses the numbers it has captured to create a list from the first number to the second and joins the list on a comma.
The e modifier is needed here so that an expression can be evaluated in the replacement part of the substitution.
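Run against the question's sample input, the one-liner expands every range in place; note the duplicated 5 on the Cars line, produced by the overlapping ranges:
$ perl -pe 's/(\d+)-(\d+)/join(",", $1..$2)/ge' file
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time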
To avoid repeated values and to sort the list, things get a little more complicated. At this point, I'd recommend using a script, rather than a one-liner:
use strict;
use warnings;
use List::MoreUtils qw(uniq);

while (<>) {
    s/(\d+)-(\d+)/join(",", $1..$2)/ge;
    if (/(.*\s)((\d+,)+\d+)(.*)/) {
        my @list = sort { $a <=> $b } uniq split(",", $2);
        $_ = $1 . join(",", @list) . $4 . "\n";
    }
} continue {
    print;
}
After expanding the ranges (as in the one-liner), I re-parse the line to extract the list of values. I've used uniq from List::MoreUtils (a CPAN module) to remove any duplicates and sorted the values numerically.
Call the script like perl script.pl file.
A solution using GNU awk (the four-argument split() and asort() used here are gawk extensions):
{
    result = "";
    count = split($0, fields, /[ ,-]+/, seps);
    for (i = 1; i <= count; i++) {
        if (fields[i] ~ /[0-9]+/) {
            if (seps[i] == ",") {
                numbers[fields[i]] = fields[i];
            } else if (seps[i] == "-") {
                for (j = fields[i] + 1; j <= fields[i+1]; j++) {
                    numbers[j] = j;
                }
            } else if (seps[i] == " ") {
                numbers[fields[i]] = fields[i];
                c = asort(numbers);
                for (r = 1; r < c; r++) {
                    result = result numbers[r] ",";
                }
                result = result numbers[c] " ";
            }
        } else {
            result = result fields[i] seps[i];
        }
    }
    print result;
}
$ cat tst.awk
match($0,/[0-9,-]+/) {
    split(substr($0,RSTART,RLENGTH),numsIn,/,/)
    numsOut = ""
    delete seen
    for (i=1; i in numsIn; i++) {
        n = split(numsIn[i],range,/-/)
        for (j=range[1]; j<=range[n]; j++) {
            if ( !seen[j]++ ) {
                numsOut = (numsOut=="" ? "" : numsOut ",") j
            }
        }
    }
    print substr($0,1,RSTART-1) numsOut substr($0,RSTART+RLENGTH)
}
$ awk -f tst.awk file
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
Another awk:
$ awk '{while(match($0, /[0-9]+-[0-9]+/)) {
           k=substr($0, RSTART, RLENGTH)
           split(k,a,"-")
           f=a[1]
           for(j=a[1]+1; j<=a[2]; j++) f=f","j
           sub(k,f)}}1' file
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
Note that Cars 1-5,5-10 ends up with two 5 values when expanded, due to the overlapping ranges.

How to define new commands or macros in awk

I'd like to define a new command that wraps an existing awk command, such as print. However, I do not want to use a function:
#wrap command with function
function warn(text) { print text > "/dev/stderr" }
NR%1e6 == 0 {
warn("processed rows: "NR)
}
Instead, I'd like to define a new command that can be invoked without parentheses:
#wrap command with new command ???
define warn rest... { print rest... > "/dev/stderr" }
NR%1e6 == 0 {
warn "processed rows: "NR
}
One solution I can imagine is using a preprocessor, and maybe setting up the shebang of the awk script to invoke this preprocessor followed by awk. However, I was hoping for a pure awk solution.
Note: The solution should also work in mawk, which I use, because it is much faster than vanilla GNU/awk.
Update: The discussion revealed that gawk (GNU/awk) can be quite fast and mawk is not required.
You cannot do this within any awk, and you cannot do it robustly outside of awk without writing an awk language parser; by that point you may as well write your own awk-like command, which would then no longer really be awk, in as much as it would not behave the same as any other command by that name.
It is odd that you refer to GNU awk as "vanilla", when it has many more useful features than any other currently available awk, while mawk is simply a stripped-down awk optimized for speed, which is only necessary in very rare circumstances.
Looking at Mawk's source I see that commands are special and cannot be added at runtime. From kw.c:
keywords[] =
{
    { "print", PRINT },
    { "printf", PRINTF },
    { "do", DO },
    { "while", WHILE },
    { "for", FOR },
    { "break", BREAK },
    { "continue", CONTINUE },
    { "if", IF },
    { "else", ELSE },
    { "in", IN },
    { "delete", DELETE },
    { "split", SPLIT },
    { "match", MATCH_FUNC },
    { "BEGIN", BEGIN },
    { "END", END },
    { "exit", EXIT },
    { "next", NEXT },
    { "nextfile", NEXTFILE },
    { "return", RETURN },
    { "getline", GETLINE },
    { "sub", SUB },
    { "gsub", GSUB },
    { "function", FUNCTION },
    { (char *) 0, 0 }
};
You could add a new command by patching Mawk's C code.
I created a shell wrapper script called cppawk which combines the C preprocessor (from GCC) with Awk.
BSD licensed, it comes with a man page, regression tests and simple install instructions.
Normally, the C preprocessor creates macros that look like functions; but using certain control-flow tricks, which work in Awk much as they do in C, we can pull off minor miracles of syntactic sugar. The for header below runs its body exactly once: the body assigns the rest of the source line to __x, and the loop's third expression then hands __x to __warn(), which prints it and returns 0, terminating the loop:
function __warn(x)
{
    print x
    return 0
}

#define warn for (__w = 1; __w; __w = __warn(__x)) __x =

NR % 5 == 0 {
    warn "processed rows: "NR
}
Run:
$ cppawk -f warn.cwk
a
b
c
d
e
processed rows: 5
f
g
h
i
j
processed rows: 10
k
Because the entire for trick is in a single line of code, we could use the __LINE__ symbol to make the hidden variables quasi-unique:
function __warn(x)
{
    print x
    return 0
}

#define xcat(a, b, c) a ## b ## c
#define cat(a, b, c) xcat(a, b, c)
#define uq(sym) cat(__, __LINE__, sym)
#define warn for (uq(w) = 1; uq(w); uq(w) = __warn(uq(x))) uq(x) =

NR % 5 == 0 {
    warn "processed rows: "NR
}
The expansion is:
$ cppawk --prepro-only -f warn.cwk
# 1 "<stdin>"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "<stdin>"
function __warn(x)
{
print x
return 0
}
NR % 5 == 0 {
    for (__13w = 1; __13w; __13w = __warn(__13x)) __13x = "processed rows: "NR
}
The uq() macro interpolated 13 into the variables because warn is called on line 13.
Hope you like it.
PS, maybe don't do this, but find some less hacky way of using cppawk.
You can use C99/GNU C variadic macros, for instance:
#define warn(...) print __VA_ARGS__ >> "/dev/stderr"

NR % 5 == 0 {
    warn("processed rows:", NR)
}
We made a humble print wrapper which redirects to standard error. It seems like nothing, yet you can't do that with an Awk function: not without making it a one-argument function and passing the value of an expression that concatenates everything.
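After preprocessing, the call site becomes an ordinary print statement; this is the expansion the macro produces:
NR % 5 == 0 {
    print "processed rows:", NR >> "/dev/stderr"
}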
