Here's the deal: I need to read a specific number of bytes, which will be processed later on. I've encountered a strange phenomenon, though, and I couldn't wrap my head around it. Maybe someone else can? :)
NOTE: The following code examples are slimmed-down versions just to show the effect!
A way of doing this, at least with gawk, is to set RS to a catch-all regex and then use RT to see what has been matched:
RS="[\x00-\xFF]"
Then, quite simply use the following awk-script:
BEGIN {
ORS=""
OFS=""
RS="[\x00-\xFF]"
}
{
print RT
}
This is working fine:
$ echo "abcdef" | awk -f bug.awk
abcdef
However, I'll need several files to be accessed, so I am forced to use getline:
BEGIN {
ORS=""
OFS=""
RS="[\x00-\xFF]"
while (getline)
{
print RT
}
}
This is seemingly equivalent to the above, but when running it, there is a nasty surprise:
$ echo "abcdef" | awk -f bug.awk
abc
This means that, for some reason, getline is encountering the EOF condition 3 bytes early. So, did I miss something that I should know about the internals of bash/Linux buffering, or did I find a dreadful bug?
Just for the record: I am using GNU Awk 4.0.1 on Ubuntu 14.04 LTS (Linux 3.13.0/36)
Any tips, guys?
UPDATE: I am using getline because I have previously read and preprocessed the file(s) and stored them in file(s) under /dev/shm/. Then I need to do a few final processing steps. The above examples are just bare-minimum scripts to show the problem.
Seems like this is a manifestation of the bug reported here, which (if I understand it correctly) has the effect of terminating the getline prematurely when close to the end of input, rather than at the end of input.
The bug fixes seem to have been committed on May 9 and May 10, 2014, so if you can upgrade to version 4.1 it should fix the problem.
If all you need to do is read a specified number of bytes, I'd suggest that awk is not the ideal tool, regardless of bugs. Instead, you might consider one of the following two standard utilities, which will be able to do the work rather more efficiently:
head -c $count
or
dd bs=$count count=1
With dd you can explicitly set the input file (if=PATH) and output file (of=PATH) if stdin/stdout are not appropriate. With head you can specify the input file as a positional parameter, but the output always goes to stdout.
See man head and man dd for more details.
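For example, to grab the first 1024 bytes of one of your preprocessed files (the file names here are just placeholders):
count=1024
head -c "$count" /dev/shm/preprocessed.dat > chunk.dat
# or equivalently, with dd and explicit input/output files:
dd if=/dev/shm/preprocessed.dat of=chunk.dat bs="$count" count=1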
Fortunately, using GNU Awk 4.1.3 (on a Mac), your program with getline works as expected:
echo "abcdef" | gawk 'BEGIN{ORS="";OFS="";RS="[\x00-\xFF]";
while (getline) {print RT}}'
abcdef
$ gawk --version
GNU Awk 4.1.3, API: 1.1
Related
I recently asked how to use awk to filter and output based on a searched pattern. I received some very useful answers, the one by user #anubhava being the one I found most straightforward and elegant. For the sake of clarity I am going to repeat some information from the original question.
I have a large CSV file (around 5GB). I need to identify 30 categories (in the action_type column) and create a separate file with only the rows matching each category.
My input file dataset.csv is something like this:
action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC
I am using the following to get the results I want (again, this is thanks to #anubhava).
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
This works as expected, but I have found it quite slow. It has been running for 14 hours now and, based on the size of the output files compared to the original file, it is not even at 20% of the whole process.
I am running this on Windows 10 with an AMD Ryzen PRO 3500 200MHz, 4 Cores, 8 Logical Processors, 16GB Memory and an SSD drive. I am using GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.0). My CPU is currently at 30% and Memory at 51%. I am running awk inside a Cygwin64 Terminal.
I would love to hear some suggestions on how to improve the speed. As far as I can see it is not a capacity problem. Could it be the fact that this is running inside Cygwin? Is there an alternative solution? I was thinking about Silver Searcher but could not quite work out how to do the same thing awk is doing for me.
As always, I appreciate any advice.
with sorting:
awk -F, 'NR > 1{if(!seen[$2]++ && fn) close(fn); fn = $2 "_dataset.csv"; print >> fn}' <(sort -t, -nk2 dataset.csv)
or with gawk (which allows an unlimited number of open file descriptors):
gawk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' dataset.csv
This is the right way to do it using any awk:
$ tail -n +2 file | sort -t, -k2,2n |
awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'
The reason I say this is the right approach is that it doesn't rely on the 2nd field of the header line coming before the data values when sorted, doesn't require awk to test NR > 1 for every line of input, doesn't need an array to store $2s or any other values, and only keeps 1 output file open at a time (the more files open at once, the slower any awk will run, especially gawk once you get past the limit of open files supported by other awks, as gawk then has to start opening/closing the files in the background as needed). It also doesn't require you to empty existing output files before you run it (it will do that automatically), and it only does string concatenation to create the output file name once per output file, not once per line.
Just like the currently accepted answer, the sort above could reorder input lines that have the same $2 value. Add -s if that's undesirable and you have GNU sort; with other sorts you'd need to replace the tail with a different awk command and add another sort argument.
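For example, with GNU sort the stable variant would simply be:
$ tail -n +2 file | sort -s -t, -k2,2n |
    awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'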
I have a pipe delimited feed file which has several fields. Since I only need a few, I thought of using awk to capture them for my testing purposes. However, I noticed that printf changes the value if I use "%d". It works fine if I use "%s".
Feed File Sample:
[jaypal:~/Temp] cat temp
302610004125074|19769904399993903|30|15|2012-01-13 17:20:02.346000|2012-01-13 17:20:03.307000|E072AE4B|587244|316|13|GSM|1|SUCC|0|1|255|2|2|0|213|2|0|6|0|0|0|0|0|10|16473840051|30|302610|235|250|0|7|0|0|0|0|0|10|54320058002|906|722310|2|0||0|BELL MOBILITY CELLULAR, INC|BELL MOBILITY CELLULAR, INC|Bell Mobility|AMX ARGENTINA SA.|Claro aka CTI Movil|CAN|ARG|
I am interested in capturing the second column which is 19769904399993903.
Here are my tests:
[jaypal:~/Temp] awk -F"|" '{printf ("%d\n",$2)}' temp
19769904399993904 # Value is changed
However, the following two tests work fine:
[jaypal:~/Temp] awk -F"|" '{printf ("%s\n",$2)}' temp
19769904399993903 # Value remains same
[jaypal:~/Temp] awk -F"|" '{print $2}' temp
19769904399993903 # Value remains same
So is this a limitation of "%d" not being able to handle long integers? If that's the case, why would it add one to the number instead of maybe truncating it?
I have tried this with BSD and GNU versions of awk.
Version Info:
[jaypal:~/Temp] gawk --version
GNU Awk 4.0.0
Copyright (C) 1989, 1991-2011 Free Software Foundation.
[jaypal:~/Temp] awk --version
awk version 20070501
Starting with GNU awk 4.1 you can use --bignum or -M
$ awk 'BEGIN {print 19769904399993903}'
19769904399993904
$ awk --bignum 'BEGIN {print 19769904399993903}'
19769904399993903
§ Command-Line Options
I believe the underlying numeric format in this case is an IEEE double. So the changed value is a result of floating point precision errors. If it is actually necessary to treat the large values as numerics and to maintain accurate precision, it might be better to use something like Perl, Ruby, or Python which have the capabilities (maybe via extensions) to handle arbitrary-precision arithmetic.
UPDATE: Recent versions of GNU awk support arbitrary precision arithmetic. See the GNU awk manual for more info.
ORIGINAL POST CONTENT:
XMLgawk supports arbitrary precision arithmetic on floating-point numbers.
So, if installing xgawk is an option:
zsh-4.3.11[drado]% awk --version |head -1; xgawk --version | head -1
GNU Awk 4.0.0
Extensible GNU Awk 3.1.6 (build 20080101) with dynamic loading, and with statically-linked extensions
zsh-4.3.11[drado]% awk 'BEGIN {
x=665857
y=470832
print x^4 - 4 * y^4 - 4 * y^2
}'
11885568
zsh-4.3.11[drado]% xgawk -lmpfr 'BEGIN {
MPFR_PRECISION = 80
x=665857
y=470832
print mpfr_sub(mpfr_sub(mpfr_pow(x, 4), mpfr_mul(4, mpfr_pow(y, 4))), 4 * y^2)
}'
1.0000000000000000000000000
This was partially answered by #Mark Wilkins and #Dennis Williamson already, but I found out that the largest integer that can be handled without losing precision (with awk's 64-bit IEEE doubles) is 2^53.
E.g., awk's reference page:
http://www.gnu.org/software/gawk/manual/gawk.html#Integer-Programming
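A quick way to see that boundary yourself (output from a non-bignum awk; 2^53 is 9007199254740992):
$ awk 'BEGIN { printf "%d %d\n", 2^53, 2^53 + 1 }'
9007199254740992 9007199254740992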
(sorry if my answer is too old. Figured I'd still share for the next person before they spend too much time on this like I did)
You're running into Awk's Floating Point Representation Issues. I don't think you can find a work-around within awk framework to perform arithmetic on huge numbers accurately.
The only possible (and crude) way I can think of is to break the huge number into smaller chunks, perform your math, and join them again, or better yet use Perl/PHP/TCL/bsh etc. scripting languages that are more powerful than awk.
Using nawk on Solaris 11, I convert the number to a string by concatenating an empty string to the end, and then use %15s as the format string:
printf("%15s\n", bignum "")
Another caveat about precision: the errors pile up with extra operations:
echo 19769904399993903 | mawk2 '{ CONVFMT = "%.2000g";
OFMT = "%.20g";
} {
print;
print +$0;
print $0/1.0
print $0^1.0;
print exp(-log($0))^-1;
print exp(1*log($0))
print sqrt(exp(exp(log(20)-log(10))*log($0)))
print (exp(exp(log(6)-log(3))*log($0)))^2^-1
}'
19769904399993903
19769904399993904
19769904399993904
19769904399993904
19769904399993912
19769904399993908
19769904399993628 <<<-- off by -275
19769904399993768 <<<-- off by -135
The first few are only off by less than 10; the last two expressions have triple-digit deltas.
For any of the versions that require calling helper math functions, simply passing the -M bignum flag is insufficient; one must also set the PREC variable.
For this example, setting PREC=64 and OFMT="%.17g" should suffice.
Beware of setting OFMT too high relative to PREC; otherwise you'll see oddities like this:
gawk -M -v PREC=256 -e '{ CONVFMT="%.2000g"; OFMT="%.80g";... } '
19769904399993903
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
This happens because 80 significant decimal digits require a precision of at least 265.75 bits, so basically 266 bits; but gawk is fast enough that you can probably safely pre-set PREC=4096 or 8192 instead of having to worry about it every time.
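A minimal sketch of the suggested settings (assuming a gawk built with MPFR support):
$ echo '19769904399993903' | gawk -M -v PREC=64 'BEGIN{ OFMT="%.17g" } { print +$0 }'
19769904399993903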
I used the awk command below to create a new UUID column in a table in my existing .dat files.
$ awk '("uuidgen" | getline uuid) > 0 {print uuid "|" $0} {close("uuidgen")}' $filename > ${filename}.pk
The problem is that my .dat files are pretty big (like 50-60 GB) and this awk command takes hours even on small data files (like 15MB).
Is there any way to increase the speed of this awk command?
I wonder if you might save time by not having awk open and close uuidgen every line.
$ function regen() { while true; do uuidgen; done; }
$ coproc regen
$ awk -v f="$filename" '!(getline line < f){exit} {print $0,line}' OFS="|" < /dev/fd/${COPROC[0]} > "$filename".pk
This has awk reading your "real" filename from a variable, and the uuid from stdin, because the call to uuidgen is handled by a bash "coprocess". The funky bit around the getline is to tell awk to quit once it runs out of input from $filename. Also, note that awk is taking input from input redirection instead of reading the file directly. This is important; the file descriptor at /dev/fd/## is a bash thing, and awk can't open it.
This should theoretically save you time doing unnecessary system calls to open, run and close the uuidgen binary. On the other hand, the coprocess is doing almost the same thing anyway by running uuidgen in a loop. Perhaps you'll see some improvement in an SMP environment. I don't have a 50GB text file handy for benchmarking. I'd love to hear your results.
Note that coproc is a feature that was introduced in bash version 4, and use of /dev/fd/* requires that bash be compiled with file descriptor support. On my system, it also means I have to make sure fdescfs(5) is mounted.
I just noticed the following on my system (FreeBSD 11):
$ /bin/uuidgen -
usage: uuidgen [-1] [-n count] [-o filename]
If your uuidgen also has a -n option, then adding it to your regen() function with ANY value might be a useful optimization, to reduce the number of times the command needs to be reopened. For example:
$ function regen() { while true; do uuidgen -n 100; done; }
This would result in uuidgen being called only once every 100 lines of input, rather than for every line.
And if you're running Linux, depending on how you're set up, you may have an alternate source for UUIDs. Note:
$ awk -v f=/proc/sys/kernel/random/uuid '{getline u<f; close(f); print u,$0}' OFS="|" "$filename" > "$filename".pk
This doesn't require the bash coproc, it just has awk read a random uuid directly from a Linux kernel function that provides them. You're still closing the file handle for every line of input, but at least you don't have to exec the uuidgen binary.
YMMV. I don't know what OS you're running, so I don't know what's likely to work for you.
Your script is calling shell to call awk to call shell to call uuidgen. Awk is a tool for manipulating text; it's not a shell (an environment to call other tools from), so don't do that. Just call uuidgen from the shell:
$ cat file
foo .*
bar stuff
here
$ xargs -d $'\n' -n 1 printf '%s|%s\n' "$(uuidgen)" < file
5662f3bd-7818-4da8-9e3a-f5636b174e94|foo .*
5662f3bd-7818-4da8-9e3a-f5636b174e94|bar stuff
5662f3bd-7818-4da8-9e3a-f5636b174e94|here
I'm just guessing that the real problem here is that you're running a sub-process for each line. You could read your file explicitly line by line and read output from a batch-uuidgen line by line, and thus only have a single subprocess to handle at once. Unfortunately, uuidgen doesn't work that way.
Maybe another solution?
perl -MData::UUID -ple 'BEGIN{ $ug = Data::UUID->new } $_ = lc($ug->to_string($ug->create)) . " | " . $_' $filename > ${filename}.pk
Might this be faster?
I've written a simple parser in BASH to take apart csv files and dump to a (temp) SQL-input file. The performance on this is pretty terrible; when running on a modern system I'm barely cracking 100 lines per second. I realize the ultimate answer is to rewrite this in a more performance oriented language, but as a learning opportunity, I'm curious where I can improve my BASH skills.
I suspect there are gains to be made by writing to RAM (a variable) instead of to a file, then flushing all the text at once to the file, but I'm not clear on where/when BASH gets upset about memory usage (the largest files I've parsed have been under 500MB).
The following code block seems to eat most of the cycles and, as I understand it, needs to be processed linearly due to checking timestamps (the data has a timestamp but no date stamp, so I was forced to ask the user for the start day and check whether the timestamp has cycled 24:00 -> 0:00), so parallel processing didn't seem like an option.
while read p; do
linetime=`printf "${p}" | awk '{printf $1}'`
# THE DATA LACKS FULL DATESTAMPS, SO FORCED TO ASK USER FOR START-DAY & CHECK IF THE DATE HAS CYCLED
if [[ "$lastline" > "$linetime" ]]
then
experimentdate=$(eval $datecmd)
fi
lastline=$linetime
printf "$p" | awk -v varout="$projname" -v experiment_day="$experimentdate " -v singlequote="$cleanquote" '{printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values ("singlequote""varout""singlequote","singlequote""experiment_day $1""singlequote","singlequote""$1""singlequote","$2","$3");\n"}' >> $sql_input_file
Ignore the singlequote nonsense; I needed this to run on both OSX and 'nix, so I had to work around some issues with OSX's awk and single quotes.
Any suggestions for how I can improve performance?
You do not want to start awk for every line you process in a loop. Replace your loop with awk or replace awk with builtin commands.
Both awk calls are only used for printing. Replace these lines with additional parameters to the printf builtin.
I did not understand the code block for datecmd (it does not use $linetime but does use the output variable experimentdate), but this is another spot that should be optimised: can you use regular expressions or some other trick?
So you do not have to tune awk; rather, decide either to use awk completely or to get it out of your while loop.
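For instance, both awk calls could be dropped by letting the read builtin split the line and the shell's own printf do the formatting. A rough sketch (the read replaces the first awk line, the printf replaces the last one, and the date check in between stays as it is; the field names seconds and intensity are assumed from the INSERT statement):
# split the line once with the read builtin instead of spawning awk
read -r linetime seconds intensity rest <<< "$p"
# format the INSERT directly with printf; no awk quoting workaround needed here
printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values ('%s','%s %s','%s',%s,%s);\n" \
    "$projname" "$experimentdate" "$linetime" "$linetime" "$seconds" "$intensity" >> "$sql_input_file"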
Your performance would improve if you did all the processing with awk. Awk can read your input file directly, express conditionals, and run external commands.
Awk is not the only one either. Perl and Python would be well suited to this task.
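If you do go the all-awk route, a rough sketch might look like this (assuming gawk for mktime/strftime, a start day passed in as YYYY-MM-DD, and that a timestamp smaller than the previous one means the clock rolled past midnight; the variable names are illustrative, not taken from the original script):
gawk -v proj="$projname" -v day="$startdate" -v q="'" '
  {
    if (prev != "" && $1 < prev) {                      # timestamp went backwards: assume the next day
      split(day, d, "-")
      day = strftime("%Y-%m-%d", mktime(d[1] " " d[2] " " d[3] " 12 0 0") + 86400)
    }
    prev = $1
    printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values (%s%s%s,%s%s %s%s,%s%s%s,%s,%s);\n",
           q, proj, q, q, day, $1, q, q, $1, q, $2, $3
  }
' "$input_file" > "$sql_input_file"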
I am working on plotting extremely large files with N relevant data entries (N varies between files).
In each of these files, comments are automatically generated at the start and end of the file, and I would like to filter these out before recombining them into one grand data set.
Unfortunately, I am using Mac OS X, where I encounter some issues when trying to remove the last line of the file. I have read that the most efficient way is to use the head/tail commands to cut off sections of data. Since head -n -1 does not work on Mac OS X, I had to install coreutils through Homebrew, where the ghead command works wonderfully. However, the command
tail -n+9 $COUNTER/test.csv | ghead -n -1 $COUNTER/test.csv >> gfinal.csv
does not work. A less than pleasing workaround was I had to separate the commands, use ghead > newfile, then use tail on newfile > gfinal. Unfortunately, this will take while as I have to write a new file with the first ghead.
Is there a workaround to incorporating both GNU Utils with the standard Mac Utils?
Thanks,
Keven
The problem with your command is that you specify the file operand again for the ghead command instead of letting it take its input from stdin via the pipe; this causes ghead to ignore stdin, so the first pipe segment is effectively ignored. Simply omit the file operand for the ghead command:
tail -n+9 "$COUNTER/test.csv" | ghead -n -1 >> gfinal.csv
That said, if you only want to drop the last line, there's no need for GNU head - OS X's own BSD sed will do:
tail -n +9 "$COUNTER/test.csv" | sed '$d' >> gfinal.csv
$ matches the last line, and d deletes it (meaning it won't be output).
Finally, as #ghoti points out in a comment, you could do it all using sed:
sed -n '9,$ {$!p;}' file
Option -n tells sed to only produce output when explicitly requested; 9,$ matches everything from line 9 through (,) the end of the file (the last line, $), and {$!p;} prints (p) every line in that range, except (!) the last ($).
I realize that your question is about using head and tail, but I'll answer as if you're interested in solving the original problem rather than figuring out how to use those particular tools to solve the problem. :)
One method using sed:
sed -e '1,8d;$d' inputfile
At this level of simplicity, GNU sed and BSD sed both work the same way. Our sed script says:
1,8d - delete lines 1 through 8,
$d - delete the last line.
If you decide to generate a sed script like this on-the-fly, beware of your quoting; you will have to escape the dollar sign if you put it in double quotes.
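For example (the shell variable name here is just illustrative):
skip=8
# inside double quotes, escape the $ that means "last line" so the shell leaves it alone
sed -e "1,${skip}d;\$d" inputfile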
Another method using awk:
awk 'NR>9{print last} NR>1{last=$0}' inputfile
This works a bit differently in order to "recognize" the last line: it buffers the previous line, only starts printing that buffer once the first 8 lines have gone by, and never prints the final line (which is still sitting in the buffer when input ends).
This awk solution is a bit of a hack, and like the sed solution, relies on the fact that you only want to strip ONE final line of the file.
If you want to strip more lines than one off the bottom of the file, you'd probably want to maintain an array that would function sort of as a buffered FIFO or sliding window.
awk -v striptop=8 -v stripbottom=3 '
  { last[NR] = $0 }                               # buffer the current line
  NR > striptop*2 { print last[NR-striptop] }     # once safely past the top, print from the far end of the buffer
  { delete last[NR-striptop] }                    # discard entries that have already been printed
  END {                                           # flush what is left, in order, holding back the bottom lines
    start = NR - striptop + 1
    if (start <= striptop) start = striptop + 1   # never emit lines that belong to the stripped top
    for (r = start; r <= NR - stripbottom; r++)
      if (r in last) print last[r]
  }
' inputfile
You specify how much to strip in variables. The last array keeps a number of lines in memory, prints from the far end of the stack, and deletes them as they are printed. The END section steps through whatever remains in the array, and prints everything not prohibited by stripbottom.