Does AWK Buffer Multiple Print Statements Before Writing Them to Disc? - performance

I have an AWK script that writes tens of thousands of pretty long lines to a couple of files and nearly ten thousand lines to a few more files on a network drive (all needed for different purposes). I would like to make the file I/O as efficient as possible for a few reasons.
Does AWK immediately write to a file with every print(f) statement or does it buffer them? If so, how much buffering goes on?
I am considering writing everything to a buffer (e.g., rec1 "\n" rec2 "\n" rec3...) and then dumping it all with a single print command, but not if it won't have a net benefit.
I am curious, not just for this program, but also to sharpen my "best practices" skills. I program a lot in AWK, but haven't been able to find the answer to this, yet.
Thanks in advance...

Yes, as you can read in the GNU Awk manual under "I/O Functions". In fact, that is why fflush() (accepted for inclusion in POSIX) exists: to flush the buffers. And here is some practical evidence.

As @Quasimodo points out, yes, awk buffers its output by default, and you can bypass that by inserting fflush() calls if you like.
As for the other part of your question ("I am considering writing everything to a buffer (e.g., rec1 "\n" rec2 "\n" rec3...) and then dumping it all with a single print command, but not if it won't have a net benefit"): constantly appending to a variable using string concatenation in awk is roughly as slow as I/O. Awk has to keep finding a new memory area big enough to hold the result of the concatenation, move the contents of the old location to the new one, append the new text, and then free the previous area. So there would be no noticeable gain in execution speed from buffering and then printing all at once versus just printing as you go.
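The effect of output buffering can also be sketched outside awk. The following Python snippet (Python is used elsewhere on this page) is only an analogy, not awk's implementation: `CountingRaw` and `count_raw_writes` are hypothetical names, and the 8192-byte buffer size is an assumption picked for illustration. It counts how often the underlying "device" is actually written to, with and without a flush after every line.

```python
import io


class CountingRaw(io.RawIOBase):
    """Stand-in for a file descriptor that only counts what reaches it."""

    def __init__(self):
        super().__init__()
        self.calls = 0    # number of low-level writes
        self.nbytes = 0   # total bytes received

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.nbytes += len(b)
        return len(b)


def count_raw_writes(lines, flush_each):
    """Write all lines through a buffer; return (raw write calls, bytes)."""
    raw = CountingRaw()
    out = io.BufferedWriter(raw, buffer_size=8192)  # assumed buffer size
    for line in lines:
        out.write(line)
        if flush_each:
            out.flush()   # comparable in spirit to calling fflush() after each print
    out.flush()           # push out whatever is still buffered
    return raw.calls, raw.nbytes
```

With 1,000 lines of 50 bytes each, the flushed variant reaches the raw stream at least once per line, while the buffered variant needs only a handful of larger writes. The same bytes arrive either way.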

Related

Reading a file in Python: slurp or filter?

I want to compare the effect of processing a stream as a filter (that is, get a bit, process, rinse), against slurping (that is, get all information, then process).
However, when I run the two codes below, I get comparable results. I was expecting to obtain a much worse result in the slurp version.
Are the code snippets below doing anything different from what I described above? If they are equivalent, how could I adapt one of them to test the filter/slurp difference?
I was testing the scripts with:
jot 100000000 | time python3 dont_slurp.py > /dev/null
jot 100000000 | time python3 slurp.py > /dev/null
Jot generates numbers from 1 to x. The code snippets just number the lines.
Filter:
import sys
lineno = 0
for line in sys.stdin:
    lineno += 1
    print("{:>6} {}".format(lineno, line[:-1]))
Slurp:
import sys
f = sys.stdin
lineno = 0
for line in f:
    lineno += 1
    print('{:>6} {}'.format(lineno, line[:-1]))
First of all, your code samples are not doing what you think. All f = sys.stdin does is set f to the same file handle. The lines for line in f: and for line in sys.stdin: are functionally identical.
What you want is this:
import sys
lineno = 0
for line in sys.stdin.readlines():
    lineno += 1
    print('{:>6} {}'.format(lineno, line[:-1]))
readlines() returns a list, one element per line in the file. It is not a generator, so you get the full list in memory. The file handle itself acts as an iterator, giving you one line at a time.
You should see performance differences with readlines().
However, the answer to "which is better?" is "it depends". When you read line by line, the reads ultimately turn into system calls (Python buffers them, so there is not one call per line), which in turn cause the OS to read file contents off the disk in blocks. These blocks are likely larger than the average line, and the block is likely cached. That means sometimes you hit the disk, taking lots of time, and other times you hit the cache, taking little time.
When you read all at once, you load every byte from the file into memory at once. If you have enough free memory to hold all file contents, then this takes exactly the same amount of time as the line-by-line version. In both cases, it is basically just the time required to read the whole file sequentially with some little bit of overhead.
The difference is the case where you don't have enough free memory to hold the entire file. In that case, you read the whole file, but parts of it get swapped back out to disk by the virtual memory system. They then have to get pulled in again when you access that particular line.
Exactly how much time is lost depends on how much memory is in use, how much other activity is going on on your system, etc., so it can't be quantified in general.
This is a case where you honestly shouldn't worry about it until there is a problem. Do what is more natural in the code and only worry about performance if your program is too slow.
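The difference between the two styles discussed in this answer can be sketched with an in-memory stream. This is a minimal illustration, assuming `io.StringIO` as a stand-in for `sys.stdin`; the function names are hypothetical. The streaming version holds one line at a time, while the slurping version materializes the whole list first.

```python
import io


def number_lines_streaming(stream):
    """Filter style: yield each numbered line as soon as it is read."""
    for lineno, line in enumerate(stream, start=1):
        yield "{:>6} {}".format(lineno, line.rstrip("\n"))


def number_lines_slurped(stream):
    """Slurp style: readlines() builds the complete list before processing."""
    lines = stream.readlines()
    return ["{:>6} {}".format(lineno, line.rstrip("\n"))
            for lineno, line in enumerate(lines, start=1)]


sample = "alpha\nbeta\ngamma\n"
streamed = list(number_lines_streaming(io.StringIO(sample)))
slurped = number_lines_slurped(io.StringIO(sample))
print(streamed == slurped)  # both styles produce the same numbered lines
```

The output is identical; only peak memory use differs, which is why timing alone may not separate the two until the input no longer fits comfortably in memory.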

Incremental text file processing for parallel processing

This is my first experience with the Julia language, and I'm quite surprised by its simplicity.
I need to process big files, where each line is composed of a set of tab-separated strings. As a first example, I started with a simple count program; I managed to use @parallel with the following code:
d = open(f)
lis = readlines(d)
ntrue = @parallel (+) for li in lis
    contains(li,s)
end
println(ntrue)
close(d)
end
I compared the parallel approach against a simple "serial" one with a 3.5GB file (more than 1 million lines). On a 4-core Intel Xeon E5-1620 (3.60GHz) with 32GB of RAM, what I got is:
Parallel = 10.5 seconds; Serial = 12.3 seconds; Allocated Memory = 5.2 GB
My first concern is about memory allocation; is there a better way to read the file incrementally in order to lower the memory allocation, while preserving the benefits of parallelizing the processing?
Secondly, since the CPU gain from using @parallel is not astonishing, I'm wondering whether that is due to this specific case or to my naive use of Julia's parallel features. In the latter case, what would be the right approach to follow? Thanks for the help!
Your program is reading all of the file into memory as a large array of strings at once. You may want to try a serial version that processes the lines one at a time instead (i.e. streaming):
const s = "needle" # it's important for this to be const
open(f) do d
    ntrue = 0
    for li in eachline(d)
        ntrue += contains(li,s)
    end
    println(ntrue)
end
This avoids allocating an array to hold all of the strings and avoids allocating all of the string objects at once, allowing the program to reuse the same memory by periodically reclaiming it during garbage collection. You may want to try this and see if it improves the performance sufficiently for you. The fact that s is const is important since it allows the compiler to predict the types in the for loop body, which isn't possible if s could change value (and thus type) at any time.
If you still want to process the file in parallel, you will have to open the file in each worker and advance each worker's read cursor (using the seek function) to an appropriate point in the file to start reading lines. Note that you'll have to be careful to avoid reading in the middle of a line and you'll have to make sure each worker does all of the lines assigned to it and no more – otherwise you might miss some instances of the search string or double count some of them.
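The seek-based splitting described above can be sketched as follows. This is an illustration in Python rather than Julia, with hypothetical function names, and it runs the byte ranges one after another; in a real parallel version each range would be handed to a separate worker that opens the file itself. The rule assumed here is that a line belongs to the range in which its first byte falls, so no line is skipped or counted twice.

```python
import os


def chunk_ranges(path, nchunks):
    """Split the file into nchunks contiguous byte ranges [start, end)."""
    size = os.path.getsize(path)
    bounds = [size * i // nchunks for i in range(nchunks + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def count_matches(path, start, end, needle):
    """Count lines containing needle among the lines that begin in [start, end)."""
    count = 0
    with open(path, "rb") as fh:
        if start > 0:
            # Step back one byte and read to the next newline. This lands on
            # the first line that begins at or after `start`, whether `start`
            # fell in the middle of a line or exactly on a line boundary.
            fh.seek(start - 1)
            fh.readline()
        while fh.tell() < end:
            line = fh.readline()
            if not line:
                break
            if needle in line:
                count += 1
    return count


def count_in_ranges(path, needle, nchunks):
    """Sum the per-range counts (serially here; one worker per range in practice)."""
    return sum(count_matches(path, start, end, needle)
               for start, end in chunk_ranges(path, nchunks))
```

A line that starts before `end` is read in full even if it extends past `end`, and the next range skips it because it starts reading only after the first newline at or beyond `start - 1`.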
If this workload isn't just an example and you actually just want to count the number of lines in which a certain string occurs in a file, you may just want to use the grep command, e.g. calling it from Julia like this:
julia> s = "boo"
"boo"
julia> f = "/usr/share/dict/words"
"/usr/share/dict/words"
julia> parse(Int, readchomp(`grep -c -F $s $f`))
292
Since the grep command has been carefully optimized over decades to search text files for lines matching certain patterns, it's hard to beat its performance. [Note: if it's possible that zero lines contain the pattern you're looking for, you will want to wrap the grep command in a call to the ignorestatus function since the grep command returns an error status code when there are no matches.]

Fortran implied do write speedup

tl;dr: I found that an "implied do" write was slower than an explicit one under certain circumstances, and want to understand why/if I can improve this.
Details:
I've got a code that does something to the effect of:
DO i=1,n
calculations...
!m, x, and y all change each pass through the loop
IF(m.GT.1)THEN
DO j=1,m
WRITE(10,*)x(j),y(j) !where 10 is an output file
ENDDO
ENDIF
ENDDO
The output file ends up being fairly large, and so it seems like the writing is a big performance factor, so I wanted to optimize it. Before anyone asks, no, moving away from ASCII isn't an option due to various downstream requirements. Accordingly, I rewrote the IF statement (and contents) as:
IF(m.GT.1)THEN
!build format statement for write
WRITE(mm1,*)m-1
mm1=ADJUSTL(mm1)
!implied do write statement
WRITE(10,'('//TRIM(mm1)//'(i9,1x,f7.5/),i9,1x,f7.5)')(x(j),y(j),j=1,m)
ELSEIF(m.EQ.1)THEN
WRITE(10,'(i9,1x,f7.5)')x(1),y(1)
ENDIF
This builds the format statement according to the # of values to be written out, then does a single write statement to output things. I've found that the code actually runs slower with this formulation. For reference, I've seen significant speedup on the same system (hardware and software) when going to an implied do write statement when the amount of data to be written was fixed. Under the assumption that the WRITE statement, itself, is faster, then that would mean the overhead from the couple of lines building that statement are what take the added time, but that seems hard to believe. For reference, m can vary a fair amount, but probably averages at least 1000. Is the concatenation of strings // a very slow operator, or is there something else I'm missing? Thanks in advance.
I haven't specific timing information to add, but your data transfer with an implied do loop is needlessly complicated.
In the first fragment, with the explicit looping, you are writing each pair of numbers to distinct records and you wish to repeat this output with the implied do loop. To do this, you use the slash edit descriptor to terminate each record once a pair has been written.
The needless complexity comes from two areas:
you have distinct cases for one/more than one pair;
for the more-than-one case you construct a format including a "dynamic" repeat count.
As Vladimir F comments you could just use a very large repeat count: it isn't erroneous for an edit descriptor to be processed when there are no more items to be written. The output terminates (successfully) when reaching such a non-matching descriptor. You could, then, just write
WRITE(10,'(*(i9,1x,f7.5/))') (x(j),y(j),j=1,m) ! * replacing a large count
rather than the if construct and the format creation.
Now, this doesn't quite match your first output. As I mentioned above, output termination comes about when a data edit descriptor is reached when there is no corresponding item to output. This means that / will be processed before that happens: you have a final empty record.
The colon edit descriptor is useful here:
WRITE(10,'(*(i9,1x,f7.5,:,/))') (x(j),y(j),j=1,m)
On reaching a : processing stops immediately if there is no remaining output item to process.
But my preferred approach is the far simpler
WRITE(10,'(i9,1x,f7.5)') (x(j),y(j),j=1,m) ! No repeat count
You had the more detailed format to include record termination. However, we have what is known as format reversion: if a format end is reached and more remains to be output then the record is terminated and processing goes back to the start of the format.
Whether these things make your output faster remains to be seen, but they certainly make the code itself much cleaner and clearer.
As a final note, it used to be trendy to avoid additional X editing. If your numbers fit inside the field of width 7 then 1x,f7.5 could be replaced by f8.5 and have the same look: the representation is right-justified in the field. It was claimed that this reduction had performance benefits because of less switching between descriptors.

Fortran unformatted I/O optimization

I'm working on a set of Fortran programs that are heavily I/O bound, and so am trying to optimize this. I've read at multiple places that writing entire arrays is faster than individual elements, i.e. WRITE(10)arr is faster than DO i=1,n; WRITE(10) arr(i); ENDDO. But, I'm unclear where my case would fall in this regard. Conceptually, my code is something like:
OPEN(10,FILE='testfile',FORM='UNFORMATTED')
DO i=1,n
[calculations to determine m values stored in array arr]
WRITE(10) m
DO j=1,m
WRITE(10) arr(j)
ENDDO
ENDDO
But m may change each time through the DO i=1,n loop such that writing the whole array arr isn't an option. So, collapsing the DO loop for writing would end up with WRITE(10) arr(1:m), which isn't the same as writing the whole array. Would this still provide a speed-up to writing, what about reading? I could allocate an array of size m after the calculations, assign the values to that array, write it, then deallocate it, but that seems too involved.
I've also seen differing information on implied DO loop writes, i.e. WRITE(10) (arr(j),j=1,m), as to whether they help/hurt on I/O overhead.
I'm running a couple of tests now, and intend to update with my observations. Other suggestions on applicable
Additional details:
The first program creates a large file, the second reads it. And, no, merging the two programs and keeping everything in memory isn't a valid option.
I'm using unformatted I/O and have access to the Portland Group and gfortran compilers. It's my understanding the PG's is generally faster, so that's what I'm using.
The output file is currently ~600 GB, the codes take several hours to run.
The second program (reading in the file) seems especially costly. I've monitored the system and seen that it's mostly CPU-bound, even when I reduce the code to little more than reading the file, indicating that there is very significant CPU overhead on all the I/O calls when each value is read in one-at-a-time.
Compiler flags: -O3 (high optimization) -fastsse (various performance enhancements, optimized for SSE hardware) -Mipa=fast,inline (enables aggressive inter-procedural analysis/optimization on compiler)
UPDATE
I ran the codes with WRITE(10) arr(1:m) and READ(10) arr(1:m). My tests with these agreed, and showed a reduction in runtime of about 30% for the WRITE code, the output file is also slightly less than half the original's size. For the second code, reading in the file, I made the code do basically nothing but read the file to compare pure read time. This reduced the run time by a factor of 30.
If you use normal unformatted (record-oriented) I/O, you also write a record marker before and after the data itself. So you add eight bytes (usually) of overhead to each data item, which can easily (almost) double the data written to disc if your number is a double precision. The runtime overhead mentioned in the other answers is also significant.
The argument above does not apply if you use unformatted stream.
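The size overhead of record markers can be estimated with a small model. This Python sketch does not produce a real Fortran file; it assumes 4-byte little-endian length markers before and after each record (8 bytes per record, matching the "usually" above) and 8-byte double precision values. The function names are hypothetical.

```python
import struct


def record(payload):
    """Wrap a payload in leading and trailing 4-byte length markers (assumed layout)."""
    marker = struct.pack("<i", len(payload))
    return marker + payload + marker


def per_element_size(values):
    """Bytes written when every value goes into its own record."""
    return sum(len(record(struct.pack("<d", v))) for v in values)


def single_record_size(values):
    """Bytes written when all values share one record."""
    return len(record(struct.pack("<{}d".format(len(values)), *values)))


values = [float(i) for i in range(1000)]
print(per_element_size(values), single_record_size(values))
```

Under these assumptions, one record per value costs 16 bytes per double, while a single record for the whole slice costs 8 bytes per double plus 8 bytes in total, which is consistent with a file shrinking to roughly half its size.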
So, use
WRITE (10) m
WRITE (10) arr(1:m)
For gfortran, this is faster than an implied DO loop (i.e. the solution WRITE (10) (arr(i),i=1,m)).
In the suggested solution, an array descriptor is built and passed to the library with a single call. I/O can then be done much more efficiently, in your case taking advantage of the fact that the data is contiguous.
For the implied DO loop, gfortran issues multiple library calls, with much more overhead. This could be optimized, and is subject of a long-standing bug report, PR 35339, but some complicated corner cases and the presence of a viable alternative have kept this from being optimized.
I would also suggest doing I/O with stream access, not because of the rather insignificant saving in space (see above) but because keeping the leading record marker up to date on writing needs a seek, which is additional effort.
If your data size is very large, above ~ 2^31 bytes, you might run into different behavior with record markers. gfortran uses subrecords in this case (compatible with Intel), but it should just work. I don't know what Portland does in this case.
For reading, of course, you can read m, then allocate an allocatable array, then read the whole array in one READ statement.
The point of avoiding outputting an array by looping over multiple WRITE() operations is to avoid the multiple WRITE() operations. It's not particularly important that the data being output are all the members of the array.
Writing either an array section or a whole array via a single WRITE() operation is a good bet. An implied DO loop cannot be worse than an explicit outer loop, but whether it's any better is a question of compiler implementation. (Though I'd expect the implied-DO to be better than an outer loop.)

Parameter expansion slow for large data sets

If I take the first 1,000 bytes from a file, Bash can replace some characters pretty quickly:
$ cut -b-1000 get_video_info
muted=0&status=ok&length_seconds=24&endscreen_module=http%3A%2F%2Fs.ytimg.com%2F
yts%2Fswfbin%2Fendscreen-vfl4_CAIR.swf&plid=AATWGZfL-Ysy64Mp&sendtmp=1&view_coun
t=3587&author=hye+jeong+Jeong&pltype=contentugc&threed_layout=1&storyboard_spec=
http%3A%2F%2Fi1.ytimg.com%2Fsb%2FLHelEIJVxiE%2Fstoryboard3_L%24L%2F%24N.jpg%7C48
%2327%23100%2310%2310%230%23default%23cTWfBXjxZMDvzL5cyCgHdDJ3s_A%7C80%2345%2324
%2310%2310%231000%23M%24M%23m1lhUvkKk6sTnuyKXnPBojTIqeM%7C160%2390%2324%235%235%
231000%23M%24M%23r-fWFZpjrP1oq2uq_Y_1im4iu2I%7C320%23180%2324%233%233%231000%23M
%24M%23uGg7bth0q6XSYb8odKLRqkNe7ao&approx_threed_layout=1&allow_embed=1&allow_ra
tings=1&url_encoded_fmt_stream_map=fallback_host%3Dtc.v11.cache2.c.youtube.com%2
6quality%3Dhd1080%26sig%3D610EACBDE06623717B1DC2265696B473C47BD28F.98097DEC78411
95A074D6D6EBFF8B277F9C071AE%26url%3Dhttp%253A%252F%252Fr9---sn-q4f7dney.c.youtub
e.com%252Fvideoplayback%253Fms%253Dau%2526ratebypass%253Dyes%2526ipbits%253D8%25
26key%253Dyt1%2526ip%253D99.109.97.214%2
$ read aa < <(cut -b-1000 get_video_info)
$ time set "${aa//%/\x}"
real 0m0.025s
user 0m0.031s
sys 0m0.000s
However, if I take 10,000 bytes, it slows dramatically:
$ read aa < <(cut -b-10000 get_video_info)
$ time set "${aa//%/\x}"
real 0m8.125s
user 0m8.127s
sys 0m0.000s
I read Greg Wooledge’s post but it lacks an explanation as to why Bash parameter expansion is slow.
For the why, you can see the implementation of this code in pat_subst in subst.c in the bash source code.
For each match in the string, the length of the string is counted numerous times (in pat_subst, match_pattern and match_upattern), both as a C string and more expensively as a multibyte string. This makes the function both slower than necessary, and more importantly, quadratic in complexity.
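Why re-measuring the string on every match leads to quadratic cost can be shown with a simple cost model. The Python sketch below is not bash's code: `replace_with_rescans` and `replace_single_pass` are hypothetical names, and "steps" is an assumed unit that charges one step per character scanned.

```python
def replace_with_rescans(text, old, new):
    """Replace every `old`, charging a full-length scan of the text per match."""
    if not old:
        raise ValueError("old must be a non-empty string")
    steps = 0
    pieces = []
    pos = 0
    while True:
        hit = text.find(old, pos)
        if hit < 0:
            break
        steps += len(text)  # models re-measuring the whole string for this match
        pieces.append(text[pos:hit])
        pieces.append(new)
        pos = hit + len(old)
    pieces.append(text[pos:])
    return "".join(pieces), steps


def replace_single_pass(text, old, new):
    """Replace every `old`, charging one pass over the text in total."""
    return text.replace(old, new), len(text)


small = "a%" * 100    # 200 characters, 100 matches
big = "a%" * 1000     # 2,000 characters, 1,000 matches
_, small_steps = replace_with_rescans(small, "%", "\\x")
_, big_steps = replace_with_rescans(big, "%", "\\x")
print(big_steps // small_steps)
```

In this model, ten times more input with proportionally more matches costs a hundred times more steps when the string is rescanned per match, but only ten times more in the single-pass version.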
This is why it's slow for larger input.
As for workarounds, just use sed. It's more likely to be optimized for string replacement operations (though you should be aware that POSIX only guarantees 8192 bytes per line, even though GNU sed handles arbitrarily large ones).
Originally, older shells and other utilities imposed LINE_MAX = 2048 on file input for this kind of reason. For huge variables, bash has no problem parking them in memory. But substitution requires at least two concurrent copies, and lots of thrashing: as groups of characters are removed, whole strings get rewritten. Over and over and over.
There are tools meant for this; sed is a premier choice, and bash is a distant second. sed works on streams, bash works on memory blocks.
Another choice: bash is extensible. You can write custom C code to do things well that bash was not meant to do.
CFA Johnson has good articles on how to do that.
Some ready-to-load builtins:
http://cfajohnson.com/shell/bash/loadables/
DIY builtins explained:
http://cfajohnson.com/shell/articles/dynamically-loadable/
