Reading a file in Python: slurp or filter? - performance

I want to compare the effect of processing a stream as a filter (that is, get a bit, process, rinse, repeat) against slurping (that is, read all the data, then process).
However, when I run the two code snippets below, I get comparable results; I was expecting the slurp version to do much worse.
Are the code snippets below actually doing anything different, as described above? If they are equivalent, how could I adapt one of them to test the filter/slurp difference?
I was testing the scripts with:
jot 100000000 | time python3 dont_slurp.py > /dev/null
jot 100000000 | time python3 slurp.py > /dev/null
jot generates numbers from 1 to x. The code snippets just number the lines.
Filter:
import sys
lineno = 0
for line in sys.stdin:
    lineno += 1
    print("{:>6} {}".format(lineno, line[:-1]))
Slurp:
import sys
f = sys.stdin
lineno = 0
for line in f:
    lineno += 1
    print('{:>6} {}'.format(lineno, line[:-1]))

First of all, your code samples are not doing what you think. All f = sys.stdin does is set f to the same file handle. The lines for line in f: and for line in sys.stdin: are functionally identical.
What you want is this:
import sys
lineno = 0
for line in sys.stdin.readlines():
    lineno += 1
    print('{:>6} {}'.format(lineno, line[:-1]))
readlines() returns a list with one element per line in the file; it is not a generator, so you get the whole file in memory at once. The file handle itself acts like a generator, giving you one line at a time.
You should see performance differences with readlines().
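If you want to see the lazy/eager difference outside of a pipeline, a rough sketch along these lines could be timed against a regular file (the file name is a placeholder; with stdin you would keep using jot and time as above):

import sys
import time

def filter_style(path):
    with open(path) as f:
        for lineno, line in enumerate(f, 1):      # lazy: pull one line at a time
            sys.stdout.write("{:>6} {}".format(lineno, line))

def slurp_style(path):
    with open(path) as f:
        lines = f.readlines()                      # eager: whole file as a list of strings
    for lineno, line in enumerate(lines, 1):
        sys.stdout.write("{:>6} {}".format(lineno, line))

for fn in (filter_style, slurp_style):
    t0 = time.perf_counter()
    fn("numbers.txt")                              # placeholder test file
    print(fn.__name__, time.perf_counter() - t0, file=sys.stderr)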
However, the answer to "which is better?" is "it depends". When you read line by line, you're making a system call, which in turn causes the OS to read file contents off of the disk in blocks. These blocks are likely larger than the size of the average line, and the block is likely cached. That means sometimes you hit the disk, taking lots of time, other times you hit the cache, taking little time.
When you read all at once, you load every byte of the file into memory at once. If you have enough free memory to hold the whole file, this takes essentially the same amount of time as the line-by-line version: in both cases it is basically the time required to read the whole file sequentially, plus a little overhead.
The difference is the case where you don't have enough free memory to hold the entire file. In that case, you read the whole file, but parts of it get swapped back out to disk by the virtual memory system. They then have to get pulled in again when you access that particular line.
Exactly how much time is lost depends on how much memory is in use, how much other activity is going on on your system, etc., so it can't be quantified in general.
This is a case where you honestly shouldn't worry about it until there is a problem. Do what is more natural in the code and only worry about performance if your program is too slow.

Related

Incremental text file processing for parallel processing

This is my first experience with the Julia language, and I'm quite surprised by its simplicity.
I need to process big files, where each line is composed of a set of tab-separated strings. As a first example, I started with a simple count program; I managed to use @parallel with the following code:
d = open(f)
lis = readlines(d)
ntrue = @parallel (+) for li in lis
    contains(li,s)
end
println(ntrue)
close(d)
I compared the parallel approach against a simple "serial" one on a 3.5 GB file (more than 1 million lines). On a 4-core Intel Xeon E5-1620 at 3.60 GHz with 32 GB of RAM, what I got is:
Parallel = 10.5 seconds; Serial = 12.3 seconds; Allocated Memory = 5.2 GB
My first concern is about memory allocation; is there a better way to read the file incrementally in order to lower the memory allocation, while preserving the benefits of parallelizing the processing?
Secondly, since the CPU gain from using @parallel is not astonishing, I'm wondering whether it might be related to the specific case itself, or to my naive use of the parallel features of Julia. In the latter case, what would be the right approach to follow? Thanks for the help!
Your program is reading all of the file into memory as a large array of strings at once. You may want to try a serial version that processes the lines one at a time instead (i.e. streaming):
const s = "needle" # it's important for this to be const
open(f) do d
    ntrue = 0
    for li in eachline(d)
        ntrue += contains(li,s)
    end
    println(ntrue)
end
This avoids allocating an array to hold all of the lines and avoids allocating all of the string objects at once, allowing the program to reuse the same memory by periodically reclaiming it during garbage collection. You may want to try this and see whether it improves the performance sufficiently for you. The fact that s is const is important, since it allows the compiler to predict the types in the for loop body, which isn't possible if s could change value (and thus type) at any time.
If you still want to process the file in parallel, you will have to open the file in each worker and advance each worker's read cursor (using the seek function) to an appropriate point in the file to start reading lines. Note that you'll have to be careful to avoid reading in the middle of a line and you'll have to make sure each worker does all of the lines assigned to it and no more – otherwise you might miss some instances of the search string or double count some of them.
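The bookkeeping for those chunk boundaries is easier to see in code. Here is a rough sketch of the idea in Python rather than Julia (so take it only as an illustration of the offset arithmetic, not of the Julia API); the file name and search string are placeholders. Each worker owns the lines that start inside its byte range, so no line is missed or counted twice:

import os
from multiprocessing import Pool

PATH = "big.tsv"      # placeholder input file
NEEDLE = b"needle"    # placeholder search string (bytes, since the file is opened in binary mode)

def count_chunk(bounds):
    start, end = bounds
    n = 0
    with open(PATH, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()          # finish the line owned by the previous chunk
        while f.tell() < end:     # a line that starts inside [start, end) belongs to this chunk
            line = f.readline()
            if not line:
                break
            n += NEEDLE in line
    return n

def chunk_bounds(path, nworkers):
    size = os.path.getsize(path)
    step = max(1, size // nworkers)
    edges = list(range(0, size, step)) + [size]
    return list(zip(edges[:-1], edges[1:]))

if __name__ == "__main__":
    with Pool(4) as pool:
        print(sum(pool.map(count_chunk, chunk_bounds(PATH, 4))))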
If this workload isn't just an example and you actually just want to count the number of lines in which a certain string occurs in a file, you may just want to use the grep command, e.g. calling it from Julia like this:
julia> s = "boo"
"boo"
julia> f = "/usr/share/dict/words"
"/usr/share/dict/words"
julia> parse(Int, readchomp(`grep -c -F $s $f`))
292
Since the grep command has been carefully optimized over decades to search text files for lines matching certain patterns, it's hard to beat its performance. [Note: if it's possible that zero lines contain the pattern you're looking for, you will want to wrap the grep command in a call to the ignorestatus function since the grep command returns an error status code when there are no matches.]

Fortran implied do write speedup

tl;dr: I found that an "implied do" write was slower than an explicit one under certain circumstances, and want to understand why/if I can improve this.
Details:
I've got a code that does something to the effect of:
DO i=1,n
   ! calculations ...
   ! m, x, and y all change each pass through the loop
   IF(m.GT.1)THEN
      DO j=1,m
         WRITE(10,*)x(j),y(j) ! where 10 is an output file
      ENDDO
   ENDIF
ENDDO
The output file ends up being fairly large, and so it seems like the writing is a big performance factor, so I wanted to optimize it. Before anyone asks, no, moving away from ASCII isn't an option due to various downstream requirements. Accordingly, I rewrote the IF statement (and contents) as:
IF(m.GT.1)THEN
   ! build format statement for write
   WRITE(mm1,*)m-1
   mm1=ADJUSTL(mm1)
   ! implied do write statement
   WRITE(10,'('//TRIM(mm1)//'(i9,1x,f7.5/),i9,1x,f7.5)')(x(j),y(j),j=1,m)
ELSEIF(m.EQ.1)THEN
   WRITE(10,'(i9,1x,f7.5)')x(1),y(1)
ENDIF
This builds the format statement according to the number of values to be written out, then does a single write statement to output everything. I've found that the code actually runs slower with this formulation. For reference, I've seen significant speedup on the same system (hardware and software) when going to an implied do write statement when the amount of data to be written was fixed. Under the assumption that the WRITE statement itself is faster, that would mean the overhead from the couple of lines building the format is what takes the added time, but that seems hard to believe. For reference, m can vary a fair amount, but probably averages at least 1000. Is string concatenation with // a very slow operation, or is there something else I'm missing? Thanks in advance.
I don't have specific timing information to add, but your data transfer with an implied do loop is needlessly complicated.
In the first fragment, with the explicit looping, you are writing each pair of numbers to distinct records and you wish to repeat this output with the implied do loop. To do this, you use the slash edit descriptor to terminate each record once a pair has been written.
The needless complexity comes from two areas:
you have distinct cases for one/more than one pair;
for the more-than-one case you construct a format including a "dynamic" repeat count.
As Vladimir F comments you could just use a very large repeat count: it isn't erroneous for an edit descriptor to be processed when there are no more items to be written. The output terminates (successfully) when reaching such a non-matching descriptor. You could, then, just write
WRITE(10,'(*(i9,1x,f7.5/))') (x(j),y(j),j=1,m) ! * replacing a large count
rather than the if construct and the format creation.
Now, this doesn't quite match your first output. As I mentioned above, output termination comes about when a data edit descriptor is reached when there is no corresponding item to output. This means that / will be processed before that happens: you have a final empty record.
The colon edit descriptor is useful here:
WRITE(10,'(*(i9,1x,f7.5,:,/))') (x(j),y(j),j=1,m)
On reaching a : processing stops immediately if there is no remaining output item to process.
But my preferred approach is the far simpler
WRITE(10,'(i9,1x,f7.5)') (x(j),y(j),j=1,m) ! No repeat count
You had the more detailed format to include record termination. However, we have what is known as format reversion: if a format end is reached and more remains to be output then the record is terminated and processing goes back to the start of the format.
Whether these things make your output faster remains to be seen, but they certainly make the code itself much cleaner and clearer.
As a final note, it used to be trendy to avoid additional X editing. If your numbers fit inside a field of width 7, then 1x,f7.5 could be replaced by f8.5 and have the same look: the representation is right-justified in the field. It was claimed that this reduction had performance benefits because of less switching between descriptors.

Fortran unformatted I/O optimization

I'm working on a set of Fortran programs that are heavily I/O bound, and so I'm trying to optimize this. I've read in multiple places that writing entire arrays is faster than individual elements, i.e. WRITE(10) arr is faster than DO i=1,n; WRITE(10) arr(i); ENDDO. But I'm unclear where my case would fall in this regard. Conceptually, my code is something like:
OPEN(10,FILE='testfile',FORM='UNFORMATTED')
DO i=1,n
   ! [calculations to determine the m values stored in array arr]
   WRITE(10) m
   DO j=1,m
      WRITE(10) arr(j)
   ENDDO
ENDDO
But m may change each time through the DO i=1,n loop such that writing the whole array arr isn't an option. So, collapsing the DO loop for writing would end up with WRITE(10) arr(1:m), which isn't the same as writing the whole array. Would this still provide a speed-up to writing, what about reading? I could allocate an array of size m after the calculations, assign the values to that array, write it, then deallocate it, but that seems too involved.
I've also seen differing information on implied DO loop writes, i.e. WRITE(10) (arr(j),j=1,m), as to whether they help/hurt on I/O overhead.
I'm running a couple of tests now and intend to update with my observations. Other applicable suggestions are welcome.
Additional details:
The first program creates a large file, the second reads it. And, no, merging the two programs and keeping everything in memory isn't a valid option.
I'm using unformatted I/O and have access to the Portland Group and gfortran compilers. It's my understanding that the Portland Group compiler is generally faster, so that's what I'm using.
The output file is currently ~600 GB, the codes take several hours to run.
The second program (reading in the file) seems especially costly. I've monitored the system and seen that it's mostly CPU-bound, even when I reduce the code to little more than reading the file, indicating that there is very significant CPU overhead on all the I/O calls when each value is read in one-at-a-time.
Compiler flags: -O3 (high optimization), -fastsse (various performance enhancements, optimized for SSE hardware), -Mipa=fast,inline (enables aggressive inter-procedural analysis/optimization in the compiler).
UPDATE
I ran the codes with WRITE(10) arr(1:m) and READ(10) arr(1:m). My tests with these agreed, and showed a reduction in runtime of about 30% for the WRITE code, the output file is also slightly less than half the original's size. For the second code, reading in the file, I made the code do basically nothing but read the file to compare pure read time. This reduced the run time by a factor of 30.
If you use normal unformatted (record-oriented) I/O, you also write a record marker before and after the data itself. So you add eight bytes (usually) of overhead to each data item, which can easily (almost) double the data written to disc if your numbers are double precision. The runtime overhead mentioned in the other answers is also significant.
The argument above does not apply if you use unformatted stream.
So, use
WRITE (10) m
WRITE (10) arr(1:m)
For gfortran, this is faster than an implied DO loop (i.e. the solution WRITE (10) (arr(i),i=1,m)).
In the suggested solution, an array descriptor is built and passed to the library with a single call. I/O can then be done much more efficiently, in your case taking advantage of the fact that the data is contiguous.
For the implied DO loop, gfortran issues multiple library calls, with much more overhead. This could be optimized, and is subject of a long-standing bug report, PR 35339, but some complicated corner cases and the presence of a viable alternative have kept this from being optimized.
I would also suggest doing I/O in stream access, not because of the rather insignificant saving in space (see above) but because keeping the leading record marker up to date on writing needs a seek, which is additional effort.
If your data size is very large, above ~ 2^31 bytes, you might run into different behavior with record markers. gfortran uses subrecords in this case (compatible to Intel), but it should just work. I don't know what Portland does in this case.
For reading, of course, you can read m, then allocate an allocatable array, then read the whole array in one READ statement.
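To make the record-marker layout concrete, here is a rough Python sketch of how a file written with the two WRITE statements above could be decoded. It assumes the common convention of 4-byte length markers before and after each record, a default 4-byte integer for m, 8-byte reals in arr, and little-endian byte order; it ignores the subrecord scheme mentioned above for records larger than 2^31 bytes:

import struct
import numpy as np   # assumes arr holds 8-byte (double precision) reals

def read_record(f):
    head = f.read(4)
    if not head:
        return None                         # end of file
    (nbytes,) = struct.unpack("<i", head)   # leading marker: record length in bytes
    payload = f.read(nbytes)
    f.read(4)                               # trailing marker (same length), discarded
    return payload

with open("testfile", "rb") as f:
    while True:
        rec = read_record(f)
        if rec is None:
            break
        (m,) = struct.unpack("<i", rec)                    # the WRITE(10) m record
        arr = np.frombuffer(read_record(f), dtype="<f8")   # the WRITE(10) arr(1:m) record
        assert arr.size == m

Counting the bytes this reader skips over (eight per record) is exactly the overhead described above.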
The point of not writing an array by looping over multiple WRITE() operations is to avoid the overhead of the multiple WRITE() operations themselves. It's not particularly important that the data being output are all the members of the array.
Writing either an array section or a whole array via a single WRITE() operation is a good bet. An implied DO loop cannot be worse than an explicit outer loop, but whether it's any better is a question of compiler implementation. (Though I'd expect the implied-DO to be better than an outer loop.)

How to deal with a big text file (about 300M)

There's a text file (about 300 MB) and I need to count the ten most often occurring words (some stop words are excluded). The test machine has 8 cores and runs Linux; any programming language is welcome, and only open-source frameworks may be used (Hadoop is not an option). I don't have any multithreaded programming experience. Where can I start from, and how can I arrive at a solution that costs as little time as possible?
300 MB is not a big deal; your task is a matter of seconds, even for single-core processing in a high-level interpreted language like Python, if you do it right. Python has the advantage that it makes your word-counting program very easy to code and debug compared to many lower-level languages. If you still want to parallelize (even though it will only take a matter of seconds to run single-core in Python), I'm sure somebody can post a quick and easy way to do it.
How to solve this problem with good scalability:
The problem can be solved by 2 map-reduce steps:
Step 1:
map(word):
    emit(word, 1)
Combine + Reduce(word, list<k>):
    emit(word, sum(list))
After this step you have a list of (word, #occurrences) pairs.
Step 2:
map(word, k):
    emit(word, k)
Combine + Reduce(word, k): // not a list, because each word has only 1 entry
    find the top 10 and yield (word, k) for them // see Appendix 1 for details
In step 2 you must use a single reducer. The problem is still scalable, because the single reducer has only 10 * #mappers entries as input.
Solution for 300 MB file:
Practically, 300 MB is not such a large file, so you can just build a histogram (in memory, with a tree- or hash-based map), and then output the top k values from it.
Using a map that supports concurrency, you can split the file into parts and let each thread update the map as it goes; a sketch of one such approach follows. Note that whether the file can actually be split efficiently is filesystem-dependent, and sometimes a linear scan by one thread is unavoidable.
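One simple variant (sidestepping the question of splitting the file itself) is to read lines in the main process and farm batches out to worker processes, merging their counts at the end. A rough Python sketch, in which the file name, batch size, and stop-word list are placeholders:

from collections import Counter
from itertools import islice
from multiprocessing import Pool
import re

STOP = {"the", "a", "an", "and", "of", "to"}   # placeholder stop-word list

def count_batch(lines):
    c = Counter()
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            if word not in STOP:
                c[word] += 1
    return c

def batches(f, size=100000):
    while True:
        chunk = list(islice(f, size))
        if not chunk:
            return
        yield chunk

if __name__ == "__main__":
    total = Counter()
    with open("input.txt") as f, Pool(8) as pool:      # placeholder file, 8 cores
        for partial in pool.imap_unordered(count_batch, batches(f)):
            total.update(partial)
    print(total.most_common(10))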
Appendix1:
How to get top k:
Use a min heap and iterate over the elements; the min heap will contain the highest k elements seen so far at all times.
Fill the heap with the first k elements.
For each remaining element e:
    if e > min of heap:
        remove the smallest element from the heap, and add e instead.
Also, more details in this thread
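In Python this min-heap bookkeeping could look roughly like the sketch below, where counts stands for the word-to-occurrences map built in step 1 (heapq keeps the smallest entry at heap[0]):

import heapq

def top_k(counts, k=10):
    heap = []                                    # min-heap of (count, word) pairs
    for word, n in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (n, word))      # fill the heap with the first k elements
        elif n > heap[0][0]:                     # e > min of heap
            heapq.heapreplace(heap, (n, word))   # drop the smallest, add e instead
    return sorted(heap, reverse=True)            # highest counts first

print(top_k({"foo": 3, "bar": 7, "baz": 1, "qux": 5}, k=2))   # [(7, 'bar'), (5, 'qux')]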
Assuming that you have 1 word per line, you can do the following in Python:
from collections import Counter
FILE = 'test.txt'
count = Counter()
with open(FILE) as f:
    for w in f.readlines():
        count[w.rstrip()] += 1
print(count.most_common()[0:10])
Read the file and create a map [word, count]: as you read, every occurring word becomes a key and its value is the number of times that word has occurred so far.
Any language should do the job.
After reading the File once, you have the map.
Then iterate through the map and remember the ten words with the highest count values.

Do comments affect Perl performance?

I'm optimizing some frequently run Perl code (once per day per file).
Do comments slow Perl scripts down? My experiments lean towards no:
use Benchmark;
timethese(20000000, {
    'comments' => '$b=1;
# comment ... (100 times)
',
    'nocomments' => '$b=1;'
});
Gives pretty much identical values (apart from noise).
Benchmark: timing 10000000 iterations of comments, nocomments...
  comments:  1 wallclock secs ( 0.53 usr +  0.00 sys =  0.53 CPU) @ 18832391.71/s (n=10000000)
nocomments:  0 wallclock secs ( 0.44 usr +  0.00 sys =  0.44 CPU) @ 22935779.82/s (n=10000000)
Benchmark: timing 20000000 iterations of comments, nocomments...
  comments:  0 wallclock secs ( 0.86 usr + -0.01 sys =  0.84 CPU) @ 23696682.46/s (n=20000000)
nocomments:  1 wallclock secs ( 0.90 usr +  0.00 sys =  0.90 CPU) @ 22099447.51/s (n=20000000)
I get similar results if I run the comments and no-comments versions as separate Perl scripts.
It seems counter-intuitive though, if nothing else the interpreter needs to read the comments into memory every time.
Runtime performance? No.
Parsing and lexing performance? Yes, of course.
Since Perl tends to parse and lex on the fly, then comments will affect "start up" performance.
Will they affect it noticeably? Unlikely.
Perl is a just-in-time compiled language, so comments and POD have no effect on run-time performance.
Comments and POD have a minuscule effect on compile-time, but they're so easy and fast for Perl to parse it's almost impossible to measure the performance hit. You can see this for yourself by using the -c flag to just compile.
On my MacBook, a Perl program with 2 statements and 1000 lines of 70-character comments takes the same time to compile as one with 1000 lines of empty comments, and the same as one with just the 2 print statements. Be sure to run each benchmark twice to allow your OS to cache the file; otherwise what you're benchmarking is the time to read the file from disk.
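If you want to try that measurement yourself, a rough sketch along these lines should do, using Python only as a throwaway driver (file names and contents are arbitrary):

import subprocess
import time

with open("with_comments.pl", "w") as f:
    f.write('print "hello\\n";\nprint "bye\\n";\n')
    f.write("# a comment line padded out to roughly seventy characters .........\n" * 1000)

with open("no_comments.pl", "w") as f:
    f.write('print "hello\\n";\nprint "bye\\n";\n')

for script in ("with_comments.pl", "no_comments.pl"):
    subprocess.run(["perl", "-c", script], capture_output=True)   # warm the OS file cache
    t0 = time.perf_counter()
    subprocess.run(["perl", "-c", script], capture_output=True)   # compile only, don't run
    print(script, time.perf_counter() - t0)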
If startup time is a problem for you, it's not because of comments and POD.
Perl compiles a script and then executes it. Comments marginally slow the compile phase, but have zero effect on the run phase.
Perl is not a scripting language in the same sense that shell scripts are. The interpreter does not read the file line by line. The execution of a Perl program is done in two basic stages: compilation and runtime [1]. During the compilation stage the source code is parsed and converted into bytecode. During the runtime stage the bytecode is executed on a virtual machine.
Comments will slow down the parsing stage but the difference is negligible compared to the time required to parse the script itself (which is already very small for most programs). About the only time you're really concerned with parsing time is in a webserver environment where the program could be called many times per second. mod_perl exists to solve this problem.
You're using Benchmark. That's good! You should be looking for ways to improve the algorithm -- not micro-optimizing. Devel::DProf might be helpful to find any hot spots. You absolutely should not strip comments in a misguided attempt to make your program faster. You'll just make it unmaintainable.
[1] This is commonly called "just in time" compilation. Perl actually has several more stages like INIT and END that don't matter here.
The point is: optimize bottlenecks. Reading in a file consists of:
opening the file,
reading in its contents,
closing the file,
parsing the contents.
Of these steps, reading is the fastest part by far (I am not sure about closing; it is a syscall, but you don't have to wait for it to finish). Even if reading were 10% of the whole thing (which it is not, I think), then reducing it by half would only give a 5% performance improvement, at the cost of missing comments (which is a very bad thing). For the parser, throwing away a line that begins with # is not a tangible slowdown. And after that, the comments are gone, so there can be no slowdown.
Now, imagine that you could actually improve the "reading in the script" part by 5% through stripping all comments (which is a really optimistic estimate, see above). How big is the share of "reading in the script" in the overall time consumption of the script? It depends on how much the script does, of course, but since Perl scripts usually read at least one more file, it is 50% at most; and since Perl scripts usually do quite a bit more, an honest estimate brings this down to something in the range of 1%. So the expected efficiency improvement from stripping all comments is at most (very optimistically) 2.5%, but realistically closer to 0.05%. And the scripts where it actually gives more than 1% are already fast, since they do almost nothing, so you are again optimizing at the wrong point.
Concluding, optimize bottlenecks.
The Benchmark module is useless in this case. It's only measuring the time to run the code over and over again. Since your code doesn't actually do anything, most of it is optimized away. That's why you're seeing it run 22 million times a second.
I have almost an entire chapter about this in Mastering Perl. The error of measurement in the Benchmark technique is about 7%. Your benchmark numbers are well within that, so there's virtually no difference.
From Paul Tomblin's comment:
Doesn't perl do some sort of on-the-fly compilation? Maybe the comments get discarded early?
Yes, Perl does.
It is a programming language in between compiled and interpreted. The code gets compiled on the fly and then run. The comments usually don't make any difference. The most they would probably affect is when the file is initially parsed line by line and precompiled; you might see a nanosecond difference.
I would expect that the one comment would only get parsed once, not multiple times in the loop, so I doubt it is a valid test.
I would expect that comments would slightly slow compilation, but I expect it would be too minor to bother removing them.
Do Perl comments slow a script down? Well, parsing it, yes. Executing it after parsing it? No. How often is a script parsed? Only once, so if you have a comment within a for loop, the comment is discarded by the parser once, before the script even runs. Once it starts running, the comment is already gone (and the script is not stored as a script internally by Perl), so no matter how many times the for loop repeats, the comment won't have an influence. How fast can the parser skip over comments? The way Perl comments are written, very fast, so I doubt you would notice. You would notice a higher start-up time if you had 5 lines of code with 1 million lines of comments between each line... but how likely is that, and of what use would a comment that large be?

Resources