Incremental text file processing for parallel processing

This is my first experience with the Julia language, and I'm quite surprised by its simplicity.
I need to process big files, where each line consists of a set of tab-separated strings. As a first example, I started with a simple count program; I managed to use @parallel with the following code:
d = open(f)
lis = readlines(d)
ntrue = @parallel (+) for li in lis
    contains(li,s)
end
println(ntrue)
close(d)
I compared the parallel approach against a simple "serial" one with a 3.5GB file (more than 1 million lines). On a 4-core Intel Xeon E5-1620, 3.60GHz, with 32GB of RAM, what I got is:
Parallel = 10.5 seconds; Serial = 12.3 seconds; Allocated Memory = 5.2 GB;
My first concern is about memory allocation; is there a better way to read the file incrementally in order to lower the memory allocation, while preserving the benefits of parallelizing the processing?
Secondly, since the CPU gain related to the use of @parallel is not astonishing, I'm wondering whether it might be related to the specific case itself, or to my naive use of Julia's parallel features. In the latter case, what would be the right approach to follow? Thanks for the help!

Your program is reading all of the file into memory as a large array of strings at once. You may want to try a serial version that processes the lines one at a time instead (i.e. streaming):
const s = "needle" # it's important for this to be const
open(f) do d
    ntrue = 0
    for li in eachline(d)
        ntrue += contains(li,s)
    end
    println(ntrue)
end
This avoids allocating an array to hold all of the strings and avoids allocating all of the string objects at once, allowing the program to reuse the same memory by periodically reclaiming it during garbage collection. You may want to try this and see if that improves the performance sufficiently for you. The fact that s is const is important since it allows the compiler to predict the types in the for loop body, which isn't possible if s could change value (and thus type) at any time.
If you still want to process the file in parallel, you will have to open the file in each worker and advance each worker's read cursor (using the seek function) to an appropriate point in the file to start reading lines. Note that you'll have to be careful to avoid reading in the middle of a line and you'll have to make sure each worker does all of the lines assigned to it and no more – otherwise you might miss some instances of the search string or double count some of them.
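For what it's worth, here is a rough sketch of that chunking bookkeeping, written in Python rather than Julia purely for illustration (the file path, search string, and worker count are placeholders): each worker seeks to its byte offset, discards any line that straddles its starting boundary (the previous worker finishes that line), and stops once it has consumed every line that starts inside its chunk.

import os
from multiprocessing import Pool

def count_in_chunk(args):
    # Count lines containing `needle` whose first byte lies in [start, end).
    path, start, end, needle = args
    n = 0
    with open(path, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()              # finish (and discard) any line straddling the boundary
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if needle in line:
                n += 1
    return n

def parallel_count(path, needle, workers=4):
    size = os.path.getsize(path)
    bounds = [size * i // workers for i in range(workers + 1)]
    jobs = [(path, bounds[i], bounds[i + 1], needle.encode()) for i in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(count_in_chunk, jobs))

The same scheme carries over to Julia workers: compute byte offsets from the file size, seek each worker to its offset, and align to the next newline before counting.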
If this workload isn't just an example and you actually want to count the number of lines in which a certain string occurs in a file, you may want to use the grep command, e.g. calling it from Julia like this:
julia> s = "boo"
"boo"
julia> f = "/usr/share/dict/words"
"/usr/share/dict/words"
julia> parse(Int, readchomp(`grep -c -F $s $f`))
292
Since the grep command has been carefully optimized over decades to search text files for lines matching certain patterns, it's hard to beat its performance. [Note: if it's possible that zero lines contain the pattern you're looking for, you will want to wrap the grep command in a call to the ignorestatus function since the grep command returns an error status code when there are no matches.]

Related

Reading a file in Python: slurp or filter?

I want to compare the effect of processing a stream as a filter (that is, get a bit, process, rinse), against slurping (that is, get all information, then process).
However, when I run the two code snippets below, I get comparable results. I was expecting the slurp version to perform much worse.
Are the code snippets below doing anything different from what I described above? If they are equivalent, how could I adapt one of them to test the filter/slurp difference?
I was testing the scripts with:
jot 100000000 | time python3 dont_slurp.py > /dev/null
jot 100000000 | time python3 slurp.py > /dev/null
jot generates numbers from 1 to x. The code snippets just number the lines.
Filter:
import sys

lineno = 0
for line in sys.stdin:
    lineno += 1
    print("{:>6} {}".format(lineno, line[:-1]))
Slurp:
import sys

f = sys.stdin
lineno = 0
for line in f:
    lineno += 1
    print('{:>6} {}'.format(lineno, line[:-1]))
First of all, your code samples are not doing what you think. All f = sys.stdin does is set f to the same file handle. The lines for line in f: and for line in sys.stdin: are functionally identical.
What you want is this:
import sys

lineno = 0
for line in sys.stdin.readlines():
    lineno += 1
    print('{:>6} {}'.format(lineno, line[:-1]))
readlines() returns a list, one element per line in the file. I believe it is not a generator, so you get the full list. The file handle itself acts as a generator, giving you one line at a time.
You should see performance differences with readlines().
However, the answer to "which is better?" is "it depends". When you read line by line, you're making a system call, which in turn causes the OS to read file contents off of the disk in blocks. These blocks are likely larger than the size of the average line, and the block is likely cached. That means sometimes you hit the disk, taking lots of time, other times you hit the cache, taking little time.
When you read all at once, you load every byte of the file into memory at once. If you have enough free memory to hold all the file contents, then this takes exactly the same amount of time as the line-by-line version. In both cases, it is basically just the time required to read the whole file sequentially, plus a little bit of overhead.
The difference is the case where you don't have enough free memory to hold the entire file. In that case, you read the whole file, but parts of it get swapped back out to disk by the virtual memory system. They then have to get pulled in again when you access that particular line.
Exactly how much time is lost depends on how much memory is in use, how much other activity is going on on your system, etc., so it can't be quantified in general.
This is a case where you honestly shouldn't worry about it until there is a problem. Do what is more natural in the code and only worry about performance if your program is too slow.

Fortran implied do write speedup

tl;dr: I found that an "implied do" write was slower than an explicit one under certain circumstances, and want to understand why/if I can improve this.
Details:
I've got a code that does something to the effect of:
DO i=1,n
  calculations...
  !m, x, and y all change each pass through the loop
  IF(m.GT.1)THEN
    DO j=1,m
      WRITE(10,*)x(j),y(j) !where 10 is an output file
    ENDDO
  ENDIF
ENDDO
The output file ends up being fairly large, and so it seems like the writing is a big performance factor, so I wanted to optimize it. Before anyone asks, no, moving away from ASCII isn't an option due to various downstream requirements. Accordingly, I rewrote the IF statement (and contents) as:
IF(m.GT.1)THEN
  !build format statement for write
  WRITE(mm1,*)m-1
  mm1=ADJUSTL(mm1)
  !implied do write statement
  WRITE(10,'('//TRIM(mm1)//'(i9,1x,f7.5/),i9,1x,f7.5)')(x(j),y(j),j=1,m)
ELSEIF(m.EQ.1)THEN
  WRITE(10,'(i9,1x,f7.5)')x(1),y(1)
ENDIF
This builds the format statement according to the number of values to be written out, then does a single write statement to output everything. I've found that the code actually runs slower with this formulation. For reference, I've seen significant speedup on the same system (hardware and software) when going to an implied do write statement when the amount of data to be written was fixed. Under the assumption that the WRITE statement itself is faster, that would mean the overhead of the couple of lines building the format statement is what takes the added time, but that seems hard to believe. For reference, m can vary a fair amount, but probably averages at least 1000. Is string concatenation with // a very slow operation, or is there something else I'm missing? Thanks in advance.
I don't have specific timing information to add, but your data transfer with an implied do loop is needlessly complicated.
In the first fragment, with the explicit looping, you are writing each pair of numbers to distinct records and you wish to repeat this output with the implied do loop. To do this, you use the slash edit descriptor to terminate each record once a pair has been written.
The needless complexity comes from two areas:
you have distinct cases for one/more than one pair;
for the more-than-one case you construct a format including a "dynamic" repeat count.
As Vladimir F comments, you could just use a very large repeat count: it isn't erroneous for an edit descriptor to be processed when there are no more items to be written. The output terminates (successfully) when reaching such a non-matching descriptor. You could, then, just write
WRITE(10,'(*(i9,1x,f7.5/))') (x(j),y(j),j=1,m) ! * replacing a large count
rather than the if construct and the format creation.
Now, this doesn't quite match your first output. As I mentioned above, output termination comes about when a data edit descriptor is reached when there is no corresponding item to output. This means that / will be processed before that happens: you have a final empty record.
The colon edit descriptor is useful here:
WRITE(10,'(*(i9,1x,f7.5,:,/))') (x(j),y(j),j=1,m)
On reaching a : processing stops immediately if there is no remaining output item to process.
But my preferred approach is the far simpler
WRITE(10,'(i9,1x,f7.5)') (x(j),y(j),j=1,m) ! No repeat count
You had the more detailed format to include record termination. However, we have what is known as format reversion: if a format end is reached and more remains to be output then the record is terminated and processing goes back to the start of the format.
Whether these things make your output faster remains to be seen, but they certainly make the code itself much cleaner and clearer.
As a final note, it used to be trendy to avoid additional X editing. If your numbers fit inside the field of width 7 then 1x,f7.5 could be replaced by f8.5 and have the same look: the representation is right-justified in the field. It was claimed that this reduction had performance benefits owing to less switching of descriptors.

Fortran unformatted I/O optimization

I'm working on a set of Fortran programs that are heavily I/O bound, and so am trying to optimize this. I've read at multiple places that writing entire arrays is faster than individual elements, i.e. WRITE(10)arr is faster than DO i=1,n; WRITE(10) arr(i); ENDDO. But, I'm unclear where my case would fall in this regard. Conceptually, my code is something like:
OPEN(10,FILE='testfile',FORM='UNFORMATTED')
DO i=1,n
  [calculations to determine m values stored in array arr]
  WRITE(10) m
  DO j=1,m
    WRITE(10) arr(j)
  ENDDO
ENDDO
But m may change each time through the DO i=1,n loop, such that writing the whole array arr isn't an option. So, collapsing the DO loop for writing would end up with WRITE(10) arr(1:m), which isn't the same as writing the whole array. Would this still provide a speed-up to writing? What about reading? I could allocate an array of size m after the calculations, assign the values to that array, write it, then deallocate it, but that seems too involved.
I've also seen differing information on implied DO loop writes, i.e. WRITE(10) (arr(j),j=1,m), as to whether they help/hurt on I/O overhead.
I'm running a couple of tests now, and intend to update with my observations. Other suggestions on applicable optimizations are welcome.
Additional details:
The first program creates a large file, the second reads it. And, no, merging the two programs and keeping everything in memory isn't a valid option.
I'm using unformatted I/O and have access to the Portland Group and gfortran compilers. It's my understanding that PG's is generally faster, so that's what I'm using.
The output file is currently ~600 GB, the codes take several hours to run.
The second program (reading in the file) seems especially costly. I've monitored the system and seen that it's mostly CPU-bound, even when I reduce the code to little more than reading the file, indicating that there is very significant CPU overhead on all the I/O calls when each value is read in one-at-a-time.
Compiler flags: -O3 (high optimization) -fastsse (various performance enhancements, optimized for SSE hardware) -Mipa=fast,inline (enables aggressive inter-procedural analysis/optimization on compiler)
UPDATE
I ran the codes with WRITE(10) arr(1:m) and READ(10) arr(1:m). My tests with these agreed, and showed a reduction in runtime of about 30% for the WRITE code; the output file is also slightly less than half the original's size. For the second code, reading in the file, I made the code do basically nothing but read the file, to compare pure read time. This reduced the run time by a factor of 30.
If you use normal unformatted (record-oriented) I/O, you also write a record marker before and after the data itself. So you add eight bytes (usually) of overhead to each data item, which can easily (almost) double the data written to disc if your number is a double precision. The runtime overhead mentioned in the other answers is also significant.
The argument above does not apply if you use unformatted stream.
So, use
WRITE (10) m
WRITE (10) arr(1:m)
For gfortran, this is faster than an implied DO loop (i.e. the solution WRITE (10) (arr(i),i=1,m)).
In the suggested solution, an array descriptor is built and passed to the library with a single call. I/O can then be done much more efficiently, in your case taking advantage of the fact that the data is contiguous.
For the implied DO loop, gfortran issues multiple library calls, with much more overhead. This could be optimized, and is subject of a long-standing bug report, PR 35339, but some complicated corner cases and the presence of a viable alternative have kept this from being optimized.
I would also suggest doing I/O in stream access, not because of the rather insignificant saving in space (see above) but because keeping the leading record marker up to date on writing requires a seek, which is additional effort.
If your data size is very large, above ~ 2^31 bytes, you might run into different behavior with record markers. gfortran uses subrecords in this case (compatible with Intel), but it should just work. I don't know what Portland does in this case.
For reading, of course, you can read m, then allocate an allocatable array, then read the whole array in one READ statement.
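To make the record-marker overhead concrete, here is a small illustrative sketch, in Python with the struct module, that walks a sequential unformatted file assuming gfortran's default 4-byte record markers and native byte order (the file name is a placeholder, and subrecords for very large records are not handled):

import struct

def read_fortran_records(path):
    # Each sequential record is: 4-byte length, payload, 4-byte length again.
    records = []
    with open(path, "rb") as f:
        while True:
            head = f.read(4)
            if len(head) < 4:
                break
            (nbytes,) = struct.unpack("=i", head)   # leading record marker
            records.append(f.read(nbytes))          # the payload itself
            f.read(4)                               # trailing record marker (same value)
    return records

records = read_fortran_records("testfile")
print(len(records), "records")

Writing one double per record means eight bytes of marker for every eight bytes of data, which is where the near-doubling of file size mentioned above comes from.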
The point of avoiding outputting an array by looping over multiple WRITE() operations is to avoid the multiple WRITE() operations. It's not particularly important that the data being output are all the members of the array.
Writing either an array section or a whole array via a single WRITE() operation is a good bet. An implied DO loop cannot be worse than an explicit outer loop, but whether it's any better is a question of compiler implementation. (Though I'd expect the implied-DO to be better than an outer loop.)

how to deal with a big text file (about 300M)

There's a text file (about 300M) and I need to count the ten most often occurring words (some stop words are excluded). The test machine has 8 cores and a Linux system; any programming language is welcome, but only open-source frameworks may be used (Hadoop is not an option). I don't have any multithreaded programming experience. Where should I start, and how can I deliver a solution that costs as little time as possible?
300M is not a big deal; it's a matter of seconds for your task, even for single-core processing in a high-level interpreted language like Python if you do it right. Python has the advantage that it will make your word-counting program very easy to code and debug, compared to many lower-level languages. If you still want to parallelize (even though it will only take a matter of seconds to run single-core in Python), I'm sure somebody can post a quick-and-easy way to do it.
How to solve this problem with good scalability:
The problem can be solved by 2 map-reduce steps:
Step 1:
map(word):
    emit(word, 1)
Combine + Reduce(word, list<k>):
    emit(word, sum(list))
After this step you have a list of (word, #occurrences).
Step 2:
map(word, k):
    emit(word, k)
Combine + Reduce(word, k): // not a list, because each word has only 1 entry
    find top 10 and yield (word, k) for the top 10 // see Appendix1 for details
In step 2 you must use a single reducer. The problem is still scalable, because the single reducer has only 10 * #mappers entries as input.
Solution for 300 MB file:
Practically, 300MB is not such a large file, so you can just create a histogram (in memory, with a tree/hash-based map), and then output the top k values from it.
Using a map that supports concurrency, you can split the file into parts and let each thread modify the map when it needs to. Note that whether the file can actually be split efficiently is FS dependent, and sometimes a linear scan by one thread is mandatory.
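As a rough sketch of that split-and-count idea (in Python; the file name and worker count are placeholders, the split is done on line boundaries after a single read, and stop-word filtering is omitted), each process builds a partial histogram and the partials are merged at the end:

from collections import Counter
from multiprocessing import Pool

def count_words(lines):
    # Partial histogram for one chunk of the file.
    return Counter(word for line in lines for word in line.split())

def top_ten(path, workers=8):
    with open(path) as f:
        lines = f.readlines()                 # 300 MB fits comfortably in memory
    step = len(lines) // workers + 1
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    with Pool(workers) as pool:
        partials = pool.map(count_words, chunks)
    total = sum(partials, Counter())          # merge the per-chunk histograms
    return total.most_common(10)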
Appendix1:
How to get top k:
Use a min-heap and iterate over the elements; the min-heap will contain the highest k elements at all times.
Fill the heap with the first k elements.
For each remaining element e:
    If e > heap.min():
        remove the smallest element from the heap, and add e instead.
Also, more details in this thread
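A minimal sketch of that min-heap selection using Python's heapq, assuming counts is a word-to-count map built beforehand:

import heapq

def top_k(counts, k=10):
    heap = []                                  # min-heap of (count, word) pairs
    for word, n in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (n, word))
        elif n > heap[0][0]:                   # beats the current minimum
            heapq.heapreplace(heap, (n, word))
    return sorted(heap, reverse=True)          # largest counts first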
Assuming that you have 1 word per line, you can do the following in Python:
from collections import Counter

FILE = 'test.txt'
count = Counter()
with open(FILE) as f:
    for w in f.readlines():
        count[w.rstrip()] += 1
print(count.most_common()[0:10])
Read the file and create a map [word, count], with every occurring word as a key and the number of occurrences of that word as its value, while you read it.
Any language should do the job.
After reading the File once, you have the map.
Then iterate through the map and remember the ten words with the highest count values.
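A minimal sketch of this approach in Python, assuming whitespace-separated words and a placeholder file name, and using a sort over the finished map rather than a running top-ten for brevity:

counts = {}
with open("input.txt") as f:                   # hypothetical file name
    for line in f:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1

# Keep the ten words with the highest counts.
top_ten = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_ten)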

VBA I/O Performance

I'd like to know if there is a performance difference between these two pieces of code:
Open strFile For Output As #fNum
For var1 = 1 To UBound(strvar1)
    For var2 = 1 To UBound(strvar2)
        For var3 = 1 To UBound(strvar3)
            For var4 = 1 To UBound(strvar4)
                Print #fNum, texte
            Next var4
        Next var3
    Next var2
Next var1
Close #fNum
And
For var1 = 1 To UBound(strvar1)
    For var2 = 1 To UBound(strvar2)
        For var3 = 1 To UBound(strvar3)
            For var4 = 1 To UBound(strvar4)
                texteTotal = texteTotal + texte
            Next var4
        Next var3
    Next var2
Next var1
Open strFile For Output As #fNum
Print #fNum, texteTotal
Close #fNum
In my case, the loops are pretty big.
You'll have to try it, because it depends on the size of texte.
Each time you do texteTotal = texteTotal + texte, VBA makes a fresh copy of texteTotal. As texteTotal gets larger and larger, your loop will slow down.
You also run the risk of creating a string larger than VBA can handle.
So:
If you are writing to a network drive, and texte is a single character, the second approach will probably be better.
If you are writing to a fast local disc, and texte is 64kb, and the arrays are 1M entries each, the first approach will be better.
Since you said that texte and texteTotal are strings, I have a couple of suggestions:
1. Always concatenate strings with the & operator.
In VBScript, there are two ways to concatenate (add together) two string variables: the & operator and the + operator. The + operator is normally used to add together two numeric values, but is retained for backwards compatibility with older versions of BASIC that did not have the & operator for strings. Because the & operator is available in VBScript, it is recommended that you always prefer using it to concatenate strings and reserve the + for adding together numeric values. This won't necessarily provide any speed increase, but it eliminates any ambiguity in your code and makes clear your original intention. This is especially important in VBScript where you're working with Variants. When you use the + operator on a string that may contain numeric values, you have no way of knowing whether it will add the two numeric values together or combine the two strings.
2. Remember that string concatenation in VBScript has tremendous overhead and is very inefficient.
Unlike VB.NET and Java, VBScript does not have a StringBuilder class to aid in the creation of large strings. Instead, whenever you repeatedly add things to the end of a string variable, VB will repeatedly copy that variable over and over again. When you're building a string in a loop like you do in the above code, this can really degrade performance as VB constantly allocates space for the new string variable and performs a copy. With each iteration of the loop, the concatenation becomes slower and slower (in geek-speak, you're dealing with an n² algorithm, where n = the number of concatenations). The problem gets even worse if the string exceeds 64K in size. VB can store small strings in a 64K cache, but if a string becomes larger than the cache, performance drops even more. This kind of thing is of course invisible to the programmer for simplicity, but if you're concerned about optimization, it's something to understand is happening in the background.
In light of the above information, let's revisit the two code samples that you posted. You said that `texte` is "not very big but there are hundreds [of] thousands [of] lines." That means you may easily run out of space in the 64K string cache, and eventually you may even run out of space in the RAM allocated to the script. The limit varies, but you can eventually get "Out of Memory" errors, depending on how large your string grows. Even if you're not anywhere near that point now, it's worth considering for the future. If someone goes back later to add functionality to the script, will they remember or bother to change the string concatenation algorithm? Probably not.
To prevent any "Out of Memory" errors from cropping up, you can simply stop keeping the text string in RAM and write it directly to a file instead. This makes even more sense in your case because that's what you're eventually going to do with the string anyway! Why waste CPU cycles and space in memory by continually allocating and reallocating new string variables for each iteration of the loop when you could just write the value to the file and forget about it? I'd say that your first code sample is the simplest and preferred method to accomplish what you want.
The only time that I would consider the second method is if you were dealing with file I/O that was extremely inefficient, such as a disk being accessed over a network connection. As long as you're writing to a local disk, you're not going to have any performance problems here. The other concern, pointed out by @astander, is that the first method leaves the file you're writing to open for a long period of time, placing a lock on that resource. While this is an important concern, I think its implication for your application is minimal, as I assume that you're creating and writing to your own private file that no other application is expected to be able to access.
However, despite my recommendation, the first method is still not the most optimized and efficient way to concatenate strings. The best way would be a VBScript implementation of a StringBuilder class that stores the string in memory in an array of bytes as it is created. Periodically, as you add text to the string, the concatenation class would allocate more space to the array to hold the additional text. This would be much faster than the native VBScript implementation of string concatenation because it would perform these reallocations far less often. Additionally, as the string you're building in memory grows larger, your StringBuilder class could flush the contents of the string in memory to the file, and start over with an empty string. You could use something like Francesco Balena's CString class (http://www.vbcode.com/asp/showsn.asp?theID=415), or Microsoft's example (complete with some benchmarks and a further explanation) available here: http://support.microsoft.com/kb/170964.
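For illustration only, here is a rough sketch of that flush-as-you-go builder idea, written in Python rather than VBA/VBScript (the class name and flush threshold are arbitrary): fragments are collected in a list and written out in batches, so no single huge string is ever rebuilt by repeated concatenation.

class FlushingBuilder:
    def __init__(self, path, flush_every=10000):
        self.fh = open(path, "w")
        self.pieces = []                      # pending fragments, joined on flush
        self.flush_every = flush_every

    def add(self, text):
        self.pieces.append(text)
        if len(self.pieces) >= self.flush_every:
            self.flush()

    def flush(self):
        self.fh.write("".join(self.pieces))   # one join and one write per batch
        self.pieces.clear()

    def close(self):
        self.flush()
        self.fh.close()

A VBA equivalent would keep the fragments in a dynamic array and write them with Join and a single Print # per flush, which is essentially what the buffer-array answer further below does.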
I think the biggest difference would be the period for which you have the file open.
In the second case I would assume that it will be open for a shorter period of time, which is better as you should only ever lock resources for the smallest period of time required.
Cody, thank you very much for your time.
Here is some more information: Code Sample #1 is the current production code. To be more precise, it is one part of the whole process:
1) getting information from DB #1
2) calculations with mathematical formulas (matrices, vectors), N = BIG
3) copying results to txt files (10k+ lines each)
4) psql queries to insert into databases
5) restitution
I am wondering whether the copy to txt files is really necessary and how costly it is compared to the psql insertions. I like your idea of building a custom string class; do you think it can beat the I/O performance?
Ran into this issue recently, where I was writing large amounts of text (~100k lines) to a network file. As each Print command creates I/O activity, the process of writing the file was terribly slow. However, creating a large string by concatenating new lines to it proved to be very slow as well, as explained in the other answers.
I solved this problem by writing the individual lines to a buffer array, then joining this array into a string, which is then written to the file at once.
Based on your example, it would be something like:
Dim buffer() As Variant
Dim i As Long

i = 1
ReDim buffer(1 To UBound(strvar1) * UBound(strvar2) * UBound(strvar3) * UBound(strvar4))

For var1 = 1 To UBound(strvar1)
    For var2 = 1 To UBound(strvar2)
        For var3 = 1 To UBound(strvar3)
            For var4 = 1 To UBound(strvar4)
                buffer(i) = texte
                i = i + 1
            Next var4
        Next var3
    Next var2
Next var1

Open strFile For Output As #fNum
Print #fNum, Join(buffer, vbCrLf)
Close #fNum
This avoids both the overhead of the incremental concatenations (the Join function scales linearly with the number of lines, whereas repeated concatenation scales quadratically) and the I/O overhead of writing many lines individually to a network file.
