Memory usage for a parallel stream from Files.lines() - java-8

I am reading lines from large files (8GB+) using Files.lines(). If processing sequentially it works great, with a very low memory footprint. As soon as I add parallel() to the stream it seems to hang onto the data it is processing perpetually, eventually causing an out of memory exception. I believe this is the result of the Spliterator caching data when trying to split, but I'm not sure. My only idea left is to write a custom Spliterator with a trySplit method that peels off a small amount of data to split instead of trying to split the file in half or more. Has anyone else encountered this?

Tracing through the code, my guess is that the Spliterator used by Files.lines() is Spliterators.IteratorSpliterator, whose trySplit() method has this comment:
/*
* Split into arrays of arithmetically increasing batch
* sizes. This will only improve parallel performance if
* per-element Consumer actions are more costly than
* transferring them into an array. The use of an
* arithmetic progression in split sizes provides overhead
* vs parallelism bounds that do not particularly favor or
* penalize cases of lightweight vs heavyweight element
* operations, across combinations of #elements vs #cores,
* whether or not either are known. We generate
* O(sqrt(#elements)) splits, allowing O(sqrt(#cores))
* potential speedup.
*/
The code then splits into batches in multiples of 1024 records (lines): the first split reads 1024 lines, the next reads 2048 lines, and so on, with each split reading a larger batch than the last.
If your file is really big, it will eventually hit the maximum batch size of 33,554,432, which is 1<<25. Remember that's lines, not bytes, which will probably cause an out-of-memory error, especially once multiple threads each start reading that many.
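To put a rough, hedged number on that (the per-line figures are assumptions, not measurements): with lines of around 100 characters, and Java 8 String overhead of two bytes per character plus object and array headers, call it roughly 250 bytes per line in memory, a single maximum-size batch is on the order of

33,554,432 lines × ~250 bytes/line ≈ 8 GB

and with parallel() several worker threads can each be buffering a batch like that at the same time.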
That also explains the slowdown: those lines are read ahead of time, before the thread can process them.
So I would either not use parallel() at all or, if you must because the computation you are doing per line is expensive, write your own Spliterator that doesn't split like this. Always using a fixed batch of 1024 is probably fine.

As mentioned by dkatzel, this problem is caused by Spliterators.IteratorSpliterator, which batches the elements in your stream; the batch size starts at 1024 elements and grows to 33,554,432 elements.
Another solution is to use the FixedBatchSpliteratorBase proposed in the article "Faster parallel processing in Java using Streams and a spliterator".
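A minimal, untested sketch of what such a fixed-batch spliterator could look like (this is not the FixedBatchSpliteratorBase from the linked article; the class name, method names and batch handling here are made up purely for illustration):

import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Wraps another Spliterator and always splits off a fixed-size batch,
// instead of the arithmetically growing batches produced by
// Spliterators.IteratorSpliterator.
public class FixedBatchSpliterator<T> implements Spliterator<T> {
    private final Spliterator<T> source;
    private final int batchSize;

    public FixedBatchSpliterator(Spliterator<T> source, int batchSize) {
        this.source = source;
        this.batchSize = batchSize;
    }

    @Override
    public Spliterator<T> trySplit() {
        // Pull at most batchSize elements out of the source into an array
        // and hand that array to another worker thread as one batch.
        Object[] batch = new Object[batchSize];
        int[] count = {0};
        while (count[0] < batchSize && source.tryAdvance(t -> batch[count[0]++] = t)) {
            // keep filling the current batch
        }
        if (count[0] == 0) {
            return null;   // source exhausted, nothing left to split off
        }
        return Spliterators.spliterator(batch, 0, count[0], characteristics());
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        return source.tryAdvance(action);
    }

    @Override
    public long estimateSize() {
        return source.estimateSize();
    }

    @Override
    public int characteristics() {
        // Batches are built on the fly, so drop SIZED/SUBSIZED.
        return source.characteristics() & ~(SIZED | SUBSIZED);
    }

    // Convenience: re-wrap an existing stream, e.g. Files.lines(path),
    // as a parallel stream that splits in fixed batches.
    public static <T> Stream<T> parallel(Stream<T> in, int batchSize) {
        return StreamSupport.stream(
                new FixedBatchSpliterator<>(in.spliterator(), batchSize), true);
    }
}

You would then wrap the stream before the terminal operation, e.g. FixedBatchSpliterator.parallel(Files.lines(path), 1024).filter(...).count(), instead of calling parallel() on the stream returned by Files.lines().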

Related

Using a slice instead of list when working with large data volumes in Go

I have a question on the utility of slices in Go. I have just seen Why are lists used infrequently in Go? and Why use arrays instead of slices? but had some questions which I did not see answered there.
In my application:
I read a CSV file containing approx 10 million records, with 23 columns per record.
For each record, I create a struct and put it into a linked list.
Once all records have been read, the rest of the application logic works with this linked list (the processing logic itself is not relevant for this question).
The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need. Also, since I don't know the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Every Go tutorial or article I read seems to suggest that I should use slices and not lists (as a slice can do everything a list can, but do it better somehow). However, I don't see how or why a slice would be more helpful for what I need? Any ideas from anyone?
... approx 10 million records, with 23 columns per record ... The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need.
This contiguous memory is its own benefit as well as its own drawback. Let's consider both parts.
(Note that it is also possible to use a hybrid approach: a list of chunks. This seems unlikely to be very worthwhile here though.)
Also, since I don't know the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Clearly, if there are n records, and you allocate and fill in each one once (using a list), this is O(n).
If you use a slice, and allocate a single extra slice entry every time, you start with none, grow it to size 1, then copy the 1 entry to a new array of size 2 and fill in item #2, grow it to size 3 and fill in item #3, and so on. The first of the n entities is copied n times, the second is copied n-1 times, and so on, for n(n+1)/2 = O(n^2) copies. But if you use a multiplicative expansion technique (which Go's append implementation does), this drops to O(log n) copy operations. Each one copies more bytes, though, and the total work ends up being O(n), amortized (see Why do dynamic arrays have to geometrically increase their capacity to gain O(1) amortized push_back time complexity?).
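As a quick worked example for the doubling case (a simplification of append's actual growth policy, but the shape of the argument is the same): the final re-allocation copies about n/2 elements, the one before it about n/4, and so on, so the total number of elements copied is about

n/2 + n/4 + n/8 + ... < n

which is O(n) overall, i.e. O(1) amortized per appended record.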
The space used with the slice is obviously O(n). The space used for the linked list approach is O(n) as well (though the records now require at least one forward pointer so you need some extra space per record).
So in terms of the time needed to construct the data, and the space needed to hold the data, it's O(n) either way. You end up with the same total memory requirement. The main difference, at first glance anyway, is that the linked-list approach doesn't require contiguous memory.
So: What do we lose when using contiguous memory, and what do we gain?
What we lose
The thing we lose is obvious. If we already have fragmented memory regions, we might not be able to get a contiguous block of the right size. That is, given:
used: 1 MB (starting at base, ending at base+1M)
free: 1 MB (starting at +1M, ending at +2M)
used: 1 MB (etc)
free: 1 MB
used: 1 MB
free: 1 MB
we have a total of 6 MB, 3 used and 3 free. We can allocate three 1 MB blocks, but we can't allocate one 3 MB block unless we can somehow compact the three "used" regions.
Since Go programs tend to run in virtual memory on large-memory-space machines (virtual sizes of 64 GB or more), this tends not to be a big problem. Of course everyone's situation differs, so if you really are VM-constrained, that's a real concern. (Other languages have compacting GC to deal with this, and a future Go implementation could at least in theory use a compacting GC.)
What we gain
The first gain is also obvious: we don't need pointers in each record. This saves some space; the exact amount depends on the size of the pointers, whether we're using singly linked lists, and so on. Let's just assume two 8-byte pointers, or 16 bytes per record. Multiply by 10 million records and we're looking pretty good here: we've saved 160 MB. (Go's container/list implementation uses a doubly linked list, and on a 64-bit machine, this is the size of the per-element threading needed.)
We gain something less obvious at first, though, and it's huge. Because Go is a garbage-collected language, every pointer is something the GC must examine at various times. The slice approach has zero extra pointers per record; the linked-list approach has two. That means that the GC system can avoid examining the nonexistent 20 million pointers (in the 10 million records).
Conclusion
There are times to use container/list. If your algorithm really calls for a list and is significantly clearer that way, do it that way, unless and until it proves to be a problem in practice. Or, if you have items that can be on some collection of lists—items that are actually shared, but some of them are on the X list and some are on the Y list and some are on both—this calls for a list-style container. But if there's an easy way to express something as either a list or a slice, go for the slice version first. Because slices are built into Go, you also get the type safety / clarity mentioned in the first link (Why are lists used infrequently in Go?).

How to fix Run Time Error '7' Out of memory in visual basic 6?

I am trying to ZIP a folder with sub-folders and files in VB6. For that, I read each file and store them one by one in a byte array using ReDim Preserve. But large folders, larger than about 130 MB, throw an Out of Memory error. I have 8 GB of RAM in my PC, so that shouldn't be a problem. So, is this some limitation of Visual Basic 6, that we can't use more than about 150 MB of memory?
'Length of a particular File is determined
lngFileLen = FileLen(a_strFilePath)
DoEvents
If lngFileLen <> 0 Then
m_lngPtr = m_lngPtr + lngFileLen
'Next line Throws error once m_lngPtr reaches around 150 MB
ReDim Preserve arrFileBuffer(1 To m_lngPtr)
First of all, VB6 arrays can only be resized to a maximum of 2,147,483,647 elements. Since that's also the upper limit of a Long in VB6, that's unlikely to be the problem here. However, even though it may be allowed to make an array that big, your code is running in a 32-bit process, so it's still subject to the limit of 2 GB of addressable memory for the whole process. The VB6 run-time has some overhead and uses some of that memory for other things, and since your program is likely doing other things too, that will be using up some memory as well.
In addition to that, when you create an array, the system has to find that number of bytes of contiguous memory. So, even when there is enough memory available, within the 2GB limit, if it's sufficiently fragmented, you can still get out of memory errors. For that reason, creating gigantic arrays is always a concern.
Next, you are using ReDim Preserve, which requires twice the memory. When you resize the array like that, what it actually has to do, under the hood, is create a second array of the new size and then copy all of the other data out of the old array into the new one. Once it's done copying all the data out of the source array, it can then delete it, but while it's performing the copy, it needs to hold both the old array and the new array in memory simultaneously. That means that in a best case scenario, even if there was no other allocated memory or fragmentation, the maximum memory size of an array that you could resize would be 1GB.
Finally, in your example, you never showed what the data type of the array was. If it's an array of bytes, you should be good, I think (the memory size of the array would only be slightly more than its length in elements). However, if, for instance, it's an array of strings or variants, then I believe that's going to require a minimum of 4 bytes per element, thereby more than quadrupling the memory size of the array.

Minimizing global memory reads in OpenCL with vectors?

Suppose my kernel takes 4 (or 3, or 2) unrelated float or double args, or that I want to access 4 separate floats from global memory. Will this cause 4 separate global memory accesses? Is accessing a single vector of 4 floats or doubles faster than accessing 4 separate ones? If so, am I better off packing them into a single vector and then, say, using #defines to reference the individual members?
If this does increase the performance, do I have to do it myself, or might the compiler be smart enough to automatically convert 4 separate float reads into a single vector for me? Is this what "auto-vectorization" is? I've seen auto-vectorization mentioned in a few documents, without detailed explanation of exactly what it does, except that it seems to be an optional performance optimization for CPUs only, not GPUs.
Whether to use vectors depends on the kernel itself. If you need all four values at the same time (for example, at the start of the kernel or at the start of a loop), it's better to pack them, because they will be loaded in a single read (the values in a vector are stored sequentially in memory).
On the other hand, when you only need some of the values, you can speed up execution by reading only what you need.
Another case is when you read them one by one, with each read separated by some computation (i.e. giving the GPU time to fetch the data).
Basically, these read segments behave like a buffer: if you have enough instances in flight, the number of reads is the same (in the optimal case), and what really counts is how well those reads are used.
The compiler will often unpack these structures anyway, so the only real speedup is that all of your variables are stored contiguously: one read fills them all, and the rest of the fetched data serves another instance.
As an example, I will use a 128-bit wide memory bus and 4 floats (32 bits each).
For a float4 vector, one read serves one instance: (4 × 32b) / 128b = 1 instance/read.
For scalar data types, there are N reads (N = number of variables), each read filling one variable for several instances at once: 128b / 32b = 4 instances/read.
So in this example, with 4 instances there will always be at least 4 reads no matter what; the only thing you can do about it is hide the fetch time behind some computation, if that's even possible.

Parallel text processing in julia

I'm trying to write a simple function that reads a series of files and performs some regex search (or just a word count) on them and then return the number of matches, and I'm trying to make this run in parallel to speed it up, but so far I have been unable to achieve this.
If I do a simple loop with a math operation I do get significant performance increases. However, a similar idea for the grep function doesn't provide speed increases:
function open_count(file)
    fh = open(file)
    text = readall(fh)
    length(split(text))
end

tic()
total = 0
for name in files
    total += open_count(string(dir,"/",name))
    total
end
toc()
elapsed time: 29.474181026 seconds
tic()
total = 0
total = @parallel (+) for name in files
    open_count(string(dir,"/",name))
end
toc()
elapsed time: 29.086511895 seconds
I tried different versions but also got no significant speed increases. Am I doing something wrong?
I've had similar problems with R and Python. As others pointed out in the comments, you should start with the profiler.
If the read is taking up the majority of time then there's not much you can do. You can try moving the files to different hard drives and read them in from there.
You can also try a RAM-disk kind of solution, which basically makes part of your RAM look like permanent storage (reducing available RAM), but then you get very fast reads and writes.
However, if the time is spent doing the regex, then consider the following:
Create a function that reads in one file as a whole and splits out separate lines. That should be one continuous read, hence as fast as possible. Then create a parallel version of your regex which processes each line in parallel. This way the whole file is in memory and your computing cores can munge the data at a faster rate. That way you might see some increase in performance.
This is a technique I used when trying to process large text files.
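In Julia the details will differ, but as a rough sketch of that "read the whole file in one go, then spread the regex work over the lines" pattern (shown in Java here, to stay close to the stream examples at the top of this page; the file name and regex are placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

public class ParallelLineCount {
    public static void main(String[] args) throws Exception {
        Pattern pattern = Pattern.compile("foo.*bar");            // placeholder regex

        // One continuous, sequential read of the whole file into memory...
        List<String> lines = Files.readAllLines(Paths.get("big.txt"));

        // ...then the per-line regex work is spread across worker threads.
        long matches = lines.parallelStream()
                            .filter(line -> pattern.matcher(line).find())
                            .count();

        System.out.println(matches + " matching lines");
    }
}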

Sort serial data with buffer

Are there any algorithms to sort data from a serial input using a buffer which is smaller than the data length?
For example, I have 100 bytes of serial data, which can be read only once, and a 40-byte buffer. And I need to print out the sorted bytes.
I need it in Javascript, but any general ideas are appreciated.
This kind of sorting is not possible in a single pass.
Using your example: suppose you have filled your 40 byte buffer, so you need to start printing out bytes in order to make room for the next one. In order to print out sorted data, you must print the smallest byte first. However, if the smallest byte has not been read, you can't possibly print it out yet!
The closest relevant fit to your question may be external sorting algorithms, which take multiple passes in order to sort data that can't fit into memory. That is, if you have peripherals that can store the output of a processing pass, you can sort data larger than your memory in O(log(N/M)) passes, where N is the size of the problem, and M is the size of your memory.
The classic storage peripheral for external sorting is the tape drive; however, the same algorithms work for disk drives (of whatever kind). Also, as cache hierarchies grow in depth, the principles of external sorting become more relevant even for in-memory sorts -- try taking a look at cache-oblivious algorithms.
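To make the external-merge idea concrete, here is a rough two-pass sketch (in Java rather than JavaScript, with error handling and temp-file cleanup omitted; it sorts bytes in Java's signed-byte order):

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Phase 1 sorts buffer-sized runs in memory and spills each run to a temp file;
// phase 2 merges the runs with a priority queue keyed on each run's head byte.
public class ExternalByteSort {

    public static void sort(InputStream in, OutputStream out, int bufferSize) throws IOException {
        // Phase 1: fill the buffer, sort it, write the sorted run out.
        List<Path> runs = new ArrayList<>();
        byte[] buffer = new byte[bufferSize];
        int n;
        while ((n = in.read(buffer)) > 0) {
            byte[] run = Arrays.copyOf(buffer, n);
            Arrays.sort(run);
            Path tmp = Files.createTempFile("run", ".bin");
            Files.write(tmp, run);
            runs.add(tmp);
        }

        // Phase 2: repeatedly emit the smallest head byte among all runs.
        // Compare as signed bytes so the merge order matches Arrays.sort(byte[]).
        PriorityQueue<RunReader> heap = new PriorityQueue<>(
                Comparator.comparingInt((RunReader r) -> (byte) r.head));
        for (Path run : runs) {
            RunReader r = new RunReader(run);
            if (r.advance()) {
                heap.add(r);
            }
        }
        while (!heap.isEmpty()) {
            RunReader r = heap.poll();
            out.write(r.head);
            if (r.advance()) {
                heap.add(r);        // re-insert with its next byte as the new head
            }
        }
        out.flush();
    }

    // Reads one sorted run back, one byte at a time.
    private static final class RunReader {
        private final InputStream in;
        int head;

        RunReader(Path path) throws IOException {
            this.in = new BufferedInputStream(Files.newInputStream(path));
        }

        boolean advance() throws IOException {
            head = in.read();       // -1 when the run is exhausted
            return head >= 0;
        }
    }
}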

Resources