Parallel for loop in Julia

I am aware there are a multitude of questions about running parallel for loops in Julia, using @threads, @distributed, and other methods. I have tried to implement those solutions with no luck. The structure of what I'd like to do is as follows:
for index in list_of_indices
    data = h5read("data_set_$index.h5")
    result = perform_function(data)
    save(result)
end
The data sets are independent, and no part of this loop depends on any other. It seems this should be parallelizable.
I have tried, e.g.,
"#threads for index in list_of_indices..." and I get a segmentation error
"#distributed for index in list_of_indices..." and the code does not actually perform the function on my data.
I assume I'm missing something about how parallel processes work, and any insight would be appreciated.
Here is an MWE:
Assume we have files data_1.h5, data_2.h5, data_3.h5 in our working directory. (I don't know how to make things more self-contained than this, because I think the problem arises from asking multiple threads to read files.)
using Distributed
using HDF5
list = [1,2,3]
Threads.@threads for index in list
    data = h5read("data_$index.h5", "data")
    println(data)
end
The error I get is
signal (11): Segmentation fault
signal (6): Aborted
Allocations: 1587194 (Pool: 1586780; Big: 414); GC: 1
Segmentation fault (core dumped)

As noted by other people, there are not enough details here. However, given the current state of information, the safest code with the highest chance of working is:
using Distributed
addprocs(4)
@everywhere using HDF5
list = [1,2,3]
@sync @distributed for index in list
    data = h5read("data_$index.h5", "data")
    println(data)
end
The distributed approach separates the processes completely, so there is much less chance of doing something wrong (e.g. using a library that relies on a shared resource).
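If each iteration also needs to save its result, a pmap-based variant keeps the read/compute/write for one file on a single worker. This is only a sketch: it reuses the file and dataset names from the MWE, and the data .^ 2 step plus the result_*.h5 output names are stand-ins for the real perform_function and save step.

using Distributed
addprocs(4)

@everywhere using HDF5

@everywhere function process_file(index)
    data = h5read("data_$index.h5", "data")       # same files/dataset as the MWE
    result = data .^ 2                            # placeholder for perform_function
    h5write("result_$index.h5", "result", result) # placeholder for the save step
    return index
end

pmap(process_file, [1, 2, 3])                     # one index per worker at a time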

Related

How to handle for loop with large objects in RStudio?

I have a for loop with large objects. From trial and error, I can only load the large object once. If I load the object again, I get the error "Error: cannot allocate vector of size *** Mb". I tried to overcome this by removing the object at the end of the for loop, but I still get "Error: cannot allocate vector of size 699.2 Mb" at the beginning of the second run of the loop.
My for loop has the following structure:
for (i in 1:22) {
  VeryLargeObject <- ...i...
  ...
  .
  .
  .
  ...
  rm(VeryLargeObject)
}
The VeryLargeObjects range from 2-3 GB each. My PC has 16 GB of RAM, 8 cores, and 64-bit Windows 10.
Any solution on how I can manage to complete the for loop?
The error "cannot allocate..." likely comes from the fact that rm() does not immediately free memory. So the first object still occupies RAM when you load the second one. Objects that are not assigned to any name (variable) anymore get garbage collected by R at time points that R decides for itself.
Most remedies come from not loading the entire object into RAM:
If you are working with a matrix, create a filebacked.big.matrix() with the bigmemory package. Write your data into this object using var[..., ...] syntax like a normal matrix. Then, in a new R session (and a new R script, to preserve reproducibility), you can load this matrix from disk and modify it (see the sketch after this list).
The mmap package uses a similar approach, relying on your operating system's ability to map RAM pages to disk. The data appear to the program as if they were in RAM, but are read from disk; to keep things fast, the operating system keeps the relevant parts in RAM.
If you work with data frames, you can use packages like fst and feather that enable you to load only parts of your data frame into a variable.
Transfer your data frame into a database such as SQLite and then access the database from R. The dbplyr package lets you treat a database as a tidyverse-style data set; you can also send raw SQL commands with the DBI package.
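As a sketch of the bigmemory option (dimensions and file names here are invented; see ?filebacked.big.matrix for the full argument list):

library(bigmemory)

# A matrix whose data live on disk rather than in RAM
m <- filebacked.big.matrix(nrow = 1e6, ncol = 100, type = "double",
                           backingfile = "big.bin",
                           descriptorfile = "big.desc")
m[1:10, 1] <- rnorm(10)          # write with ordinary [row, col] syntax

# Later, possibly in a fresh R session:
library(bigmemory)
m2 <- attach.big.matrix("big.desc")
mean(m2[, 1])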
Another approach is to not write interactively, but to write an R script that processes only one of your objects:
Write an R script, named, say, processBigObject.R, that gets the file name of your big object from the command line using commandArgs():
#!/usr/bin/env Rscript
#
# Process a big object
#
# Usage: Rscript processBigObject.R <FILENAME>
input_filename <- commandArgs(trailing = TRUE)[1]
output_filename <- commandArgs(trailing = TRUE)[2]
# I'm making up function names here, do what you must for your object
o <- readBigObject(input_filename)
s <- calculateSmallerSummaryOf(o)
writeOutput(s, output_filename)
Then, write a shell script or use system2() to call the script multiple times, with different file names. Because R is terminated after each object, the memory is freed:
system2("Rscript", c("processBigObject.R", "bigObject1.dat", "bigObject1_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject2.dat", "bigObject2_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject3.dat", "bigObject3_result.dat"))
...

Safe writing to variable in cython c wrapper within two python processes or distinct memory for python processes

I am creating a wrapper over a C library that receives some financial data, and I want to collect it into a Python data type (a dict with a list of field names and a list of lists of financial data fields).
At the C level there is a function that starts "listening" on a port; when an event arrives, a user-defined callback is invoked. This callback is written in Cython. A simplified example of such a function is here:
cdef void default_listener(const event_data_t* data, int data_count, void* user_data):
    cdef trade_t* trades = <trade_t*>data  # cast received data according to the expected type
    cdef dict py_data = <dict>user_data    # cast user_data back to its original type (a dict in our case)
    for i in range(data_count):
        # append the fields of the received struct
        # to the list in the dict that we passed to the function
        py_data['data'].append([trades[i].price,
                                trades[i].size,
                                ]
                               )
The problem: when there is only one Python process running this function, there are no problems, but if I start another Python process and run the same function, one of the processes is terminated after an undetermined amount of time. I suppose this happens because the two callbacks, called simultaneously in different processes, may try to write to the same part of memory. Could this be the case?
If so, is there any way to prevent the two processes from using the same memory? Or can some lock be established before the Cython code starts to write?
P.S.: I have also read this article, and according to it each Python process gets its own allocated memory that does not intersect with other processes'. But it is unclear to me whether this allocated memory is also what the underlying C functions use, or whether those functions have access to other regions that may intersect.
I'm taking a guess at the answer based on your comment - if it's wrong then I'll delete it, but I think it's likely enough to be right to be worth posting as an answer.
Python has a locking mechanism known as the Global Interpreter Lock (or GIL). This ensures that multiple threads don't attempt to access the same memory simultaneously (including memory internal to Python, that may not be obvious to the user).
Your Cython code will be working on the assumption that its thread holds the GIL. I strongly suspect that this isn't true, and so performing any operations on a Python object will likely cause a crash. One way to deal with this would be to follow this section of documentation in the C code that calls the Cython code. However, I suspect it's easier to handle in Cython.
First tell Cython that the function is "nogil" - it does not require the GIL:
cdef void default_listener(const event_data_t* data, int data_count, void* user_data) nogil:
If you try to compile now it will fail, since you use Python types within the function. To fix this, claim the GIL within your Cython code.
cdef void default_listener(...) nogil:
    with gil:
        default_listener_impl(...)
What I've done is put the implementation in a separate function that does require the GIL (i.e. doesn't have a nogil attached). The reason for this is that you can't put cdef statements in the with gil section (as you say in your comment) - they have to be outside it. However, you can't put cdef dict outside it, because it's a Python object. Therefore a separate function is the easiest solution. The separate function looks almost exactly like default_listener does now.
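Putting the two pieces together, the callback might look roughly like this; a sketch that reuses the struct and dict layout from the question.

cdef void default_listener(const event_data_t* data, int data_count,
                           void* user_data) nogil:
    # No Python objects may be touched out here; re-acquire the GIL first.
    with gil:
        default_listener_impl(data, data_count, user_data)

cdef void default_listener_impl(const event_data_t* data, int data_count,
                                void* user_data):
    # Ordinary (GIL-holding) function, so cdef dict is allowed again.
    cdef trade_t* trades = <trade_t*>data
    cdef dict py_data = <dict>user_data
    for i in range(data_count):
        py_data['data'].append([trades[i].price, trades[i].size])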
It's worth knowing that this isn't a complete locking mechanism; it's really only there to protect the Python internals from being corrupted. An ordinary Python thread will release and regain the GIL periodically and automatically, possibly in the middle of an operation. Cython won't release the GIL unless you tell it to (in this case, at the end of the with gil: block), so it does hold an exclusive lock for that duration. If you need finer control of locking, you may want to look at either a multithreading library or wrapping some C locking library.

Julia: serialize error when sending large objects to workers

I am trying to send data from the master process to worker processes. I am able to do so just fine with relatively small pieces of data. But, as soon as they get above a certain size, I encounter a serialize error.
Is there a way to resolve this, or would I just need to break my objects down into smaller pieces and then reassemble them on the workers? If so, is there a good way to determine ahead of time the max size that I can send (which I suppose may be dependent upon system variables)? Below is code showing a transfer that works and one that fails. It's possible the sizes might need to be tinkered with to reproduce on other systems.
function sendto(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end
X1 = rand(10^5, 10^3);
X2 = rand(10^6, 10^3);
sendto(2, X1 = X1) ## works fine
sendto(2, X2 = X2)
ERROR: write: invalid argument (EINVAL)
in yieldto at /Applications/Julia-0.4.6.app/Contents/Resources/julia/lib/julia/sys.dylib
in wait at /Applications/Julia-0.4.6.app/Contents/Resources/julia/lib/julia/sys.dylib
in stream_wait at /Applications/Julia-0.4.6.app/Contents/Resources/julia/lib/julia/sys.dylib
in uv_write at stream.jl:962
in buffer_or_write at stream.jl:982
in write at stream.jl:1011
in serialize_array_data at serialize.jl:164
in serialize at serialize.jl:181
in serialize at serialize.jl:127
in serialize at serialize.jl:310
in serialize_any at serialize.jl:422
in send_msg_ at multi.jl:222
in remotecall at multi.jl:726
in sendto at none:3
Note: I have plenty of system memory, even for two copies of the larger object, so the problem isn't a lack of memory.
This issue seems to be resolved now with Julia 0.5.
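For reference, here is a version of the question's helper that runs on current Julia (1.x), where eval(Main, expr) has become Core.eval(Main, expr). A sketch only, not a re-test of the original failure.

using Distributed

function sendto(p::Int; args...)
    for (nm, val) in args
        # assign `nm = val` in Main on worker p and wait for it to finish
        remotecall_wait(Core.eval, p, Main, Expr(:(=), nm, val))
    end
end

# usage, as in the question:
# addprocs(2); X2 = rand(10^6, 10^3); sendto(2, X2 = X2)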

Streaming JSON Algorithm - WITHOUT STACK

Is it possible to design a streaming JSON algorithm that writes JSON directly to a socket with the following properties:
* can only write to, but cannot delete or seek within the stream
* does not use either an implicit or explicit stack
* uses only a constant amount of memory (stack depth) no matter how deep the object nesting within the JSON
{"1":{"1":{"1":{"1":{"1":{"1":{"1":{"1":{"1":{"1":{"1":{"1":{"1":{"1":{"1":{"1":{...}}}}}}}}}}}}}}}}}
Short answer: No.
Slightly longer: At least not in the general case. If you could guarantee that the nesting has no branching, you could use a simple counter to close the braces at the end.
No, because you could use such a program to compress infinite amounts of memory into a finite space.
Encoding implementation:
input = read('LIBRARY_OF_CONGRESS.tar.bz2')
input_binary = convert_to_binary(input)
json_opening = replace({'0':'[', '1':'{'}, input_binary)
your_program <INPUTPIPE >/dev/null
INPUTPIPE << json_opening
Execute the above program, then clone the virtual machine it is running on. That is your finite-space compressed version of the infinitely large input data set. Then to decode...
Decoding implementation:
set_output_pipe(your_program, OUTPUTPIPE)
INPUTPIPE << EOL
json_closing << OUTPUTPIPE
output_binary = replace({']':'0', '}':'1'}, reverse(json_closing))
output = convert_from_binary(output_binary)
write(output, 'LIBRARY_OF_CONGRESS-copy.tar.bz2')
And of course, all good code should have a test case...
Test case:
bc 'LIBRARY_OF_CONGRESS.tar.bz2' 'LIBRARY_OF_CONGRESS-copy.tar.bz2'

Ruby Memory Management

I have been using Ruby for a while now and I find, for bigger projects, it can take up a fair amount of memory. What are some best practices for reducing memory usage in Ruby?
Please, let each answer have one "best practice" and let the community vote it up.
When working with huge arrays of ActiveRecord objects, be very careful. If on each iteration of a loop you load their related objects via ActiveRecord's has_many, belongs_to, etc., memory usage grows a lot, because each object that belongs to the array grows...
The following technique helped us a lot (simplified example):
students.each do |student|
  cloned_student = student.clone
  ...
  cloned_student.books.detect {...}
  ca_teachers = cloned_student.teachers.detect {|teacher| teacher.address.state == 'CA'}
  ca_teachers.blah_blah
  ...
  # Not sure if the following is necessary, but we have it just in case...
  cloned_student = nil
end
In the code above, "cloned_student" is the object that grows, but since it is "nullified" at the end of each iteration, this is not a problem for a huge array of students. If we didn't "clone", the loop variable "student" would grow, and since it belongs to the array, the memory it uses would never be released as long as the array object exists.
A different approach works too:
students.each do |student|
  loop_student = Student.find(student.id) # just re-find the record into a local variable.
  ...
  loop_student.books.detect {...}
  ca_teachers = loop_student.teachers.detect {|teacher| teacher.address.state == 'CA'}
  ca_teachers.blah_blah
  ...
end
In our production environment we had a background process that once failed to finish because 8 GB of RAM wasn't enough for it. After this small change it uses less than 1 GB to process the same amount of data...
Don't abuse symbols.
Each time you create a symbol, Ruby puts an entry in its symbol table. The symbol table is a global hash which never gets emptied.
This is not technically a memory leak, but it behaves like one. Symbols don't take up much memory, so you don't need to be too paranoid, but it pays to be aware of this.
A general guideline: if you've actually typed the symbol in code, it's fine (you only have a finite amount of code after all), but don't call to_sym on dynamically generated or user-input strings, as this opens the door to a potentially ever-increasing number of symbols.
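For example (the params hash below is a made-up stand-in for user input):

params = { "event" => "purchase" }   # pretend this came from a request

# Fine: literal symbols; you can only type finitely many of them
STATUSES = [:pending, :active, :closed]

# Risky on old rubies: every distinct user string becomes a permanent symbol
# event = params["event"].to_sym

# Safer: whitelist the symbols you accept, or just keep the string
event = %w[click view purchase].include?(params["event"]) ? params["event"].to_sym : :unknown
puts event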
Don't do this:
def method(x)
  x.split( doesn't matter what the args are )
end
or this:
def method(x)
  x.gsub( doesn't matter what the args are )
end
Both will permanently leak memory in ruby 1.8.5 and 1.8.6. (not sure about 1.8.7 as I haven't tried it, but I really hope it's fixed.) The workaround is stupid and involves creating a local variable. You don't have to use the local, just create one...
Things like this are why I have lots of love for the ruby language, but no respect for MRI
Beware of C extensions which allocate large chunks of memory themselves.
As an example, when you load an image using RMagick, the entire bitmap gets loaded into memory inside the ruby process. This may be 30 meg or so depending on the size of the image.
However, most of this memory has been allocated by RMagick itself. All ruby knows about is a wrapper object, which is tiny(1).
Ruby only thinks it's holding onto a tiny amount of memory, so it won't bother running the GC. In actual fact it's holding onto 30 meg.
If you loop over, say, 10 images, you can run yourself out of memory really fast.
The preferred solution is to manually tell the C library to clean up the memory itself - RMagick has a destroy! method which does this. If your library doesn't however, you may need to forcibly run the GC yourself, even though this is generally discouraged.
(1): Ruby C extensions have callbacks which will get run when the ruby runtime decides to free them, so the memory will eventually be successfully freed at some point, just perhaps not soon enough.
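For example, a per-image loop can release each bitmap explicitly; the glob pattern and the process_image helper below are made up for illustration.

require 'rmagick'

Dir.glob('photos/*.jpg').each do |path|
  img = Magick::Image.read(path).first   # Image.read returns an array of frames
  process_image(img)                     # placeholder for your own work
  img.destroy!                           # free the C-allocated bitmap immediately
end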
Measure and detect which parts of your code are creating objects that cause memory usage to go up. Improve and modify your code, then measure again. Sometimes you're using gems or libraries that use up a lot of memory and create a lot of objects as well.
There are many tools out there such as busy-administrator that allow you to check the memory size of objects (including those inside hashes and arrays).
$ gem install busy-administrator
Example # 1: MemorySize.of
require 'busy-administrator'
data = BusyAdministrator::ExampleGenerator.generate_string_with_specified_memory_size(10.mebibytes)
puts BusyAdministrator::MemorySize.of(data)
# => 10 MiB
Example # 2: MemoryUtils.profile
Code
require 'busy-administrator'
results = BusyAdministrator::MemoryUtils.profile(gc_enabled: false) do |analyzer|
BusyAdministrator::ExampleGenerator.generate_string_with_specified_memory_size(10.mebibytes)
end
BusyAdministrator::Display.debug(results)
Output:
{
memory_usage:
{
before: 12 MiB
after: 22 MiB
diff: 10 MiB
}
total_time: 0.406452
gc:
{
count: 0
enabled: false
}
specific:
{
}
object_count: 151
general:
{
String: 10 MiB
Hash: 8 KiB
BusyAdministrator::MemorySize: 0 Bytes
Process::Status: 0 Bytes
IO: 432 Bytes
Array: 326 KiB
Proc: 72 Bytes
RubyVM::Env: 96 Bytes
Time: 176 Bytes
Enumerator: 80 Bytes
}
}
You can also try ruby-prof and memory_profiler. It is better if you test and experiment with different versions of your code so you can measure the memory usage and performance of each version. This will let you check whether your optimization really worked. You usually use these tools in development/testing mode and turn them off in production.
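A minimal memory_profiler run looks like this (the block contents are just a placeholder):

require 'memory_profiler'

report = MemoryProfiler.report do
  100_000.times { "some string" * 2 }   # code under test
end
report.pretty_print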
