Julia: serialize error when sending large objects to workers - parallel-processing

I am trying to send data from the master process to worker processes. I am able to do so just fine with relatively small pieces of data. But, as soon as they get above a certain size, I encounter a serialize error.
Is there a way to resolve this, or would I just need to break my objects down into smaller pieces and then reassemble them on the workers? If so, is there a good way to determine ahead of time the max size that I can send (which I suppose may be dependent upon system variables)? Below is code showing a transfer that works and one that fails. It's possible the sizes might need to be tinkered with to reproduce on other systems.
function sendto(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end
X1 = rand(10^5, 10^3);
X2 = rand(10^6, 10^3);
sendto(2, X1 = X1) ## works fine
sendto(2, X2 = X2)
ERROR: write: invalid argument (EINVAL)
in yieldto at /Applications/Julia-0.4.6.app/Contents/Resources/julia/lib/julia/sys.dylib
in wait at /Applications/Julia-0.4.6.app/Contents/Resources/julia/lib/julia/sys.dylib
in stream_wait at /Applications/Julia-0.4.6.app/Contents/Resources/julia/lib/julia/sys.dylib
in uv_write at stream.jl:962
in buffer_or_write at stream.jl:982
in write at stream.jl:1011
in serialize_array_data at serialize.jl:164
in serialize at serialize.jl:181
in serialize at serialize.jl:127
in serialize at serialize.jl:310
in serialize_any at serialize.jl:422
in send_msg_ at multi.jl:222
in remotecall at multi.jl:726
in sendto at none:3
Note: I have plenty of system memory, even for two copies of the larger object, so the problem isn't in that.

This issue seems to be resolved now with Julia 0.5.
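For versions where the error still occurs, the split-and-reassemble idea raised in the question can be sketched as follows. This is a minimal illustration in Python rather than Julia. The names split_chunks and reassemble and the max_bytes limit are assumptions made for the example; the thread does not say what the actual size limit is.

```python
def split_chunks(payload, max_bytes):
    """Split a bytes payload into pieces of at most max_bytes each."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return [payload[i:i + max_bytes] for i in range(0, len(payload), max_bytes)]


def reassemble(chunks):
    """Join the pieces back together in their original order."""
    return b"".join(chunks)


payload = bytes(range(256)) * 4        # 1024 bytes of sample data
chunks = split_chunks(payload, 300)    # 300 is an arbitrary, assumed limit
restored = reassemble(chunks)
```

Each chunk would be sent as its own message and the receiver would join them once all have arrived.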

Related

Parallel for loop in Julia

I am aware there are a multitude of questions about running parallel for loops in Julia, using @threads, @distributed, and other methods. I have tried to implement the solutions there with no luck. The structure of what I'd like to do is as follows.
for index in list_of_indices
    data = h5read("data_set_$index.h5")
    result = perform_function(data)
    save(result)
end
The data sets are independent, and no part of this loop depends on any other. It seems this should be parallelizable.
I have tried, e.g.,
"@threads for index in list_of_indices..." and I get a segmentation error
"@distributed for index in list_of_indices..." and the code does not actually perform the function on my data.
I assume I'm missing something about how parallel processes work, and any insight would be appreciated.
Here is a MWE:
Assume we have files data_1.h5, data_2.h5, data_3.h5 in our working directory. (I don't know how to make things more self-contained than this because I think the problem is arising from asking multiple threads to read files.)
using Distributed
using HDF5
list = [1,2,3]
Threads.@threads for index in list
    data = h5read("data_$index.h5", "data")
    println(data)
end
The error I get is
signal (11): Segmentation fault
signal (6): Aborted
Allocations: 1587194 (Pool: 1586780; Big: 414); GC: 1
Segmentation fault (core dumped)
As others have noted, there are not enough details here. However, given the information available, the code with the highest chance of working is:
using Distributed
addprocs(4)
@everywhere using HDF5
list = [1,2,3]
@sync @distributed for index in list
    data = h5read("data_$index.h5", "data")
    println(data)
end
The distributed approach separates the processes completely, so there is much less chance of something going wrong (e.g. using a library with a shared resource).

MPI rank is changed after MPI_SENDRECV call [duplicate]

I have some Fortran code that I'm parallelizing with MPI which is doing truly bizarre things. First, there's a variable nstartg that I broadcast from the boss process to all the workers:
call mpi_bcast(nstartg,1,mpi_integer,0,mpi_comm_world,ierr)
The variable nstartg is never altered again in the program. Later on, I have the boss process send eproc elements of an array edge to the workers:
if (me==0) then
   do n=1,ntasks-1
      ! (determine the starting point estart and the number eproc
      !  of values to send)
      call mpi_send(edge(estart),eproc,mpi_integer,n,n,mpi_comm_world,ierr)
   enddo
endif
with a matching receive statement if me is non-zero. (I've left out some other code for readability; there's a good reason I'm not using scatterv.)
Here's where things get weird: the variable nstartg gets altered to n instead of keeping its actual value. For example, on process 1, after the mpi_recv, nstartg = 1, and on process 2 it's equal to 2, and so forth. Moreover, if I change the code above to
call mpi_send(edge(estart),eproc,mpi_integer,n,n+1234567,mpi_comm_world,ierr)
and change the tag accordingly in the matching call to mpi_recv, then on process 1, nstartg = 1234568; on process 2, nstartg = 1234569, etc.
What on earth is going on? All I've changed is the tag that mpi_send/recv are using to identify the message; provided the tags are unique so that the messages don't get mixed up, this shouldn't change anything, and yet it's altering a totally unrelated variable.
On the boss process, nstartg is unaltered, so I can fix this by broadcasting it again, but that's hardly a real solution. Finally, I should mention that compiling and running this code using electric fence hasn't picked up any buffer overflows, nor did -fbounds-check throw anything at me.
The most probable cause is that you pass an INTEGER scalar as the actual status argument to MPI_RECV, when it should really be declared as an array with an implementation-specific size, available as the MPI_STATUS_SIZE constant:
INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status
or
INTEGER status(MPI_STATUS_SIZE)
The message tag is written to one of the status fields by the receive operation (its implementation-specific index is available as the MPI_TAG constant, and the field value can be accessed as status(MPI_TAG)). If your status is simply a scalar INTEGER, the receive writes past it and several other local variables get overwritten. In your case it just so happens that nstartg sits just above status on the stack.
If you do not care about the receive status, you can pass the special constant MPI_STATUS_IGNORE instead.

std::copy runtime_error when working with uint16_t's

I'm looking for input as to why this breaks. See the addendum for contextual information, but I don't really think it is relevant.
I have an std::vector<uint16_t> depth_buffer that is initialized to have 640*480 elements. This means that the total space it takes up is 640*480*sizeof(uint16_t) = 614400 bytes.
The code that breaks:
void Kinect360::DepthCallback(void* _depth, uint32_t timestamp) {
    lock_guard<mutex> depth_data_lock(depth_mutex);
    uint16_t* depth = static_cast<uint16_t*>(_depth);
    std::copy(depth, depth + depthBufferSize(), depth_buffer.begin()); // the error
    new_depth_frame = true;
}
where depthBufferSize() will return 614400 (I've verified this multiple times).
My understanding of std::copy(first, amount, out) is that first specifies the memory address to start copying from, amount is how far in bytes to copy until, and out is the memory address to start copying to.
Of course, it can be done manually with something like
#pragma unroll
for(auto i = 0; i < 640*480; ++i) depth_buffer[i] = depth[i];
instead of the call to std::copy, but I'm really confused as to why std::copy fails here. Any thoughts???
Addendum: the context is that I am writing a derived class that inherits from FreenectDevice to work with a Kinect 360. Officially the error is a Bus Error, but I'm almost certain this is because libfreenect interprets an error in the DepthCallback as a Bus Error. Stepping through with lldb, it's a standard runtime_error being thrown from std::copy. If I manually enter depth + 614400 it will crash, though if I have depth + (640*480) it will chug along. At this stage I am not doing something meaningful with the depth data (rendering the raw depth appropriately with OpenGL is a separate issue xD), so it is hard to tell if everything got copied, or just a portion. That said, I'm almost positive it doesn't grab it all.
Contrasted with the corresponding VideoCallback and the call inside of copy(video, video + videoBufferSize(), video_buffer.begin()), I don't see why the above would crash. If my understanding of std::copy were wrong, this should crash too since videoBufferSize() is going to return 640*480*3*sizeof(uint8_t) = 640*480*3 = 921600. The *3 is from the fact that we have 3 uint8_t's per pixel, RGB (no A). The VideoCallback works swimmingly, as verified with OpenGL (and the fact that it's essentially identical to the samples provided with libfreenect...). FYI none of the samples I have found actually work with the raw depth data directly, all of them colorize the depth and use an std::vector<uint8_t> with RGB channels, which does not suit my needs for this project.
I'm happy to just ignore it and move on in some senses because I can get it to work, but I'm really quite perplexed as to why this breaks. Thanks for any thoughts!
The way std::copy works is that you provide start and end points of your input sequence and the location to begin copying to. The end point that you're providing is off the end of your sequence, because your depthBufferSize function is giving an offset in bytes, rather than the number of elements in your sequence.
If you remove the multiplication by sizeof(uint16_t), it will work. At that point, you might also consider calling std::copy_n instead, which takes the number of elements to copy.
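The byte-count versus element-count distinction can be checked with a small Python sketch. It is not the asker's C++ code: the array module stands in for std::vector<uint16_t>, and copy_depth is a hypothetical helper that refuses the overrun which is undefined behaviour in C++.

```python
from array import array

WIDTH, HEIGHT = 640, 480

# One unsigned 16-bit depth sample per pixel, as in the question.
depth_buffer = array("H", [0] * (WIDTH * HEIGHT))

element_count = len(depth_buffer)                     # number of samples
byte_count = element_count * depth_buffer.itemsize    # storage size in bytes


def copy_depth(src, dst, count):
    """Copy `count` elements (not bytes) from src to dst, refusing to overrun."""
    if count > len(src) or count > len(dst):
        raise ValueError("count exceeds buffer length: %d" % count)
    dst[:count] = src[:count]


# A fake incoming frame with recognisable values.
frame = array("H", [i % 65536 for i in range(WIDTH * HEIGHT)])

copy_depth(frame, depth_buffer, element_count)   # correct: count in elements
```

Passing byte_count instead of element_count asks for more elements than the buffer holds, which is the same mistake as depth + depthBufferSize() in the question.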
Edit: I just realised that I didn't answer the question directly.
Based on my understanding of std::copy, it shouldn't be throwing exceptions with the input you're giving it. The only thing in that code that could throw a runtime_error is the locking of the mutex.
Considering you have undefined behaviour as a result of running off of the end of your buffer, I'm tempted to say that has something to do with it.

Poor performance / lockup with STM

I'm writing a program where a large number of agents listen for events and react to them. Since Control.Concurrent.Chan.dupChan is deprecated I decided to use TChans as advertised.
The performance of TChan is much worse than I expected. I have the following program that illustrates the issue:
{-# LANGUAGE BangPatterns #-}
module Main where

import Control.Concurrent.STM
import Control.Concurrent
import System.Random(randomRIO)
import Control.Monad(forever, when)

allCoords :: [(Int,Int)]
allCoords = [(x,y) | x <- [0..99], y <- [0..99]]

randomCoords :: IO (Int,Int)
randomCoords = do
  x <- randomRIO (0,99)
  y <- randomRIO (0,99)
  return (x,y)

main = do
  chan <- newTChanIO :: IO (TChan ((Int,Int),Int))
  let watcher p = do
        chan' <- atomically $ dupTChan chan
        forkIO $ forever $ do
          r@(p',_counter) <- atomically $ readTChan chan'
          when (p == p') (print r)
        return ()
  mapM_ watcher allCoords
  let go !cnt = do
        xy <- randomCoords
        atomically $ writeTChan chan (xy,cnt)
        go (cnt+1)
  go 1
When compiled (-O) and run the program first will output something like this:
./tchantest
((0,25),341)
((0,33),523)
((0,33),654)
((0,35),196)
((0,48),181)
((0,48),446)
((1,15),676)
((1,50),260)
((1,78),561)
((2,30),622)
((2,38),383)
((2,41),365)
((2,50),596)
((2,57),194)
((3,19),259)
((3,27),344)
((3,33),65)
((3,37),124)
((3,49),109)
((3,72),91)
((3,87),637)
((3,96),14)
((4,0),34)
((4,17),390)
((4,73),381)
((4,74),217)
((4,78),150)
((5,7),476)
((5,27),207)
((5,47),197)
((5,49),543)
((5,53),641)
((5,58),175)
((5,70),497)
((5,88),421)
((5,89),617)
((6,0),15)
((6,4),322)
((6,16),661)
((6,18),405)
((6,30),526)
((6,50),183)
((6,61),528)
((7,0),74)
((7,28),479)
((7,66),418)
((7,72),318)
((7,79),101)
((7,84),462)
((7,98),669)
((8,5),126)
((8,64),113)
((8,77),154)
((8,83),265)
((9,4),253)
((9,26),220)
((9,41),255)
((9,63),51)
((9,64),229)
((9,73),621)
((9,76),384)
((9,92),569)
...
And then, at some point, will stop writing anything, while still consuming 100% cpu.
((20,56),186)
((20,58),558)
((20,68),277)
((20,76),102)
((21,5),396)
((21,7),84)
With -threaded the lockup is even faster and occurs after only a handful of lines. It will also consume whatever number of cores are made available through RTS' -N flag.
Additionally the performance seems rather poor - only about 100 events per second are processed.
Is this a bug in STM or am I misunderstanding something about semantics of STM?
The program is going to perform quite badly. You're spawning off 10,000 threads all of which will queue up waiting for a single TVar to be written to. So once they're all going, you may well get this happening:
Each of the 10,000 threads tries to read from the channel, finds it empty, and adds itself to the wait queue for the underlying TVar. So you'll have 10,000 queue-up events, and 10,000 processes in the wait queue for the TVar.
Something is written to the channel. This will unqueue each of the 10,000 threads and put it back on the run-queue (this may be O(N) or O(1), depending on how the RTS is written).
Each of the 10,000 threads must then process the item to see if it's interested in it, which most won't be.
So each item will cause processing O(10,000). If you see 100 events per second, that means that each thread requires about 1 microsecond to wake up, read a couple of TVars, write to one and queue up again. That doesn't seem so unreasonable. I don't understand why the program would grind to a complete halt, though.
In general, I would scrap this design and replace it as follows:
Have a single thread reading the event channel, which maintains a map from coordinate to interested-receiver-channel. The single thread can then pick out the receiver(s) from the map in O(log N) time (much better than O(N), and with a much smaller constant factor involved), and send the event to just the interested receiver. So you perform just one or two communications to the interested party, rather than 10,000 communications to everyone. A list-based form of the idea is written in CHP in section 5.4 of this paper: http://chplib.files.wordpress.com/2011/05/chp.pdf
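The single-dispatcher design described above can be sketched as follows. This is a minimal Python illustration, not the CHP code from the paper, and run_dispatcher, events, and receivers are names assumed for the example. A Python dict is hash-based, so the lookup here is O(1) on average rather than the O(log N) of a tree map.

```python
import queue
import threading


def run_dispatcher(events, receivers):
    """Forward each (coord, payload) event to the queue registered for coord."""
    while True:
        item = events.get()
        if item is None:            # sentinel: stop dispatching
            break
        coord, payload = item
        target = receivers.get(coord)
        if target is not None:      # drop events nobody is interested in
            target.put((coord, payload))


events = queue.Queue()
# One receiver queue per coordinate; a 3x3 grid keeps the example small.
receivers = {(x, y): queue.Queue() for x in range(3) for y in range(3)}

dispatcher = threading.Thread(target=run_dispatcher, args=(events, receivers))
dispatcher.start()

events.put(((1, 2), 10))
events.put(((0, 0), 11))
events.put(((1, 2), 12))
events.put(None)
dispatcher.join()
```

Only the queue registered for a coordinate is touched per event, so the other receivers are never woken.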
This is a great test case! I think you've actually created a rare instance of genuine livelock/starvation. We can test this by compiling with -eventlog and running with -vst or by compiling with -debug and running with -Ds. We see that even as the program "hangs" the runtime still is working like crazy, jumping between blocked threads.
The high-level reason is that you have one (fast) writer and many (fast) readers. The readers and writer both need to access the same tvar representing the end of the queue. Let's say that nondeterministically one thread succeeds and all others fail when this happens. Now, as we increase the number of threads in contention to 100*100, then the probability of the reader making progress rapidly goes towards zero. In the meantime, the writer in fact takes longer in its access to that tvar than do the readers, so that makes things worse for it.
In this instance, putting a tiny throttle between each invocation of go for the writer (say, threadDelay 100) is enough to fix the problem. It gives the readers enough time to all block between successive writes, and so eliminates the livelock. However, I do think that it would be an interesting problem to improve the behavior of the runtime scheduler to deal with situations like this.
Adding to what Neil said, your code also has a space leak (noticeable with smaller n). After fixing the obvious tuple build-up issue by making tuples strict, I was left with the following profile:
[heap profile image not preserved]
What's happening here, I think, is that the main thread is writing data to the shared TChan faster than the worker threads can read it (TChan, like Chan, is unbounded). So the worker threads spend most of their time reexecuting their respective STM transactions, while the main thread is busy stuffing even more data into the channel; this explains why your program hangs.

Potential Memory Leak in my wxPython App

I am pretty sure I am suffering from a memory leak, but I haven't 100% nailed down how it's happening.
The application I've written downloads 2 images from a URL and queues each set of images, called a transaction, into a queue to be popped off by the user interface and displayed. The images are pretty big, averaging about 2.5MB. So as a way of speeding up the user interface and making it more responsive, I pre-load each transaction's images into wxImage objects and store them.
When the user pops off another transaction, I feed the preloaded image into a window object that then converts the wxImage into a bitmap and DC blits to the window. The window object is then displayed on a panel.
When the transaction is finished by the user, I destroy the window object (presumably the window goes away, as does the bitmap) and the transaction data structure is overwritten with 'None'.
However, depending on how many images I've preloaded, whether the queue size is set large and it's done all at once, or whether I let a small queue size sit over time, it eventually crashes. I really can't let this happen .. :)
Does anyone see any obvious logical errors in what I'm doing? Does Python garbage collect? I don't have much experience with having to deal with memory issues.
[edit] Here is the code ;) This is the code for the thread that downloads the images. It is instantiated in the main thread that runs the GUI; the download thread's main function is 'fill_queue':
def fill_queue(self):
    while True:
        if (self.len() < self.maxqueuesize):
            try:
                trx_data = self.download_transaction_data(self.get_url)
                for trx in trx_data:
                    self.download_transaction_images(trx)
                    if self.valid_images([trx['image_name_1'], trx['image_name_2']]):
                        trx = self.pre_load_images(trx)
                        self.append(trx)
            except IOError, error:
                print "Received IOError while trying to download transactions or images"
                print "Error Received: ", error
            except Exception, ex:
                print "Caught general exception while trying to download transactions or images"
                print "Error Received: ", ex
        else:
            time.sleep(1)

def download_transaction_images(self, data):
    """ Method will download all the available images for the provided transaction """
    for (a, b) in data.items():
        if (b) and (a == "image_name_1" or a == "image_name_2"):
            modified_url = self.images_url + self.path_from_filename(b)
            download_url = modified_url + b
            local_filepath = self.cache_dir + b
            urllib.urlretrieve(download_url, local_filepath)
            urllib.urlcleanup()

def download_transaction_data(self, trx_location):
    """ Method will download transaction data and return a parsed list of hash structures """
    page = urllib.urlopen(trx_location)
    data = page.readlines()
    page.close()
    trx_list = []
    trx_data = {}
    for line in data:
        line = line.rstrip('|!\n')
        if re.search('id=', line):
            fields = re.split('\|', line)
            for jnd in fields:
                pairs = jnd.split('=')
                trx_data[pairs[0]] = pairs[1]
            trx_list.append(trx_data)
    return trx_list

def pre_load_images(self, trx):
    """ Method will create a wxImage and load it into memory to speed the image display """
    path1 = self.cache_dir + trx['image_name_1']
    path2 = self.cache_dir + trx['image_name_2']
    image1 = wx.Image(path1)
    image2 = wx.Image(path2)
    trx['loaded_image_1'] = image1
    trx['loaded_image_2'] = image2
    return trx

def valid_images(self, images):
    """ Method verifies that the image path is valid and image is readable """
    retval = True
    for i in images:
        if re.search('jpg', i) or re.search('jpeg', i):
            imagepath = self.cache_dir + i
            if not os.path.exists(imagepath) or not wx.Image.CanRead(imagepath):
                retval = False
        else:
            retval = False
    return retval
Also, I'd like to add that sometimes, just before the crash I get peculiar errors in my console, they look like corrupt image errors but the images are not corrupted, the error has happened at all stages on all images.
Application transferred too few scanlines
[2009-09-08 11:12:03] Error: JPEG: Couldn't load - file is probably corrupted.
[2009-09-08 11:12:11] Debug: ....\src\msw\dib.cpp(134): 'CreateDIBSection' failed with error 0x00000000 (the operation completed successfully.).
These errors can happen a la carte, or all together. What I think is happening is that at some point the memory becomes corrupted and anything that happens next, if I load a new transaction, or image, or do a cropping operation - it takes a dive.
Unfortunately, after trying the suggestion of moving the pre-loading call to wxImage into the main GUI thread, I am still getting the error. Again, it occurs after too many images have been loaded into memory or if they sit in memory for too long. Then, when I attempt to crop an image, I get a memory error. Something is being corrupted: either, in the former case, I am using too much memory or don't have enough (which makes no sense, because I've increased my paging file size to astronomical proportions), or, in the latter case, the length of time is causing a leak or corruption.
The only way I think I can go at this point is to use a debugger - are there any easy ways to debug a wxPython application? I would like to see the memory usage in particular.
The main reason I think I need to preload the images is that if I call wxImage on each image (I show two at a time) every time I load a 'transaction', moving from one transaction to the next is very slow and clunky. If I load them into memory it's very fast, but then I get my memory error.
Two thoughts:
You do not mention if the downloading is running a separate thread (actually now I see that this is running in a separate thread, I should read more closely). I'm pretty sure that wx.Image is not thread-safe, so if you are instantiating wx.Images in a non-GUI thread, that could lead to trouble like this. (This is almost certainly the issue, most wx classes/objects/functions are not thread-safe).
I've been bitten by nasty IncRef/DecRef bugs in wxPython (due to the underlying C++ bindings) before (mostly associated with wx.Grid and associated classes). While I don't know of any with wx.Image, it wouldn't surprise me to find out you may be required to manually manage memory like you have to in wx.Grid sometimes.
Edit
You need to instantiate the wx.Image in the GUI thread, not the downloading thread (your code above appears to instantiate it in the non-GUI thread). Doing GUI work outside the GUI thread almost always causes lots of problems in any GUI toolkit. You can search the wxPython mailing list for lots of emails where this is the case. Personally I would do this:
Queue for download urls.
Thread to download images.
Have the downloading thread place a disk location (watch out for race conditions!) in a separate queue and post a custom wx.Event to the App thread (posting with the wx.PostEvent function is thread-safe).
Have the GUI thread pop the file locations and instantiate wx.Image, then wx.Bitmap (maybe with wx.CallAfter to process when the App is idle).
Display (Blit) as needed.
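The hand-off in the steps above can be sketched with the standard library alone. This is a minimal sketch, not the asker's application: fetch_to_disk is a hypothetical stand-in for the real download, load_image stands in for wx.Image, and the example.invalid URLs are placeholders. It is written for Python 3, unlike the Python 2 code in the question.

```python
import queue
import threading

url_queue = queue.Queue()      # URLs waiting to be downloaded
ready_queue = queue.Queue()    # local paths of files that are fully written


def fetch_to_disk(url):
    """Hypothetical stand-in for the real download; returns a local path."""
    return "cache/" + url.rsplit("/", 1)[-1]


def download_worker():
    """Runs in the background thread; never touches GUI objects."""
    while True:
        url = url_queue.get()
        if url is None:                    # sentinel: no more work
            break
        ready_queue.put(fetch_to_disk(url))
        # In the real app, notify the GUI here (for example via wx.CallAfter).


def load_ready_images(load_image):
    """Runs in the GUI thread; load_image stands in for wx.Image."""
    loaded = []
    while True:
        try:
            path = ready_queue.get_nowait()
        except queue.Empty:
            return loaded
        loaded.append(load_image(path))


worker = threading.Thread(target=download_worker)
worker.start()
for url in ("http://example.invalid/a.jpg", "http://example.invalid/b.jpg"):
    url_queue.put(url)
url_queue.put(None)
worker.join()

images = load_ready_images(lambda path: ("image", path))
```

Only file paths cross the thread boundary, so every image object is created and destroyed on the GUI thread.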
