Avoiding cPickle in Ipython's parallel

I have some code that I have parallelized successfully in the sense that it gets an answer, but it is still kind of slow. Using cProfile.run(), I found that 121 seconds (57% of total time) were spent in cPickle.dumps despite a per-call time of 0.003 seconds. I don't use this function anywhere else, so it must be coming from IPython's parallel machinery.
The way my code works is it does some serial stuff, then runs many simulations in parallel. Then some serial stuff, then a simulation in parallel. It has to repeat this many, many times. Each simulation requires a very large dictionary that I pull in from a module I wrote. I believe this is what is getting pickled many times and slowing the program down.
Is there a way to push a large dictionary to the engines in such a way that it stays there permanently? I think it's getting physically pushed every time I call the parallel function.
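To make the suspected cost concrete, here is a minimal sketch using only the standard pickle module (Python 3; cPickle is the Python 2 name). It compares the bytes serialized when a large dictionary travels with every task against shipping it once and sending only a small task id afterwards. The dictionary, the task count, and both function names are made up for illustration. In IPython parallel, the usual way to get the "ship once" behaviour is to push the object into the engines' namespace a single time (for example with a DirectView's push) and have the parallel function refer to it by name, but that part is not shown here because it needs a running cluster.

```python
import pickle

# Hypothetical stand-in for the large dictionary described above.
big_dict = {i: float(i) for i in range(100_000)}

def bytes_if_sent_every_call(n_calls):
    """Each task carries the dictionary, so it is pickled on every call."""
    return sum(len(pickle.dumps((big_dict, task_id))) for task_id in range(n_calls))

def bytes_if_sent_once(n_calls):
    """The dictionary is pickled once; later tasks carry only a small id."""
    setup = len(pickle.dumps(big_dict))
    return setup + sum(len(pickle.dumps(task_id)) for task_id in range(n_calls))

per_call = bytes_if_sent_every_call(20)
once = bytes_if_sent_once(20)
print(f"every call: {per_call} bytes, once: {once} bytes")
```

With 20 tasks the first variant serializes roughly 20 times as much data, which is the kind of overhead that would show up as time in dumps.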

Related

Parallel code slower than serial code (value function iteration example)

I'm trying to make my code faster in Julia using parallelization. My code has nested serial for-loops and performs value function iteration (as described in http://www.parallelecon.com/vfi/).
The following link shows the serial and parallelized version of the code I wrote:
https://github.com/minsuc/MyProject/blob/master/VFI_parallel.ipynb (You can find the functions defined in DefinitionPara.jl in the github page too.) Serial code is defined as main() and parallel code is defined as main_paral().
The third for-loop in main() is the step where I find the maximizer given (nCapital, nProductivity). As suggested in the official parallel documentation, I distribute the work over nCapital grid, which consists of many points.
When I run @time on the serial and the parallel code, I get
Serial: 0.001041 seconds
Parallel: 0.004515 seconds
My questions are as follows:
1) I added two workers, and they work for 0.000714 seconds and 0.000640 seconds respectively, as you can see in the IPython notebook. Is the parallel code slower because of overhead costs?
2) I increased the number of grid points by changing
vGridCapital = collect(0.5*capitalSteadyState:0.000001:1.5*capitalSteadyState)
Even though each worker now does a significant amount of work, the serial code is still far faster than the parallel code. When I add more workers, the serial code is still faster. I think something is wrong, but I haven't been able to figure it out. Could it be related to the fact that I pass too many arguments to the parallelized function
final_shared(mValueFunctionNew, mPolicyFunction, pparams, vGridCapital, mOutput, expectedValueFunction)?
I will really appreciate your comments and suggestions!

If the amount of work is really small between synchronizations, the task sync overhead may be too long. Remember that a common OS timeslicing quantum is 10ms, and you are measuring in the 1ms range, so with a bit of load, 4ms latency for getting all work threads synced is perfectly reasonable.
If all tasks access the same shared data structure and that structure is thread-safe, lock contention may well be the culprit, even with longer parallel tasks.
In some cases, it may be possible to use non-thread-safe shared arrays for both input and output, but then it must be ensured that the workers don't clobber each other's results.
Depending on what exactly the work threads are doing, for example if they are outputting to the same array elements, it might be necessary to give each worker its own output array, and merge them together in the end, but that doesn't seem to be the case with your task.
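The "own output array per worker, merged at the end" idea can be sketched as follows. This is in Python rather than Julia, and the function names and the squaring workload are placeholders, not the value function iteration from the question.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    # Each worker fills its own private list, so no locking is needed.
    return [x * x for x in chunk]

def parallel_squares(values, n_workers=2):
    # Split the input into contiguous chunks, one per worker.
    size = max(1, (len(values) + n_workers - 1) // n_workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker, chunks))  # order is preserved
    # Merge the private outputs only once, after all workers are done.
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

print(parallel_squares(list(range(10)), n_workers=3))
```

The only synchronization point is the final merge, so the workers never contend for a shared output structure.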

Runtime of GPU-based simulation unexplainable?

I am developing a GPU-based simulation using OpenGL and GLSL shaders, and I found that performance increases when I add additional (unnecessary) GL commands.
The simulation runs entirely on the GPU without any transfers and basically consists of a loop performing 2500 algorithmically identical time steps. I carefully implemented caching of GLSL uniform locations and removed any GL state requests (glGet* etc.) to maximize speed. To measure wall clock time, I've put a glFinish after the main loop and take the elapsed time afterwards.
CASE A:
Normal total runtime for all iterations is 490ms.
CASE B:
Now, if I add a single additional glGetUniformLocation(...) command at the end of EACH time step, it requires only 475ms in total, which is 3 percent faster. (Please note that this is relevant to me since later I will perform a lot more time steps.)
I've looked at a timeline captured with Nvidia Nsight and found that, in case A, all OpenGL commands are issued within the first 140ms and the glFinish takes 348ms until completion of all GPU work. In case B, the issuing of OpenGL commands is spread out over a significantly longer time (410ms) and the glFinish only takes 64ms, yielding the faster 475ms in total.
I also noticed that the hardware command queue is much fuller of work packets most of the time in case B, whereas in case A there is only one item waiting most of the time (however, there are no visible idle times).
So my questions are:
Why is B faster?
Why are the command packets issued more uniformly to the hardware queue over time in case B?
How can speed be enhanced without adding additional commands?
I am using Visual C++, VS2008 on Win7 x64.

IMHO this question cannot be answered definitively. For what it's worth, I experimentally determined that glFinish (and …SwapBuffers for that matter) have weird runtime behavior. I'm currently developing my own VR rendering library, and prior to that I spent significant time profiling the timelines of OpenGL commands and their interaction with the graphics system. What I found was that the only consistent thing is that glFinish + …SwapBuffers have very inconsistent timing behavior.
What could happen is that this glGetUniformLocation call pulls the OpenGL driver into a "busy" state. If you call glFinish immediately afterwards, it may use a different method of waiting for the GPU (for example, it may spin in a while loop waiting for a flag) than if you just call glFinish (in which case it may, for example, wait for a signal or a condition variable and thus be subject to the kernel's scheduling behavior).
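The two waiting strategies mentioned above can be illustrated with a small Python analogy. This says nothing about what any OpenGL driver actually does internally; fake_gpu_work and the 10ms delay are invented stand-ins for the GPU draining its queue.

```python
import threading
import time

def spin_wait(state, timeout_s=2.0):
    """Poll a flag in a tight loop: reacts quickly, but burns CPU while waiting."""
    deadline = time.monotonic() + timeout_s
    while not state["done"]:
        if time.monotonic() > deadline:
            return False
    return True

def blocking_wait(event, timeout_s=2.0):
    """Sleep until signalled: cheap, but wake-up time depends on the OS scheduler."""
    return event.wait(timeout_s)

def fake_gpu_work(state, event):
    time.sleep(0.01)        # hypothetical stand-in for the GPU finishing its work
    state["done"] = True    # what the spinning waiter watches
    event.set()             # what the blocking waiter is woken by

state, event = {"done": False}, threading.Event()
worker = threading.Thread(target=fake_gpu_work, args=(state, event))
worker.start()
spin_ok = spin_wait(state)
block_ok = blocking_wait(event)
worker.join()
print(spin_ok, block_ok)
```

A spinning waiter notices completion almost immediately, while a blocked waiter has to be rescheduled first, which is one plausible source of the timing differences described.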

Would threading be beneficial for this situation?

I have a CSV file with over 1 million rows. I also have a database that contains such data in a formatted way.
I want to check and verify the data in the CSV file and the data in the database.
Is it beneficial/reduces time to thread reading from the CSV file and use a connection pool to the database?
How well does Ruby handle threading?
I am using MongoDB, also.

It's hard to say without knowing more about how you want the app to feel when someone initiates this comparison. So, to answer, here is some general advice that should apply fairly well regardless of the problem you might want to thread.
Threading does NOT make something computationally less costly
Threading doesn't make things less costly in terms of computation time. It just lets two things happen in parallel. So, beware that you're not falling into the common misconception that, "Threading makes my app faster because the user doesn't wait for things." - this isn't true, and threading actually adds quite a bit of complexity.
So, if you kick off this DB vs. CSV comparison task, threading isn't going to make that comparison take any less time. What it might do is allow you to tell the user, "Ok, I'm going to check that for you," right away, while doing the comparison in a separate thread of execution. You still have to figure out how to get back to the user when the comparison is done.
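A minimal sketch of that pattern, written in Python rather than Ruby: acknowledge the request, run the comparison on a background thread, and use a completion callback to get back to the user. The CSV text, the DB_ROWS dictionary (standing in for the database), and the notifications list (standing in for whatever channel reaches the user) are all hypothetical.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

CSV_TEXT = "id,name\n1,ann\n2,bob\n3,cy\n"
DB_ROWS = {"1": "ann", "2": "bob", "3": "cyd"}  # hypothetical stand-in for the database

def compare(csv_text, db_rows):
    """Return the ids whose CSV value differs from the stored value."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if db_rows.get(row["id"]) != row["name"]:
            mismatches.append(row["id"])
    return mismatches

notifications = []  # stand-in for messages shown to the user

def notify(future):
    # Runs when the background comparison has finished.
    notifications.append(f"done: {len(future.result())} mismatch(es)")

notifications.append("ok, checking that for you")  # immediate acknowledgement
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(compare, CSV_TEXT, DB_ROWS)
    future.add_done_callback(notify)

result = future.result()
print(notifications, result)
```

The comparison itself takes just as long as it would inline; only the acknowledgement arrives sooner.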
Think about WHY you want to thread, rather than simply approaching it as whether threading is a good solution for long tasks
Like I said above, threading doesn't make things faster. At best, it uses computing resources in a way that is either more efficient, or gives a better user experience, or both.
If the user of the app (maybe it's just you) doesn't mind waiting for the comparison to run, then don't add threading because you're just going to add complexity and it won't be any faster. If this comparison takes a long time and you'd rather "do it in the background" then threading might be an answer for you. Just be aware that if you do this you're then adding another concern, which is, how do you update the user when the background job is done?
Threading involves extra overhead and app complexity, which you will then have to manage within your app - tread lightly
There are other concerns as well, such as: how do I schedule that worker thread to make sure it doesn't hog the computing resources? Is setting thread priorities an option in my environment, and if so, how will adjusting them affect the use of computing resources?
Threading and the extra overhead involved will almost definitely make your comparison take LONGER (in terms of absolute time it takes to do the comparison). The real advantage is if you don't care about completion time (the time between when the comparison starts and when it is done) but instead the responsiveness of the app to the user, and/or the total throughput that can be achieved (e.g. the number of simultaneous comparisons you can be running, and as a result the total number of comparisons you can complete within a given time span).
Threading doesn't guarantee that your available CPU cores are used efficiently
See Green Threads vs. native threads - some languages (depending on their threading implementation) can schedule threads across CPUs.
Threading doesn't necessarily mean your threads wind up getting run in multiple physical CPU cores - in fact in many cases they definitely won't. If all your app's threads run on the same physical core, then they aren't truly running in parallel - they are just splitting CPU time in a way that may make them look like they are running in parallel.
For these reasons, depending on the structure of your app, it's often less complicated to send background tasks to a separate worker process (process, not thread), which can easily be scheduled onto available CPU cores at the OS level. Separate processes (as opposed to separate threads) also remove a lot of the scheduling concerns within your app, because you essentially offload the decision about how to schedule things onto the OS itself.
This last point is pretty important. OS schedulers are extremely likely to be smarter and more efficiently designed than whatever algorithm you might come up with in your app.

Elimination of run time variation over repeated executions of the same program

I am trying to design an Online Programming Contest Judge, and one of the things that I need to ensure is that when the same code is compiled (assuming the requirement), given the same input, it should take exactly the same amount of time for the program to execute, each time this is done.
Currently, I am using a simple Python script that has 2 threads, one of which invokes a blocking system call that starts the execution of the test code, and the other keeps track of time and sends a kill signal to the child process after the time limit expires. Incidentally, I am doing this inside a virtual machine for reasons of security and convenience (setting up a proper chroot is way too complicated, and more risky).
However, given identical conditions (i.e., when I restore a snapshot), I still get a variation in execution time of approximately 50ms on either side. As this prevents setting strict time limits, is there any way to eliminate this variation?

I'm not an expert in that field, but I don't think you can do it. Even if you restore the snapshot inside the VM, the state of the "outside" machine is going to be pretty different. You have two OSs running, each one with multiple processes that are probably going to compete for resources at some point. If it's a website or a PC with an internet connection, you can get hit by different numbers of connections (or requests), and that will make processes start running and consuming resources, etc. If some application tries to access the hard disk, the initial position of the physical disk matters a lot for seek time, and so on.
If you want a "deterministic" limit, you might want to check whether you can count how many instructions were executed by a certain process, or something like that.
Anyway, I've participated in several programming contests, and as far as I know, they don't care about 50 ms differences. If you write a proper algorithm, you can get inside the time limit with a really big margin. So I'd advise you to live with it, and just include that in the rules.
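One way to "live with it" is to build the jitter into the limit. The sketch below is an assumption about how a judge might do that, not the asker's script: it uses subprocess.run with a timeout (which kills the child when the timeout expires) and adds a tolerance on top of the nominal limit. The function name, the 50ms default tolerance, and the verdict strings are made up.

```python
import subprocess
import sys

def run_with_limit(cmd, limit_s, tolerance_s=0.05):
    """Run cmd; report a timeout only if it exceeds limit_s plus a jitter margin."""
    try:
        subprocess.run(cmd, timeout=limit_s + tolerance_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return "time limit exceeded"
    return "finished"

# A trivial program finishes well inside a generous limit...
fast = run_with_limit([sys.executable, "-c", "pass"], limit_s=10.0)
# ...while an endless loop is stopped once limit plus tolerance has passed.
slow = run_with_limit([sys.executable, "-c", "while True: pass"], limit_s=0.3)
print(fast, slow)
```

This measures wall-clock time, so it inherits the variation discussed above; the tolerance only keeps borderline runs from being judged unfairly.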

Give all possible resources to a program

I created a program in C# to work with 2.5 million records in Oracle Express (local instance), parse/split those records and create an additional 5 million records.
I added some code to print times on the screen and it seems fairly fast. It is doing all the processing for 1K records every 9 seconds. Which means it takes more than 6 hours to finish.
Now, with Task Manager I can see the program is using 6% of CPU (max) and around 50MB of memory. I understand the OS, and Oracle itself need resources to operate but..... is there a way to tell this little program "hey, it's ok, go ahead and use at least 50% of CPU, there are 4GB of RAM so knock yourself out"?
Note: One of the reasons I'm using a local instance with Oracle Express is to reduce the network bottleneck. Also I might not run this process quite often but I was intrigued to see if this was at all possible.
Please forgive my noobness,
Thanks!

The operating system will give your program all the resources it needs. The reason your process is not consuming all the CPU is probably that it spends more time waiting for the I/O subsystem than using the processor.
If you want to see if you can consume more CPU cycles try writing a program that runs a short infinite loop as fast as possible and you will see the difference in CPU usage.
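The difference can be seen without Task Manager by comparing CPU time for a busy loop against a task that only waits. This is a Python sketch rather than C#, the loop is bounded to 0.2 seconds instead of infinite so the script terminates, and time.sleep is only a stand-in for waiting on the database or disk.

```python
import time

def cpu_seconds(task):
    """CPU time (not wall-clock time) consumed by this process while task runs."""
    start = time.process_time()
    task()
    return time.process_time() - start

def busy_loop(duration_s=0.2):
    # Spin as fast as possible: the CPU is in use for the whole duration.
    end = time.perf_counter() + duration_s
    while time.perf_counter() < end:
        pass

def waiting(duration_s=0.2):
    # Stand-in for blocking on I/O: the process is idle while it waits.
    time.sleep(duration_s)

busy_cpu = cpu_seconds(busy_loop)
idle_cpu = cpu_seconds(waiting)
print(f"busy loop: {busy_cpu:.3f}s CPU, waiting: {idle_cpu:.3f}s CPU")
```

Both tasks take about the same wall-clock time, but only the busy loop accumulates CPU time, which matches a program showing low CPU usage while it waits on Oracle.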

A number of thoughts, not really answers I guess, but.
You could raise the priority of the application's thread; however, it's possible that the code may be less efficient than you think, so...
Have you run a profiler on it?
If it's currently a single-threaded app, you could see whether you can parse the records in batches and run the batches in parallel.
Without knowing much detail about how the records are split: is it possible to hand more of that work off to Oracle? Then the network, or whether the instance is local or not, would matter less.
If your app is drawing or updating a screen or UI, that will almost certainly slow the work down. An example: I ran an app which sorted about 10k emails into around 250k lines in a database. If I added an item to a listbox for each line, the time went from short to ridiculous (I got bored and killed it). So, again, offloading the work to a thread with as few UI updates as possible can help.
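The "as few UI updates as possible" advice can be sketched like this, in Python rather than C#. The report callback stands in for a listbox or progress label, and the per-record work is a placeholder.

```python
def process(records, report, every=1000):
    """Process all records but call report() only once per `every` records."""
    done = 0
    for record in records:
        _ = record.upper()      # placeholder for the real per-record work
        done += 1
        if done % every == 0:
            report(done)        # throttled progress update
    if done % every != 0:
        report(done)            # final update so the last partial batch is shown
    return done

updates = []  # stand-in for the UI: collects each progress message
total = process((f"line {i}" for i in range(10_500)), updates.append, every=1000)
print(total, len(updates))
```

Here 10,500 records produce 11 progress updates instead of 10,500.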
