Optimizing performance on Erlang processes

In a test I'm building here, my goal is to create a parser. So I've built a proof of concept that reads all messages from a file, and after pushing all of them into memory I spawn one process to parse each message. Up to that point everything is fine, and I've got some nice results. But I could see that the Erlang VM is not using all my processor power (I have a quad core); in fact it is using about 25% of my processor when running my test. I've made a counter-test in C++ that uses four threads and obviously uses 100%, thus producing a better result (I've respected the same queue model Erlang uses).
So I'm wondering what could be "slowing" my Erlang test? I know it's not a serialization issue, as I'm spawning one process per message. One thing I've thought is that maybe my messages are too small (about 10k each), so spawning that many processes isn't helping performance.
Some facts about the test:
106k messages
On Erlang (25% of processor power used): 204 ms
On my C++ test (100% of processor power used): 80 ms
Yes, the difference isn't that great, but if there is more power available there is certainly more room for improvement, right?
Ah, I've done some profiling and wasn't able to find another way to optimize, since there are few function calls and most of them are string-to-object conversion.
Update:
Woooow! Following Hassan Syed's idea, I've managed to achieve 35 ms against 80 ms from C++! This is awesome!

It seems your Erlang VM is using only one core.
Try starting it like this:
erl -smp enable +S 4
The -smp enable flag tells Erlang to start the runtime system with SMP support enabled.
With +S 4 you start 4 Erlang schedulers (one for each core).
You can see if you have SMP enabled when you start the shell:
Erlang R13B01 (erts-5.7.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]
Eshell V5.7.2 (abort with ^G)
1>
[smp:2:2] tells you it is running with SMP enabled, with 2 schedulers and 2 schedulers online.

If you have one source file and you spawn one process per "expression", you really do not understand when to parallelise. It costs FAR more to spawn a process and process an expression than to just have one process handle an entire file. A suitable strategy would be to have one process per file rather than one process per expression.
Another alternative strategy would be to split the file into two, three or x chunks, and process those chunks. This of course assumes the source isn't linearly dependent, and the chunks' processing time needs to exceed the time to create and spawn a process (usually by far, because time wasted in process X is time taken away from the rest of the machine).
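The chunking approach might look roughly like the following. This is a minimal sketch in Python (chosen only because it already appears elsewhere on this page), assuming the messages are already in memory; parse_message and the worker count are hypothetical stand-ins for the real parser.

import multiprocessing

def parse_message(raw):
    # Hypothetical stand-in for the real string-to-object conversion.
    return len(raw)

def parse_chunk(messages):
    # One worker handles a whole chunk, so process-creation cost is paid
    # once per chunk instead of once per ~10k message.
    return [parse_message(m) for m in messages]

def parse_all(messages, workers=4):
    # Split the in-memory message list into one chunk per worker/core.
    chunks = [messages[i::workers] for i in range(workers)]
    with multiprocessing.Pool(workers) as pool:
        results = pool.map(parse_chunk, chunks)
    return [r for chunk in results for r in chunk]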
-- Discussion C++ vs Erlang and your findings --
Erlang has a user-space kernel that emulates a lot of the primitives of the OS kernel, especially the scheduler and blocking primitives. This means that there is some overhead when comparing the same strategy used in a raw procedural language such as C++. You must tune your task partitioning to every entry in the implementation space (CPU/memory/OS/programming language) according to its properties.

You should bind the schedulers to the CPU cores:
erlang:system_flag(scheduler_bind_type, processor_spread).

Related

Using a precompiled version of Vowpal Wabbit - Downsides?

Due to the difficulty of compiling VW on a RHEL machine, I am opting to use a compiled version of VW provided by Ariel Faigon (thank you!) here. I'm calling VW from Python, so I am planning on using Python's subprocess module (I couldn't get the Python package to compile either). I am wondering if there would be any downsides to this approach. Would I see any performance lags?
Thank you so much for your help!
Feeding a live vowpal wabbit process via Python's subprocess is OK (fast), as long as you don't start a new process per example and you avoid excessive context switches. In my experience, in this setup, you can expect a throughput of ~500k features per second on typical dual-core hardware. This is not as fast as the (10x faster) ~5M features/sec vw typically processes when not interacting with any other software (reading from file/cache), but it is good enough for most practical purposes. Note that the bottleneck in this setting would most likely be the processing by the additional process, not vowpal-wabbit itself.
It is recommended to feed vowpal-wabbit in batches (N examples at a time, instead of one at a time), both on input (feeding vw) and on output (reading vw responses). If you're using subprocess.Popen to connect to the process, make sure to pass a large bufsize, otherwise by default the Popen iterator would be line-buffered (one example at a time), which might result in a per-example context switch between the producer of examples and the consumer (vowpal wabbit).
Assuming your vw command line is in vw_cmd, it would be something like:
vw_proc = subprocess.Popen(vw_cmd,
                           stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                           bufsize=1048576)
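To make the batching advice concrete, here is a rough sketch of feeding and reading in batches. It is not from the original answer: it assumes stdin is also opened as a pipe, that vw_cmd makes vw write one prediction line per example to stdout (e.g. via -p /dev/stdout), and the example strings themselves are made up.

import subprocess

vw_proc = subprocess.Popen(vw_cmd,
                           stdin=subprocess.PIPE,      # assumption: examples are fed on stdin
                           stdout=subprocess.PIPE,     # assumption: vw_cmd routes predictions to stdout
                           stderr=subprocess.DEVNULL,  # keep progress output off the prediction pipe
                           bufsize=1048576)

def score_batch(examples):
    # Write the whole batch and flush once, then read one prediction per
    # example, instead of paying a context switch for every single example.
    vw_proc.stdin.write(''.join(examples).encode())
    vw_proc.stdin.flush()
    return [vw_proc.stdout.readline().decode().strip() for _ in examples]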
Generally, slowness can come from:
Too many context switches (generating and processing one example at a time)
Too much processing outside vw (e.g. generating the examples in the first place, feature transformation)
Startup overhead (e.g. reading the model) per example.
So avoiding all the above pitfalls should give you the fastest throughput possible under the circumstances of having to interact with additional processes.

Ocaml parallelize builds

Let's say I want to build a program that compiles OCaml source code in parallel. I actually want to understand how to accomplish this with OCaml today. So given the current state of OCaml, how do I parallelize parts of my program?
Should I just spawn new processes with the Unix module?
In the case of parallel compilation, does this have any overhead/performance impact?
To compile files you have to run the OCaml compiler, one invocation for each file. So yes, you have to start new processes, and the Unix module has the necessary functionality for that.
As for overhead or performance impacts consider this:
1) You need to start one process per file you compile. Whether you do that sequentially or in parallel, the number of processes started remains the same.
2) Compiling a file takes a long time compared to the bookkeeping needed to start each compile, even if you consider parallel compilations.
So are you worried about the overhead of starting and tracking multiple compiler processes in parallel? Don't be. That's less than 0.1% of your time. On the other hand, utilizing 2 CPU cores by running 2 compilers will basically double your speed.
Or are you worried about multiple compilers running in parallel? You need twice the RAM, and on most modern CPUs caches are shared, so cache performance will suffer too to some extent. But unless you are working on some embedded system that is absolutely memory-starved, that won't be a problem. So again, the benefits of using multiple cores far outweigh the drawbacks.
Just follow the simple rule for running things in parallel: run one job per CPU core, with maybe one extra to smooth out IO waits. More jobs will just have them fighting for CPU time without any benefit.
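The question is about OCaml's Unix module, but the "one job per core" rule can be sketched in a few lines of Python (the language already used elsewhere on this page); the source list and compiler invocation here are hypothetical, and module dependency order is ignored.

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def compile_file(src):
    # Each job is a separate compiler process; the threads here only wait on it.
    return subprocess.run(['ocamlopt', '-c', src]).returncode

def compile_all(sources):
    # One job per CPU core, plus one extra to smooth out IO waits.
    jobs = (os.cpu_count() or 2) + 1
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(compile_file, sources))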

How to get concurrent function (pmap) to use all cores in Elixir?

I'm new to Elixir, and I'm starting to read through Dave Thomas's excellent Programming Elixir. I was curious how far I could take the concurrency of the "pmap" function, so I iteratively boosted the number of items to square from 1,000 to 10,000,000. Out of curiosity, I watched the output of htop as I did so, usually peaking out with CPU usage similar to that shown below:
After showing the example in the book, Dave says:
And, yes, I just kicked off 1,000 background processes, and I used all the cores and processors on my machine.
My question is, how come on my machine only cores 1, 3, 5, and 7 are lighting up? My guess would be that it has to do with my iex process being only a single OS-level process and OS X managing the reach of that process. Is that what's going on here? Is there some way to ensure all cores get utilized for performance-intensive tasks?
Great comment by @Thiago Silveira about the first line of iex's output. The [smp:8:8] part says how many operating-system-level processes Erlang is using. You can control this with the --smp flag if you want to disable it:
iex --erl '-smp disable'
This will ensure that you have only one system process. You can achieve a similar result by leaving symmetric multiprocessing enabled, but setting NumberOfSchedulers:NumberOfSchedulersOnline directly:
iex --erl '+S 1:1'
Each operating system process needs to have its own scheduler for Erlang processes, so you can easily see how many of them you currently have:
:erlang.system_info(:schedulers_online)
To answer your question about performance: if your processors are not working at full capacity (100%) and none of them is idle (0%), then it is probable that making the load more evenly distributed will not speed things up. Why?
CPU usage is measured by probing the processor state at many points in time. These states are either "working" or "idle". 82% CPU usage means that you can perform a couple more tasks on this CPU without slowing other tasks.
Erlang schedulers try to be smart and not migrate Erlang processes between cores unless they have to, because it requires copying. The migration occurs, for example, when one of the schedulers is idle. It can then borrow a process from another scheduler's run queue.
The next thing that may cause such a big discrepancy between odd and even cores is Hyper-Threading. On my dual-core processor, htop shows 4 logical cores. In your case you probably have 4 physical cores and 8 logical ones because of HT. It might be the case that you are utilizing your physical cores at 100%.
Another thing: pmap needs to calculate each result in a separate process, but at the end it sends it back to the caller, which may be a bottleneck. The more messages you send, the less CPU utilization you can achieve. For fun, you can try giving the processes a task that is really CPU-intensive, like calculating the Ackermann function. You can even calculate how much of your job is sequential and how much is parallel using Amdahl's law, by measuring execution times for different numbers of cores.
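As a hedged illustration of that last suggestion, Amdahl's law is simple enough to play with in a couple of lines of Python (the 0.8 and 8 below are made-up numbers):

def amdahl_speedup(p, n):
    # Speedup with parallel fraction p on n cores: 1 / ((1 - p) + p / n)
    return 1.0 / ((1.0 - p) + p / n)

def parallel_fraction(speedup, n):
    # Invert the formula: estimate p from a measured speedup on n cores.
    return (1.0 / speedup - 1.0) / (1.0 / n - 1.0)

print(amdahl_speedup(0.8, 8))          # ~3.33x, not 8x, if only 80% parallelizes
print(parallel_fraction(3.33, 8))      # recovers roughly 0.8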
To sum up: the CPU utilization from the screenshot looks really great! You don't have to change anything for more performance-intensive tasks.
Concurrency is not Parallelism
In order to get good parallel performance out of Elixir/BEAM coding you need to have some understanding of how the BEAM scheduler works.
This is a very simplistic model, but the BEAM scheduler gives each process 2000 reductions before it swaps out the process for the next process. Reductions can be thought of as function calls. By default a process runs on the core/scheduler that spawned it. Processes only get moved between schedulers if the queue of outstanding processes builds up on a given scheduler. By default the BEAM runs a scheduling thread on each available core.
What this implies is that in order to get the most use of the processors you need to break up your tasks into large enough pieces of work that will exceed the standard "reduction" slice of work. In general, pmap style parallelism only gives significant speedup when you chunk many items into a single task.
The other thing to be aware of is that some parts of the BEAM use a spin/wait loop when awaiting work, and that can skew usage when you use a tool like htop to examine CPU usage. You'll get a much better understanding of your program's performance by using :observer.

MPI on a single dual-core machine

What happens if I run an MPI program which requires 3 nodes (i.e. mpiexec -np 3 ./Program) on a single machine which has 2 CPUs?
This depends on your MPI implementation, of course. Most likely, it will create three processes, and use shared memory to exchange the messages. This will work just fine: the operating system will dispatch the two CPUs across the three processes, and always execute one of the ready processes. If a process waits to receive a message, it will block, and the operating system will schedule one of the other two processes to run - one of which will be the one that is sending the message.
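A tiny illustration, assuming mpi4py (the question doesn't say which language is being used): three ranks exchanging messages run fine even when only two cores exist, because the blocking receive simply yields the CPU to whichever rank is ready.

# Run with: mpiexec -np 3 python ping.py   (also fine on a 2-CPU machine)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Rank 0 blocks in recv; the OS schedules the other ranks meanwhile.
    print('rank 0 received:', comm.recv(source=1))
    print('rank 0 received:', comm.recv(source=2))
else:
    comm.send('hello from rank %d' % rank, dest=0)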
Martin has given the right answer and I've plus-1ed him, but I just want to add a few subtleties which are a little too long to fit into the comment box.
There's nothing wrong with having more processes than cores, of course; you probably have dozens running on your machine well before you run any MPI program. You can try it with any command-line executable you have sitting around, something like mpirun -np 24 hostname or mpirun -np 17 ls on a Linux box, and you'll get 24 copies of your hostname, or 17 (probably interleaved) directory listings, and everything runs fine.
In MPI, using more processes than cores is generally called 'oversubscribing'. The fact that it has a special name already suggests that it's a special case. The sorts of programs written with MPI typically perform best when each process has its own core. There are situations where this need not be the case, but it's (by far) the usual one. And for this reason, for instance, OpenMPI has optimized for the usual case -- it just makes the strong assumption that every process has its own core, and so is very aggressive in using the CPU to poll to see if a message has come in yet (since it figures it's not doing anything else crucial). That's not a problem, and it can easily be turned off if OpenMPI knows it's being oversubscribed ( http://www.open-mpi.org/faq/?category=running#oversubscribing ). It's a design decision, and one which improves the performance of the vast majority of cases.
For historical reasons I'm more familiar with OpenMPI than MPICH2, but my understanding is that MPICH2's defaults are more forgiving of the oversubscribed case -- but I think even there it's possible to turn on more aggressive busy-waiting.
Anyway, this is a long way of saying that yes, what you're doing is perfectly fine, and if you see any weird problems when you switch MPIs, or even versions of MPIs, do a quick search to see if there are any parameters that need to be tweaked for this case.

Why does my Perl script to decompress files run slower when I use threads?

So I'm running perl 5.10 on a Core 2 Duo MacBook Pro, compiled with threading support: usethreads=define, useithreads=define. I've got a simple script to read 4 gzipped files containing around 750,000 lines each. I'm using Compress::Zlib to do the uncompressing and reading of the files. I've got 2 implementations, the only difference between them being that one includes use threads. Other than that, both scripts run the same subroutine to do the reading. Hence, in pseudocode, the non-threading program does this:
read_gzipped(file1);
read_gzipped(file2);
read_gzipped(file3);
read_gzipped(file4);
The threaded version goes like this:
my $thr0 = threads->new(\&read_gzipped, 'file1');
my $thr1 = threads->new(\&read_gzipped, 'file2');
my $thr2 = threads->new(\&read_gzipped, 'file3');
my $thr3 = threads->new(\&read_gzipped, 'file4');
$thr0->join();
$thr1->join();
$thr2->join();
$thr3->join();
Now the threaded version is actually running almost 2 times slower than the non-threaded script. This obviously was not the result I was hoping for. Can anyone explain what I'm doing wrong here?
You're using threads to try and speed up something that's IO-bound, not CPU-bound. That just introduces more IO contention, which slows down the script.
My guess is the bottleneck for GZIP operations is disk access. If you have four threads competing for disk access on a platter hard disk, that slows things down considerably. The disk head will have to move between different files in rapid succession. If you just process one file at a time, the head can stay near that file, and the disk cache will be more effective.
ithreads work well if you're dealing with something which is mostly not CPU-bound. Decompression is CPU-bound.
You can easily alleviate the problem by using the Parallel::ForkManager module.
Generally, threads in Perl are not really good.
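The same fork-per-file idea, sketched in Python with multiprocessing rather than Parallel::ForkManager (the file names and the line-counting body are stand-ins for the real read_gzipped work):

import gzip
from multiprocessing import Pool

def read_gzipped(path):
    # The CPU-bound inflate work runs in its own OS process, one per file.
    with gzip.open(path, 'rt') as fh:
        return sum(1 for _ in fh)

if __name__ == '__main__':
    files = ['file1.gz', 'file2.gz', 'file3.gz', 'file4.gz']  # hypothetical paths
    with Pool(processes=4) as pool:
        print(pool.map(read_gzipped, files))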
I'm not prepared to assume that you're I/O bound without seeing the output of top while this is running. Like depesz, I tend to assume that compression/decompression operations (which are math-heavy) are more likely to be CPU-bound.
When you're dealing with a CPU-bound operation, using more threads/processes than you have processors will almost never[1] improve matters - if the CPU utilization is already at 100%, more threads/processes won't magically increase its capacity - and will most likely make things worse by adding in more context-switching overhead.
[1] I've heard it suggested that heavy compilations, such as building a new kernel, benefit from telling make to use twice as many processes as the machine has processors and my personal experience has been that this seems to be accurate. The explanation I've heard for it is that this allows each CPU to be kept busy compiling in one process while the other process is waiting for data to be fetched from main memory. If you view compiling as a CPU-bound process, this is an exception to the normal rule. If you view it as an I/O bound case (where the I/O is between the CPU and main memory rather than disk/network/user I/O), it is not.
