Is there a way to spread xz compression effort across multiple CPUs? I realize that this doesn't appear possible with xz itself, but are there other utilities that implement the same compression algorithm and would allow more efficient processor utilization? I will be running this in scripts and utility apps on systems with 16+ processors, and it would be useful to use at least 4-8 processors to potentially speed up compression rates.
Multiprocessor (multithreading) compression support was added to xz in version 5.2, in December 2014.
To enable the functionality, pass the -T option with the number of worker threads to spawn, or -T0 to spawn one worker per CPU the OS reports:
xz -T0 big.tar
xz -T4 bigish.tar
The default single-threaded operation is equivalent to -T1.
I have found that running it with a couple of threads fewer than the total number of hyper-threads on my CPU† provides a good balance of responsiveness and compression speed.
† So -T10 on my 6-core, 12-thread workstation.
As scai and Dzenly pointed out in the comments:
If you want to use this in combination with tar, just call export XZ_DEFAULTS="-T 0" beforehand.
or use something like: XZ_OPT="-2 -T0"
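A minimal sketch of that combination (the archive and directory names here are just placeholders):
export XZ_DEFAULTS="-T0"                       # every xz invocation in this shell now uses all cores
tar -cJf big.tar.xz bigdir/                    # -J routes the archive through xz, which picks up XZ_DEFAULTS
XZ_OPT="-2 -T0" tar -cJf big.tar.xz bigdir/    # or pass the options for a single run only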
Related
I'm looking for a reliable way to quantitatively measure the CPU overhead introduced by my shared library. The library is loaded in the context of a php-fpm process (as a PHP extension). Ultimately, I'd like to run a series of tests for two versions of the .so library, collect stats, and compare them to see to what extent the current version of the library is better or worse than the previous one.
I have already tried a few approaches.
"perf stat" connected to PHP-FPM process and measuring cpu-cycles, instructions, and cpu-time
"perf record" connected to PHP-FPM process and collecting CPU cycles. Then I extract data from the collected perf.data. I consider only data related to my .so file and all-inclusive invocations (itself + related syscalls and kernel). So I can get CPU overhead for .so (inclusively).
valgrind on a few scripts running from CLI measuring "instruction requests".
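For illustration, such a perf stat run looks roughly like this (the pgrep lookup and the 60-second measurement window are placeholders, not an exact command for any particular setup):
perf stat -e cycles,instructions,task-clock -p "$(pgrep -d, php-fpm)" sleep 60    # attach to all php-fpm PIDs for 60 s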
All three options work but don't provide a reliable way to compare overhead due to run-to-run deviations (up to 30% error, which is not acceptable). I tried doing multiple runs and calculating the average, but the accuracy of that is also questionable.
Valgrind is the most accurate of the three, but it provides only instruction counts, which do not reflect actual CPU overhead. Perf is better (it reports cycles, cpu-time, and instructions), but it shows too much variation from run to run.
Has anyone got experience with similar tasks? Could you recommend other approaches or Linux profilers to measure overhead accurately and quantitatively?
Thanks!
I am running 60 MPI processes, and MKL_THREAD_NUM is set to 4 to get me to the full 240 hardware threads on the Xeon Phi. My code is running, but I want to make sure that MKL is actually using 4 threads. What is the best way to check this with the limited Xeon Phi Linux kernel?
You can set MKL_NUM_THREADS to 4 if you like. However, using every single thread does not necessarily give the best performance. In some cases, the MKL library knows things about the algorithm that mean fewer threads are better, and in those cases the library routines can choose to use fewer threads. You should only use 60 MPI ranks if you have 61 cores. If you are going to use that many MPI ranks, you will want to set the I_MPI_PIN_DOMAIN environment variable to "core". Remember to leave one core free for the OS and system-level processes. This will put one rank per core on the coprocessor and allow all the OpenMP threads for each MPI process to reside on the same core, giving you better cache behavior. If you do this, you can also use micsmc in GUI mode on the host processor to continuously monitor the activity on all the cores. With one MPI process per core, you can see how much of the time all threads on a core are being used.
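A sketch of the launch environment described above (the binary name and the mpirun line are illustrative; adapt them to your own job script):
export MKL_NUM_THREADS=4        # MKL threads per MPI rank
export I_MPI_PIN_DOMAIN=core    # one rank per core, with its threads kept on that core
mpirun -n 60 ./my_app           # 60 ranks on a 61-core card, leaving one core for the OS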
Set MKL_NUM_THREADS to 4. You can use an environment variable or a runtime call. This value will be respected, so there is nothing to check.
The Linux kernel on KNC is not stripped down, so I don't know why you think that's a limitation. You should not need any system calls for this anyway, though.
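If you still want to double-check on the card itself, counting the OS threads of one running rank is a rough option (my_app is a placeholder; the count also includes MPI helper threads, so treat it as an upper bound):
pid=$(pgrep -f my_app | head -n 1)
grep Threads /proc/"$pid"/status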
In Make, this flag exists:
-l [load], --load-average[=load]
Specifies that no new jobs (commands) should be started if there are others jobs running and the load average is at least load (a floating-point number). With no argument, removes a previous load limit.
Do you have a good strategy for what value to use for the load limit? It seems to differ a lot between my machines.
Acceptable load depends on the number of CPU cores. If there is one core, then a load average of more than 1 is overload. If there are four cores, then a load average of more than four is overload.
People often just specify the number of cores using the -j switch.
See some empirical numbers here: https://stackoverflow.com/a/17749621/412080
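A common concrete pattern (nproc is from GNU coreutils; pick numbers that suit your machines):
make -j"$(nproc)" -l"$(nproc)"    # at most one job per core, and no new jobs once the load reaches the core count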
I recommend against using the -l option.
In principle, -l seems superior to -j. -j says, start this many jobs. -l says, make sure this many jobs are running. Often, those are almost the same thing, but when you have I/O-bound jobs or other oddities, then -l should be better.
That said, the concept of load average is a bit dubious. It is necessarily a sampling of what goes on in the system. So if you run make -j -l N (for some N) and you have a well-written makefile, then make will immediately start a large number of jobs and run out of file descriptors or memory before even the first sample of the system load can be taken. Also, the accounting of the load average differs across operating systems, and some obscure ones don't have it at all.
In practice, you'll be just as well off using -j and will have fewer headaches. To get more performance out of the build, tune your makefiles, play with compiler options, and use ccache or similar.
(I suspect the original reason for the -l option stems from a time when multiple processors were rare and I/O was really slow.)
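For what it's worth, assuming your makefile respects the usual CC/CXX variables, combining ccache with a plain -j can be as simple as:
make -j"$(nproc)" CC="ccache gcc" CXX="ccache g++"    # cached recompiles plus one job per core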
I have a more general question regarding timing on a standard Linux OS when playing sound and receiving data over a serial port.
At the moment, I'm reading a PCM signal arriving over a USB-to-serial bridge (pl2303), which is recorded, encoded, and sent from an FPGA.
Now I need to create "peaks" at a known position in the recorded sound stream, and I plan to play a sound file from the same machine that is recording, at a known moment. The peak has to begin and stop inside windows of at most 50 ms; its length could be ~200 ms...
Now, my question is: how precise can I expect the timing to be? I know that several components add unknown lag and jitter:
The USB-to-serial bridge collects ~20 bytes from the serial side before sending them to the USB side (at 230400 baud this results in ~1 ms).
If I call "`sleep 1; mpg123 $MP3FILE` &" directly before starting my recording software, the Linux kernel will schedule them differently (maybe this causes a few tens of ms, depending on system load?).
The sound card/driver will maybe add some more unknown lag...
Will tricks like "nice" or "sched_setscheduler" add value in my case?
I could build an additional thread inside my recording software that plays the sound. Doing this, the timing might be more precise, but it is a lot more work...
Thanks a lot.
I will try it anyway, but I'm looking for some background thoughts to understand and solve my upcoming problems better.
I am not 100% sure, but I would imagine that your kernel would need to be rebuilt to allow the scheduler to reduce the latency of task switching. In the 2.6.x kernel series, there is an option to make the kernel smoother by making it preemptible:
Go to Processor Type and features
Pre-emption Model
Select Preemptible kernel (low latency desktop)
This should streamline the timing and make the sound appear smoother as a result of less jitter.
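Before recompiling, you can check whether the running kernel was already built with preemption (the config path below is the usual Debian/Ubuntu location and may differ elsewhere):
grep -i 'CONFIG_PREEMPT' /boot/config-"$(uname -r)"
uname -v    # many distributions also print PREEMPT here when it is enabled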
Try that and recompile the kernel. There are, of course, plenty of kernel patches that reduce the timeslice for each task switch to make it even smoother; your mileage may vary depending on:
Processor speed - what processor is used?
Memory - how much RAM?
Disk input/output - the faster, the merrier
Those three factors combined influence the scheduler and the multitasking behaviour. The lower the latency, the more fine-grained it is.
Incidentally, there is a specialized Linux distribution tailored to capturing sound in real time. I cannot remember its name, but the kernel in that distribution was heavily patched to make sound capture very smooth.
It's me again... After one restless night, I solved my strange timing problems. My first edit was not completely correct, since what I posted was not 100% reproducible. After running some more tests, I can come up with the following plot, showing timing accuracy:
Results from analysis http://mega2000.de/~mzenzes/pics4web/2010-05-28_13-37-08_timingexperiment.png
I tried two different Ubuntu kernels: 2.6.32-21-generic and 2.6.32-10-rt
I tried to achieve RT-scheduling: sudo chrt --fifo 99 ./experimenter.sh
And I tried to change the power-saving options (one governor per run): echo {performance,conservative} | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
This resulted in 8 different tests, with 50 runs each. Here are the numbers:
mean(peakPos) std(peakPos)
rt-kernel-fifo99-ondemand 0.97 0.0212
rt-kernel-fifo99-performance 0.99 0.0040
rt-kernel-ondemand 0.91 0.1423
rt-kernel-performance 0.96 0.0078
standard-kernel-fifo99-ondemand 0.68 0.0177
standard-kernel-fifo99-performance 0.72 0.0142
standard-kernel-ondemand 0.69 0.0749
standard-kernel-performance 0.69 0.0147
In a test I'm building here, my goal is to create a parser. So I've built a proof of concept that reads all messages from a file, and after pushing all of them to memory, I spawn one process to parse each message. Up to that point everything is fine, and I've got some nice results. But I can see that the Erlang VM is not using all my processor power (I have a quad core); in fact it is using about 25% of my processor when running my test. I've made a counter-test using C++ that uses four threads and obviously uses 100%, thus producing a better result (I've respected the same queue model Erlang uses).
So I'm wondering what could be "slowing down" my Erlang test. I know it's not a serialization issue, as I'm spawning one process per message. One thing I've thought of is that maybe my messages are too small (about 10k each), so creating that many processes doesn't help achieve great performance.
Some facts about the test:
106k messages
On Erlang (25% processor power used) - 204 msecs
On my C++ test (100% processor power used) - 80 msecs
Yes, the difference isn't that great, but if there is more power available, certainly there is more room for improvement, right?
Ah, I've done some profiling and wasn't able to find another way to optimize, since there are few function calls and most of them are string-to-object conversions.
Update:
Woooow! Following Hassan Syed's idea, I've managed to achieve 35 msecs against 80 from C++! This is awesome!
It seems your Erlang VM is using only one core.
Try starting it like this:
erl -smp enable +S 4
The -smp enable flag tells Erlang to start the runtime system with SMP support enabled
With +S 4 you start 4 Erlang schedulers (1 for each core)
You can see if you have SMP enabled when you start the shell:
Erlang R13B01 (erts-5.7.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]
Eshell V5.7.2 (abort with ^G)
1>
[smp:2:2] tells you that it is running with SMP enabled, with 2 schedulers configured and 2 schedulers online.
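You can also query this non-interactively; a sketch using the same flags as above (it prints the number of schedulers actually online and exits):
erl -smp enable +S 4 -noshell -eval 'io:format("~p~n", [erlang:system_info(schedulers_online)]), init:stop().'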
If you have one source file and you spawn one process per "expression", you really do not understand when to parallelise. It costs FAR more to spawn a process and have it process one expression than to just have a single process handle the entire file. A suitable strategy would be one process per file rather than one process per expression.
Another alternative strategy would be to split the file into two, three, or x chunks, and process those chunks. This of course assumes the source isn't linearly dependent, and each chunk's processing time needs to exceed the time to create and spawn a process (usually by far, because time wasted in process X is time taken away from the rest of the machine).
-- Discussion C++ vs Erlang and your findings --
Erlang has a user-space kernel that emulates a lot of the primitives of the OS kernel, especially the scheduler and blocking primitives. This means there is some overhead when compared to the same strategy used in a raw procedural language such as C++. You must tune your task partitioning to every entry of the implementation space (CPU/memory/OS/programming language) according to its properties.
You should bind the schedulers to the CPU cores:
erlang:system_flag(scheduler_bind_type, processor_spread).
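As far as I know, the same binding can also be requested at VM startup instead of at runtime:
erl -smp enable +S 4 +sbt ps    # ps corresponds to processor_spread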